docs: add use docling (#150)

--------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-12-08 20:58:11 +00:00 · 2024-10-17 18:14:48 +02:00
parent 24f949ada2
commit 61c092f445
10 changed files with 271 additions and 63 deletions
--- a/docs/concepts/docling_document.md
+++ b/docs/concepts/docling_document.md
@@ -1,4 +1,4 @@
-With Docling v2, we introduce a unified document representation format called `DoclingDocument`. It is defined as a 
+With Docling v2, we introduce a unified document representation format called `DoclingDocument`. It is defined as a
 pydantic datatype, which can express several features common to documents, such as:

 * Text, Tables, Pictures, and more
@@ -9,15 +9,16 @@ pydantic datatype, which can express several features common to documents, such

 It also brings a set of document construction APIs to build up a `DoclingDocument` from scratch.

-# Example document structures
+## Example document structures

-To illustrate the features of the `DoclingDocument` format, consider the following side-by-side comparison of a
-`DoclingDocument` converted from `test/data/word_sample.docx`. Left side shows snippets from the converted document 
-serialized as YAML, right side shows the corresponding visual parts in MS Word.
+To illustrate the features of the `DoclingDocument` format, in the subsections below we consider the
+`DoclingDocument` converted from `tests/data/word_sample.docx` and we present some side-by-side comparisons,
+where the left side shows snippets from the converted document
+serialized as YAML and the right one shows the corresponding parts of the original MS Word document.

-## Basic structure
+### Basic structure

-A `DoclingDocument` exposes top-level fields for the document content, organized in two categories. 
+A `DoclingDocument` exposes top-level fields for the document content, organized in two categories.
 The first category is the _content items_, which are stored in these fields:

 - `texts`: All items that have a text representation (paragraph, section heading, equation, ...). Base class is `TextItem`.
@@ -34,32 +35,34 @@ The second category is _content structure_, which is encapsualted in:
 - `furniture`: The root node of a tree-structure for all items that don't belong into the body (headers, footers, ...)
 - `groups`: A set of items that don't represent content, but act as containers for other content items (e.g. a list, a chapter)

-All of the above fields are only storing `NodeItem` instances, which reference children and parents 
-through JSON pointers. 
+All of the above fields are only storing `NodeItem` instances, which reference children and parents
+through JSON pointers.

 The reading order of the document is encapsulated through the `body` tree and the order of _children_ in each item
 in the tree.

-Below example shows how all items in the first page are nested below the `title` item (`#/texts/1`). 
+Below example shows how all items in the first page are nested below the `title` item (`#/texts/1`).

 ![doc_hierarchy_1](../assets/docling_doc_hierarchy_1.png)

-## Grouping
+### Grouping

 Below example shows how all items under the heading "Let's swim" (`#/texts/5`) are nested as children. The children of
-"Let's swim" are both text items and groups, which contain the list elements. The group items are stored in the 
+"Let's swim" are both text items and groups, which contain the list elements. The group items are stored in the
 top-level `groups` field.

 ![doc_hierarchy_2](../assets/docling_doc_hierarchy_2.png)

-## Tables
+<!--
+### Tables

 TBD

-## Pictures
+### Pictures

 TBD

-## Provenance
+### Provenance

-TBD
+TBD
+ -->
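
The `#/texts/1`-style references described in `docling_document.md` above are JSON pointers written as URI fragments. As a minimal sketch of how such a reference could be followed in a serialized document, the snippet below resolves pointers against plain dictionaries and lists. The `resolve_ref` helper, the `$ref` key, and the toy data are assumptions made for illustration; they are not taken from Docling's API or from real converter output.

```python
from typing import Any


def resolve_ref(doc: dict, ref: str) -> Any:
    """Resolve a '#/a/0/b'-style JSON pointer against nested dicts and lists."""
    if not ref.startswith("#/"):
        raise ValueError(f"unsupported reference: {ref!r}")
    node: Any = doc
    for token in ref[2:].split("/"):
        # Undo JSON Pointer escaping (RFC 6901): '~1' -> '/', then '~0' -> '~'
        token = token.replace("~1", "/").replace("~0", "~")
        # List levels are indexed by integer, dict levels by key
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


# Toy data loosely shaped like a serialized document (hypothetical, not Docling output).
toy_doc = {
    "texts": [
        {"self_ref": "#/texts/0", "text": "Intro", "children": []},
        {"self_ref": "#/texts/1", "text": "Title", "children": [{"$ref": "#/texts/0"}]},
    ]
}

title = resolve_ref(toy_doc, "#/texts/1")
first_child = resolve_ref(toy_doc, title["children"][0]["$ref"])
print(title["text"], "->", first_child["text"])
```

Walking the `children` of each node in order with a helper like this is one way to reproduce the reading order that the `body` tree encodes.
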
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -0,0 +1,171 @@
+## Conversion
+
+### Convert a single document
+
+To convert individual PDF documents, use `convert()`, for example:
+
+```python
+from docling.document_converter import DocumentConverter
+
+source = "https://arxiv.org/pdf/2408.09869"  # PDF path or URL
+converter = DocumentConverter()
+result = converter.convert(source)
+print(result.document.export_to_markdown())  # output: "### Docling Technical Report[...]"
+```
+
+### CLI
+
+You can also use Docling directly from your command line to convert individual files (local or by URL) or whole directories.
+
+A simple example would look like this:
+```console
+docling https://arxiv.org/pdf/2206.01062
+```
+
+To see all available options (export formats etc.) run `docling --help`.
+
+<details>
+  <summary><b>CLI reference</b></summary>
+
+  Here are the available options as of this writing (for an up-to-date listing, run `docling --help`):
+
+  ```console
+  $ docling --help
+
+ Usage: docling [OPTIONS] source
+
+╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
+│ *    input_sources      source  PDF files to convert. Can be local file / directory paths or URL. [default: None]         │
+│                                 [required]                                                                                │
+╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
+╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
+│ --from                                     [docx|pptx|html|image|pdf]         Specify input formats to convert from.      │
+│                                                                               Defaults to all formats.                    │
+│                                                                               [default: None]                             │
+│ --to                                       [md|json|text|doctags]             Specify output formats. Defaults to         │
+│                                                                               Markdown.                                   │
+│                                                                               [default: None]                             │
+│ --ocr               --no-ocr                                                  If enabled, the bitmap content will be      │
+│                                                                               processed using OCR.                        │
+│                                                                               [default: ocr]                              │
+│ --ocr-engine                               [easyocr|tesseract_cli|tesseract]  The OCR engine to use. [default: easyocr]   │
+│ --abort-on-error    --no-abort-on-error                                       If enabled, the bitmap content will be      │
+│                                                                               processed using OCR.                        │
+│                                                                               [default: no-abort-on-error]                │
+│ --output                                   PATH                               Output directory where results are saved.   │
+│                                                                               [default: .]                                │
+│ --version                                                                     Show version information.                   │
+│ --help                                                                        Show this message and exit.                 │
+╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
+  ```
+</details>
+
+
+
+### Advanced options
+
+#### Adjust pipeline features
+
+The example file [custom_convert.py](./examples/custom_convert.py) contains multiple ways
+one can adjust the conversion pipeline and features.
+
+
+##### Control PDF table extraction options
+
+You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself.
+This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
+
+
+```python
+from docling.datamodel.base_models import InputFormat
+from docling.document_converter import DocumentConverter, PdfFormatOption
+from docling.datamodel.pipeline_options import PdfPipelineOptions
+
+pipeline_options = PdfPipelineOptions(do_table_structure=True)
+pipeline_options.table_structure_options.do_cell_matching = False  # uses text cells predicted from table structure model
+
+doc_converter = DocumentConverter(
+    format_options={
+        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+    }
+)
+```
+
+Since docling 1.16.0, you can choose which TableFormer mode to use: `TableFormerMode.FAST` (default) or `TableFormerMode.ACCURATE`, which is slower but gives better results on difficult table structures.
+
+```python
+from docling.datamodel.base_models import InputFormat
+from docling.document_converter import DocumentConverter, PdfFormatOption
+from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
+
+pipeline_options = PdfPipelineOptions(do_table_structure=True)
+pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # use more accurate TableFormer model
+
+doc_converter = DocumentConverter(
+    format_options={
+        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+    }
+)
+```
+
+#### Impose limits on the document size
+
+You can limit the file size and the number of pages allowed per document:
+
+```python
+from pathlib import Path
+from docling.document_converter import DocumentConverter
+
+source = "https://arxiv.org/pdf/2408.09869"
+converter = DocumentConverter()
+result = converter.convert(source, max_num_pages=100, max_file_size=20971520)  # 20971520 bytes = 20 MiB
+```
+
+#### Convert from binary PDF streams
+
+You can convert PDFs from a binary stream instead of from the filesystem as follows:
+
+```python
+from io import BytesIO
+from docling.datamodel.base_models import DocumentStream
+from docling.document_converter import DocumentConverter
+
+buf = BytesIO(your_binary_stream)  # your_binary_stream: the raw PDF bytes, supplied by you
+source = DocumentStream(filename="my_doc.pdf", stream=buf)
+converter = DocumentConverter()
+result = converter.convert(source)
+```
+
+#### Limit resource usage
+
+You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. By default, Docling uses 4 CPU threads.
+
+
+## Chunking
+
+You can perform a hierarchy-aware chunking of a Docling document as follows:
+
+```python
+from docling.document_converter import DocumentConverter
+from docling_core.transforms.chunker import HierarchicalChunker
+
+conv_res = DocumentConverter().convert("https://arxiv.org/pdf/2206.01062")
+doc = conv_res.document
+chunks = list(HierarchicalChunker().chunk(doc))
+
+print(chunks[30])
+# {
+#   "text": "Lately, new types of ML models for document-layout analysis have emerged [...]",
+#   "meta": {
+#     "doc_items": [{
+#       "self_ref": "#/texts/40",
+#       "label": "text",
+#       "prov": [{
+#         "page_no": 2,
+#         "bbox": {"l": 317.06, "t": 325.81, "r": 559.18, "b": 239.97, ...},
+#       }]
+#     }],
+#     "headings": ["2 RELATED WORK"],
+#   }
+# }
+```
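
The `max_file_size=20971520` value in the size-limit example of `usage.md` above is 20 MiB expressed in bytes. The sketch below spells out that arithmetic and shows one way a file could be pre-checked against such a limit before conversion. The `within_size_limit` helper and the placeholder file are hypothetical and not part of Docling.

```python
import tempfile
from pathlib import Path

MAX_FILE_SIZE = 20 * 1024 * 1024  # 20 MiB = 20971520 bytes


def within_size_limit(path: Path, max_bytes: int = MAX_FILE_SIZE) -> bool:
    """Return True if the file at `path` is no larger than `max_bytes`."""
    return path.stat().st_size <= max_bytes


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"\0" * 1024)  # 1 KiB placeholder file, not a real PDF
    fits_default = within_size_limit(sample)
    fits_tiny = within_size_limit(sample, max_bytes=512)

print(MAX_FILE_SIZE, fits_default, fits_tiny)
```

Deriving the limit from `20 * 1024 * 1024` rather than hard-coding the number makes the intended size easier to read and adjust.
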
--- a/docs/v2.md
+++ b/docs/v2.md
@@ -2,7 +2,7 @@

 Docling v2 introduces several new features:

- Understands and converts PDF, MS Word, MS Powerpoint, HTML and several image formats 
+- Understands and converts PDF, MS Word, MS Powerpoint, HTML and several image formats
 - Produces a new, universal document representation which can encapsulate document hierarchy
 - Comes with a fresh new API and CLI

@@ -22,7 +22,7 @@ docling myfile.pdf --to json --to md --no-ocr
 docling ./input/dir --from pdf

 # Convert PDF and Word files in input directory to Markdown and JSON
-docling ./input/dir --from pdf --from docx --to md --to json --output ./scratch  
+docling ./input/dir --from pdf --from docx --to md --to json --output ./scratch

 # Convert all supported files in input directory to Markdown, but abort on first error
 docling ./input/dir --output ./scratch --abort-on-error
@@ -38,8 +38,8 @@ docling ./input/dir --output ./scratch --abort-on-error
 ### Setting up a `DocumentConverter`

 To accommodate many input formats, we changed the way you need to set up your `DocumentConverter` object.
-You can now define a list of allowed formats on the `DocumentConverter` initialization, and specify custom options 
-per-format if desired. By default, all supported formats are allowed. If you don't provide `format_options`, defaults 
+You can now define a list of allowed formats on the `DocumentConverter` initialization, and specify custom options
+per-format if desired. By default, all supported formats are allowed. If you don't provide `format_options`, defaults
 will be used for all `allowed_formats`.

 Format options can include the pipeline class to use, the options to provide to the pipeline, and the document backend.
@@ -59,7 +59,7 @@ from docling.datamodel.pipeline_options import PdfPipelineOptions
 from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend

 ## Default initialization still works as before:
-# doc_converter = DocumentConverter() 
+# doc_converter = DocumentConverter()


 # previous `PipelineOptions` is now `PdfPipelineOptions`
@@ -68,7 +68,7 @@ pipeline_options.do_ocr = False
 pipeline_options.do_table_structure = True
 #...

-## Custom options are now defined per format. 
+## Custom options are now defined per format.
 doc_converter = (
    DocumentConverter(  # all of the below is optional, has internal defaults.
        allowed_formats=[
@@ -100,8 +100,8 @@ More options are shown in the following example units:

 ### Converting documents

-We have simplified the way you can feed input to the `DocumentConverter` and renamed the conversion methods for 
-better semantics. You can now call the conversion directly with a single file, or a list of input files, 
+We have simplified the way you can feed input to the `DocumentConverter` and renamed the conversion methods for
+better semantics. You can now call the conversion directly with a single file, or a list of input files,
 or `DocumentStream` objects, without constructing a `DocumentConversionInput` object first.

 * `DocumentConverter.convert` now converts a single file input (previously `DocumentConverter.convert_single`).
@@ -129,7 +129,7 @@ input_files = [
 conv_results_iter = doc_converter.convert_all(input_files) # previously `convert_batch`

 ```
-Through the `raises_on_error` argument, you can also control if the conversion should raise exceptions when first 
+Through the `raises_on_error` argument, you can also control if the conversion should raise exceptions when first
 encountering a problem, or resiliently convert all files first and reflect errors in each file's conversion status.
 By default, any error is immediately raised and the conversion aborts (previously, exceptions were swallowed).

@@ -139,7 +139,7 @@ conv_results_iter = doc_converter.convert_all(input_files, raises_on_error=False

 ```

-### Access document structures 
+### Access document structures

 We have simplified how you can access and export the converted document data, too. Our universal document representation
 is now available in conversion results as a `DoclingDocument` object.
@@ -167,7 +167,7 @@ for item, level in conv_result.document.iterate_items:
 conv_result.legacy_document # provides the representation in previous ExportedCCSDocument type
 ```

-## Export into JSON, Markdown, Doctags
+### Export into JSON, Markdown, Doctags
 **Note**: All `render_...` methods in `ConversionResult` have been removed in Docling v2,
 and are now available on `DoclingDocument` as:

@@ -184,7 +184,7 @@ print(conv_res.document.export_to_markdown())
 print(conv_res.document.export_to_document_tokens())
 ```

-**Note**: While it is deprecated, you can _still_ export Docling v1 JSON format. This is available through the same 
+**Note**: While it is deprecated, you can _still_ export Docling v1 JSON format. This is available through the same
 methods as on the `DoclingDocument` type:
 ```shell
 ## Export legacy document representation to desired format, for v1 compatibility:
@@ -193,7 +193,7 @@ print(conv_res.legacy_document.export_to_markdown())
 print(conv_res.legacy_document.export_to_document_tokens())
 ```

-## Reload a `DoclingDocument` stored as JSON
+### Reload a `DoclingDocument` stored as JSON

 You can save and reload a `DoclingDocument` to disk in JSON format using the following codes:

@@ -211,3 +211,19 @@ with Path("./doc.json").open("r") as fp:

 ```

+### Chunking
+
+Docling v2 defines new base classes for chunking:
+
+- `BaseMeta` for chunk metadata
+- `BaseChunk` containing the chunk text and metadata, and
+- `BaseChunker` for chunkers, producing chunks out of a `DoclingDocument`.
+
+Additionally, it provides an updated `HierarchicalChunker` implementation, which
+leverages the new `DoclingDocument` and provides a new, richer chunk output format, including:
+
+- the respective doc items for grounding
+- any applicable headings for context
+- any applicable captions for context
+
+For an example, check out [Chunking usage](../usage/#chunking).
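
As a rough illustration of how the headings in the chunk metadata can supply context, the sketch below prefixes a chunk's text with its headings. It operates on a plain dictionary shaped like the printed chunk in the usage docs; the `contextualize` helper is hypothetical and not part of `docling_core`.

```python
def contextualize(chunk: dict, sep: str = "\n") -> str:
    """Join a chunk's headings (if any) and its text into one string."""
    headings = chunk.get("meta", {}).get("headings") or []
    return sep.join([*headings, chunk["text"]])


# Plain-dict stand-in for a chunk, copied from the printed example in the usage docs.
chunk = {
    "text": "Lately, new types of ML models for document-layout analysis have emerged [...]",
    "meta": {"headings": ["2 RELATED WORK"]},
}

print(contextualize(chunk))
```

A string built this way keeps the section title next to the passage, which may help when the text is embedded or shown without the surrounding document.
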