mirror of
https://github.com/DS4SD/docling.git
synced 2025-07-27 04:24:45 +00:00
doc refinements
[skip ci] Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
parent
640c7d0c9f
commit
3ec2d9479f
@ -53,7 +53,6 @@ source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL
|
|||||||
converter = DocumentConverter()
|
converter = DocumentConverter()
|
||||||
result = converter.convert(source)
|
result = converter.convert(source)
|
||||||
print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]"
|
print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]"
|
||||||
print(result.document.export_to_document_tokens()) # output: "<document><title><page_1><loc_20>..."
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
|
@ -1,4 +1,4 @@
|
|||||||
With Docling v2, we introduce a unified document representation format called `DoclingDocument`. It is defined as a
|
With Docling v2, we introduce a unified document representation format called `DoclingDocument`. It is defined as a
|
||||||
pydantic datatype, which can express several features common to documents, such as:
|
pydantic datatype, which can express several features common to documents, such as:
|
||||||
|
|
||||||
* Text, Tables, Pictures, and more
|
* Text, Tables, Pictures, and more
|
||||||
@ -9,15 +9,15 @@ pydantic datatype, which can express several features common to documents, such
|
|||||||
|
|
||||||
It also brings a set of document construction APIs to build up a `DoclingDocument` from scratch.
|
It also brings a set of document construction APIs to build up a `DoclingDocument` from scratch.
|
||||||
|
|
||||||
# Example document structures
|
## Example document structures
|
||||||
|
|
||||||
To illustrate the features of the `DoclingDocument` format, consider the following side-by-side comparison of a
|
To illustrate the features of the `DoclingDocument` format, consider the following side-by-side comparison of a
|
||||||
`DoclingDocument` converted from `test/data/word_sample.docx`. Left side shows snippets from the converted document
|
`DoclingDocument` converted from `test/data/word_sample.docx`. Left side shows snippets from the converted document
|
||||||
serialized as YAML, right side shows the corresponding visual parts in MS Word.
|
serialized as YAML, right side shows the corresponding visual parts in MS Word.
|
||||||
|
|
||||||
## Basic structure
|
### Basic structure
|
||||||
|
|
||||||
A `DoclingDocument` exposes top-level fields for the document content, organized in two categories.
|
A `DoclingDocument` exposes top-level fields for the document content, organized in two categories.
|
||||||
The first category is the _content items_, which are stored in these fields:
|
The first category is the _content items_, which are stored in these fields:
|
||||||
|
|
||||||
- `texts`: All items that have a text representation (paragraph, section heading, equation, ...). Base class is `TextItem`.
|
- `texts`: All items that have a text representation (paragraph, section heading, equation, ...). Base class is `TextItem`.
|
||||||
@ -34,32 +34,32 @@ The second category is _content structure_, which is encapsualted in:
|
|||||||
- `furniture`: The root node of a tree-structure for all items that don't belong into the body (headers, footers, ...)
|
- `furniture`: The root node of a tree-structure for all items that don't belong into the body (headers, footers, ...)
|
||||||
- `groups`: A set of items that don't represent content, but act as containers for other content items (e.g. a list, a chapter)
|
- `groups`: A set of items that don't represent content, but act as containers for other content items (e.g. a list, a chapter)
|
||||||
|
|
||||||
All of the above fields are only storing `NodeItem` instances, which reference children and parents
|
All of the above fields are only storing `NodeItem` instances, which reference children and parents
|
||||||
through JSON pointers.
|
through JSON pointers.
|
||||||
|
|
||||||
The reading order of the document is encapsulated through the `body` tree and the order of _children_ in each item
|
The reading order of the document is encapsulated through the `body` tree and the order of _children_ in each item
|
||||||
in the tree.
|
in the tree.
|
||||||
|
|
||||||
Below example shows how all items in the first page are nested below the `title` item (`#/texts/1`).
|
Below example shows how all items in the first page are nested below the `title` item (`#/texts/1`).
|
||||||
|
|
||||||

|

|
||||||
|
|
||||||
## Grouping
|
### Grouping
|
||||||
|
|
||||||
Below example shows how all items under the heading "Let's swim" (`#/texts/5`) are nested as chilrden. The children of
|
Below example shows how all items under the heading "Let's swim" (`#/texts/5`) are nested as chilrden. The children of
|
||||||
"Let's swim" are both text items and groups, which contain the list elements. The group items are stored in the
|
"Let's swim" are both text items and groups, which contain the list elements. The group items are stored in the
|
||||||
top-level `groups` field.
|
top-level `groups` field.
|
||||||
|
|
||||||

|

|
||||||
|
|
||||||
## Tables
|
### Tables
|
||||||
|
|
||||||
TBD
|
TBD
|
||||||
|
|
||||||
## Pictures
|
### Pictures
|
||||||
|
|
||||||
TBD
|
TBD
|
||||||
|
|
||||||
## Provenance
|
### Provenance
|
||||||
|
|
||||||
TBD
|
TBD
|
||||||
|
@ -1,4 +1,6 @@
|
|||||||
## Convert a single document
|
## Conversion
|
||||||
|
|
||||||
|
### Convert a single document
|
||||||
|
|
||||||
To convert invidual PDF documents, use `convert()`, for example:
|
To convert invidual PDF documents, use `convert()`, for example:
|
||||||
|
|
||||||
@ -8,11 +10,10 @@ from docling.document_converter import DocumentConverter
|
|||||||
source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL
|
source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL
|
||||||
converter = DocumentConverter()
|
converter = DocumentConverter()
|
||||||
result = converter.convert(source)
|
result = converter.convert(source)
|
||||||
print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]"
|
print(result.document.export_to_markdown()) # output: "### Docling Technical Report[...]"
|
||||||
print(result.document.export_to_document_tokens()) # output: "<document><title><page_1><loc_20>..."
|
|
||||||
```
|
```
|
||||||
|
|
||||||
## CLI
|
### CLI
|
||||||
|
|
||||||
You can also use Docling directly from your command line to convert individual files —be it local or by URL— or whole directories.
|
You can also use Docling directly from your command line to convert individual files —be it local or by URL— or whole directories.
|
||||||
|
|
||||||
@ -31,8 +32,8 @@ To see all available options (export formats etc.) run `docling --help`.
|
|||||||
```console
|
```console
|
||||||
$ docling --help
|
$ docling --help
|
||||||
|
|
||||||
Usage: docling [OPTIONS] source
|
Usage: docling [OPTIONS] source
|
||||||
|
|
||||||
╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
|
╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
|
||||||
│ * input_sources source PDF files to convert. Can be local file / directory paths or URL. [default: None] │
|
│ * input_sources source PDF files to convert. Can be local file / directory paths or URL. [default: None] │
|
||||||
│ [required] │
|
│ [required] │
|
||||||
@ -61,15 +62,15 @@ To see all available options (export formats etc.) run `docling --help`.
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
## Advanced options
|
### Advanced options
|
||||||
|
|
||||||
### Adjust pipeline features
|
#### Adjust pipeline features
|
||||||
|
|
||||||
The example file [custom_convert.py](./examples/custom_convert.py) contains multiple ways
|
The example file [custom_convert.py](./examples/custom_convert.py) contains multiple ways
|
||||||
one can adjust the conversion pipeline and features.
|
one can adjust the conversion pipeline and features.
|
||||||
|
|
||||||
|
|
||||||
#### Control PDF table extraction options
|
##### Control PDF table extraction options
|
||||||
|
|
||||||
You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself.
|
You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself.
|
||||||
This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
|
This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
|
||||||
@ -107,7 +108,7 @@ doc_converter = DocumentConverter(
|
|||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
### Impose limits on the document size
|
#### Impose limits on the document size
|
||||||
|
|
||||||
You can limit the file size and number of pages which should be allowed to process per document:
|
You can limit the file size and number of pages which should be allowed to process per document:
|
||||||
|
|
||||||
@ -120,7 +121,7 @@ converter = DocumentConverter()
|
|||||||
result = converter.convert(source, max_num_pages=100, max_file_size=20971520)
|
result = converter.convert(source, max_num_pages=100, max_file_size=20971520)
|
||||||
```
|
```
|
||||||
|
|
||||||
### Convert from binary PDF streams
|
#### Convert from binary PDF streams
|
||||||
|
|
||||||
You can convert PDFs from a binary stream instead of from the filesystem as follows:
|
You can convert PDFs from a binary stream instead of from the filesystem as follows:
|
||||||
|
|
||||||
@ -135,12 +136,12 @@ converter = DocumentConverter()
|
|||||||
result = converter.convert(source)
|
result = converter.convert(source)
|
||||||
```
|
```
|
||||||
|
|
||||||
### Limit resource usage
|
#### Limit resource usage
|
||||||
|
|
||||||
You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
|
You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
|
||||||
|
|
||||||
|
|
||||||
### Chunking
|
## Chunking
|
||||||
|
|
||||||
You can perform a hierarchy-aware chunking of a Docling document as follows:
|
You can perform a hierarchy-aware chunking of a Docling document as follows:
|
||||||
|
|
42
docs/v2.md
@ -2,7 +2,7 @@
|
|||||||
|
|
||||||
Docling v2 introduces several new features:
|
Docling v2 introduces several new features:
|
||||||
|
|
||||||
- Understands and converts PDF, MS Word, MS Powerpoint, HTML and several image formats
|
- Understands and converts PDF, MS Word, MS Powerpoint, HTML and several image formats
|
||||||
- Produces a new, universal document representation which can encapsulate document hierarchy
|
- Produces a new, universal document representation which can encapsulate document hierarchy
|
||||||
- Comes with a fresh new API and CLI
|
- Comes with a fresh new API and CLI
|
||||||
|
|
||||||
@ -22,7 +22,7 @@ docling myfile.pdf --to json --to md --no-ocr
|
|||||||
docling ./input/dir --from pdf
|
docling ./input/dir --from pdf
|
||||||
|
|
||||||
# Convert PDF and Word files in input directory to Markdown and JSON
|
# Convert PDF and Word files in input directory to Markdown and JSON
|
||||||
docling ./input/dir --from pdf --from docx --to md --to json --output ./scratch
|
docling ./input/dir --from pdf --from docx --to md --to json --output ./scratch
|
||||||
|
|
||||||
# Convert all supported files in input directory to Markdown, but abort on first error
|
# Convert all supported files in input directory to Markdown, but abort on first error
|
||||||
docling ./input/dir --output ./scratch --abort-on-error
|
docling ./input/dir --output ./scratch --abort-on-error
|
||||||
@ -38,8 +38,8 @@ docling ./input/dir --output ./scratch --abort-on-error
|
|||||||
### Setting up a `DocumentConverter`
|
### Setting up a `DocumentConverter`
|
||||||
|
|
||||||
To accomodate many input formats, we changed the way you need to set up your `DocumentConverter` object.
|
To accomodate many input formats, we changed the way you need to set up your `DocumentConverter` object.
|
||||||
You can now define a list of allowed formats on the `DocumentConverter` initialization, and specify custom options
|
You can now define a list of allowed formats on the `DocumentConverter` initialization, and specify custom options
|
||||||
per-format if desired. By default, all supported formats are allowed. If you don't provide `format_options`, defaults
|
per-format if desired. By default, all supported formats are allowed. If you don't provide `format_options`, defaults
|
||||||
will be used for all `allowed_formats`.
|
will be used for all `allowed_formats`.
|
||||||
|
|
||||||
Format options can include the pipeline class to use, the options to provide to the pipeline, and the document backend.
|
Format options can include the pipeline class to use, the options to provide to the pipeline, and the document backend.
|
||||||
@ -59,7 +59,7 @@ from docling.datamodel.pipeline_options import PdfPipelineOptions
|
|||||||
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
|
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
|
||||||
|
|
||||||
## Default initialization still works as before:
|
## Default initialization still works as before:
|
||||||
# doc_converter = DocumentConverter()
|
# doc_converter = DocumentConverter()
|
||||||
|
|
||||||
|
|
||||||
# previous `PipelineOptions` is now `PdfPipelineOptions`
|
# previous `PipelineOptions` is now `PdfPipelineOptions`
|
||||||
@ -68,7 +68,7 @@ pipeline_options.do_ocr = False
|
|||||||
pipeline_options.do_table_structure = True
|
pipeline_options.do_table_structure = True
|
||||||
#...
|
#...
|
||||||
|
|
||||||
## Custom options are now defined per format.
|
## Custom options are now defined per format.
|
||||||
doc_converter = (
|
doc_converter = (
|
||||||
DocumentConverter( # all of the below is optional, has internal defaults.
|
DocumentConverter( # all of the below is optional, has internal defaults.
|
||||||
allowed_formats=[
|
allowed_formats=[
|
||||||
@ -100,8 +100,8 @@ More options are shown in the following example units:
|
|||||||
|
|
||||||
### Converting documents
|
### Converting documents
|
||||||
|
|
||||||
We have simplified the way you can feed input to the `DocumentConverter` and renamed the conversion methods for
|
We have simplified the way you can feed input to the `DocumentConverter` and renamed the conversion methods for
|
||||||
better semantics. You can now call the conversion directly with a single file, or a list of input files,
|
better semantics. You can now call the conversion directly with a single file, or a list of input files,
|
||||||
or `DocumentStream` objects, without constructing a `DocumentConversionInput` object first.
|
or `DocumentStream` objects, without constructing a `DocumentConversionInput` object first.
|
||||||
|
|
||||||
* `DocumentConverter.convert` now converts a single file input (previously `DocumentConverter.convert_single`).
|
* `DocumentConverter.convert` now converts a single file input (previously `DocumentConverter.convert_single`).
|
||||||
@ -129,7 +129,7 @@ input_files = [
|
|||||||
conv_results_iter = doc_converter.convert_all(input_files) # previously `convert_batch`
|
conv_results_iter = doc_converter.convert_all(input_files) # previously `convert_batch`
|
||||||
|
|
||||||
```
|
```
|
||||||
Through the `raises_on_error` argument, you can also control if the conversion should raise exceptions when first
|
Through the `raises_on_error` argument, you can also control if the conversion should raise exceptions when first
|
||||||
encountering a problem, or resiliently convert all files first and reflect errors in each file's conversion status.
|
encountering a problem, or resiliently convert all files first and reflect errors in each file's conversion status.
|
||||||
By default, any error is immediately raised and the conversion aborts (previously, exceptions were swallowed).
|
By default, any error is immediately raised and the conversion aborts (previously, exceptions were swallowed).
|
||||||
|
|
||||||
@ -139,7 +139,7 @@ conv_results_iter = doc_converter.convert_all(input_files, raises_on_error=False
|
|||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### Access document structures
|
### Access document structures
|
||||||
|
|
||||||
We have simplified how you can access and export the converted document data, too. Our universal document representation
|
We have simplified how you can access and export the converted document data, too. Our universal document representation
|
||||||
is now available in conversion results as a `DoclingDocument` object.
|
is now available in conversion results as a `DoclingDocument` object.
|
||||||
@ -167,7 +167,7 @@ for item, level in conv_result.document.iterate_items:
|
|||||||
conv_result.legacy_document # provides the representation in previous ExportedCCSDocument type
|
conv_result.legacy_document # provides the representation in previous ExportedCCSDocument type
|
||||||
```
|
```
|
||||||
|
|
||||||
## Export into JSON, Markdown, Doctags
|
### Export into JSON, Markdown, Doctags
|
||||||
**Note**: All `render_...` methods in `ConversionResult` have been removed in Docling v2,
|
**Note**: All `render_...` methods in `ConversionResult` have been removed in Docling v2,
|
||||||
and are now available on `DoclingDocument` as:
|
and are now available on `DoclingDocument` as:
|
||||||
|
|
||||||
@ -184,7 +184,7 @@ print(conv_res.document.export_to_markdown())
|
|||||||
print(conv_res.document.export_to_document_tokens())
|
print(conv_res.document.export_to_document_tokens())
|
||||||
```
|
```
|
||||||
|
|
||||||
**Note**: While it is deprecated, you can _still_ export Docling v1 JSON format. This is available through the same
|
**Note**: While it is deprecated, you can _still_ export Docling v1 JSON format. This is available through the same
|
||||||
methods as on the `DoclingDocument` type:
|
methods as on the `DoclingDocument` type:
|
||||||
```shell
|
```shell
|
||||||
## Export legacy document representation to desired format, for v1 compatibility:
|
## Export legacy document representation to desired format, for v1 compatibility:
|
||||||
@ -193,7 +193,7 @@ print(conv_res.legacy_document.export_to_markdown())
|
|||||||
print(conv_res.legacy_document.export_to_document_tokens())
|
print(conv_res.legacy_document.export_to_document_tokens())
|
||||||
```
|
```
|
||||||
|
|
||||||
## Reload a `DoclingDocument` stored as JSON
|
### Reload a `DoclingDocument` stored as JSON
|
||||||
|
|
||||||
You can save and reload a `DoclingDocument` to disk in JSON format using the following codes:
|
You can save and reload a `DoclingDocument` to disk in JSON format using the following codes:
|
||||||
|
|
||||||
@ -211,3 +211,19 @@ with Path("./doc.json").open("r") as fp:
|
|||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### Chunking
|
||||||
|
|
||||||
|
Docling v2 defines new base classes for chunking:
|
||||||
|
|
||||||
|
- `BaseMeta` for chunk metadata
|
||||||
|
- `BaseChunk` containing the chunk text and metadata, and
|
||||||
|
- `BaseChunker` for chunkers, producing chunks out of a `DoclingDocument`.
|
||||||
|
|
||||||
|
Additionally, it provides an updated `HierarchicalChunker` implementation, which
|
||||||
|
leverages the new `DoclingDocument` and provides a new, richer chunk output format, including:
|
||||||
|
|
||||||
|
- the respective doc items for grounding
|
||||||
|
- any applicable headings for context
|
||||||
|
- any applicable captions for context
|
||||||
|
|
||||||
|
For an example, check out [Chunking usage](../usage/#chunking).
|
||||||
|
@ -54,10 +54,10 @@ nav:
|
|||||||
- Get started:
|
- Get started:
|
||||||
- Home: index.md
|
- Home: index.md
|
||||||
- Installation: installation.md
|
- Installation: installation.md
|
||||||
- Usage: use_docling.md
|
- Usage: usage.md
|
||||||
- Docling v2: v2.md
|
- Docling v2: v2.md
|
||||||
- Concepts:
|
- Concepts:
|
||||||
- The Docling Document format: concepts/docling_format.md
|
- Docling Document: concepts/docling_format.md
|
||||||
# - Chunking: concepts/chunking.md
|
# - Chunking: concepts/chunking.md
|
||||||
- Examples:
|
- Examples:
|
||||||
- Conversion:
|
- Conversion:
|
||||||
|