doc refinements (#152)

* doc refinements

[skip ci]

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* improve docling document concept page

[skip ci]

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Panos Vagenas 2024-10-17 10:01:53 +02:00 committed by GitHub
parent 3d4967f0db
commit 8fcf741ad0
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
5 changed files with 62 additions and 45 deletions

View File

@ -53,7 +53,6 @@ source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL
converter = DocumentConverter() converter = DocumentConverter()
result = converter.convert(source) result = converter.convert(source)
print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]" print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]"
print(result.document.export_to_document_tokens()) # output: "<document><title><page_1><loc_20>..."
``` ```

View File

@ -9,13 +9,14 @@ pydantic datatype, which can express several features common to documents, such
It also brings a set of document construction APIs to build up a `DoclingDocument` from scratch. It also brings a set of document construction APIs to build up a `DoclingDocument` from scratch.
# Example document structures ## Example document structures
To illustrate the features of the `DoclingDocument` format, consider the following side-by-side comparison of a To illustrate the features of the `DoclingDocument` format, in the subsections below we consider the
`DoclingDocument` converted from `test/data/word_sample.docx`. Left side shows snippets from the converted document `DoclingDocument` converted from `tests/data/word_sample.docx` and we present some side-by-side comparisons,
serialized as YAML, right side shows the corresponding visual parts in MS Word. where the left side shows snippets from the converted document
serialized as YAML and the right one shows the corresponding parts of the original MS Word.
## Basic structure ### Basic structure
A `DoclingDocument` exposes top-level fields for the document content, organized in two categories. A `DoclingDocument` exposes top-level fields for the document content, organized in two categories.
The first category is the _content items_, which are stored in these fields: The first category is the _content items_, which are stored in these fields:
@ -44,7 +45,7 @@ Below example shows how all items in the first page are nested below the `title`
![doc_hierarchy_1](../assets/docling_doc_hierarchy_1.png) ![doc_hierarchy_1](../assets/docling_doc_hierarchy_1.png)
## Grouping ### Grouping
Below example shows how all items under the heading "Let's swim" (`#/texts/5`) are nested as children. The children of Below example shows how all items under the heading "Let's swim" (`#/texts/5`) are nested as children. The children of
"Let's swim" are both text items and groups, which contain the list elements. The group items are stored in the "Let's swim" are both text items and groups, which contain the list elements. The group items are stored in the
@ -52,14 +53,14 @@ top-level `groups` field.
![doc_hierarchy_2](../assets/docling_doc_hierarchy_2.png) ![doc_hierarchy_2](../assets/docling_doc_hierarchy_2.png)
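The `#/texts/5` notation in the text above reads like a JSON-pointer-style path into the document's top-level fields. Below is a minimal sketch of how such a reference could be resolved against a plain dictionary. The `resolve_ref` helper and the toy `doc` contents are hypothetical illustrations, not part of Docling's API.

```python
from typing import Any


def resolve_ref(doc: dict[str, Any], ref: str) -> Any:
    """Follow a '#/field/index' style reference into nested dicts and lists."""
    if not ref.startswith("#/"):
        raise ValueError(f"unsupported reference: {ref!r}")
    node: Any = doc
    for part in ref[2:].split("/"):
        # Lists are indexed by integer position, dicts by key.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# Toy data (hypothetical), only mimicking the top-level `texts` and `groups` fields.
doc = {
    "texts": [{"text": "Title"}, {"text": "Let's swim"}],
    "groups": [{"name": "list"}],
}

heading = resolve_ref(doc, "#/texts/1")
group = resolve_ref(doc, "#/groups/0")
```

Keeping items in flat top-level lists and linking them by reference, as sketched here, is one way to express a hierarchy without nesting the items themselves.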
## Tables ### Tables
TBD TBD
## Pictures ### Pictures
TBD TBD
## Provenance ### Provenance
TBD TBD

View File

@ -1,4 +1,6 @@
## Convert a single document ## Conversion
### Convert a single document
To convert individual PDF documents, use `convert()`, for example: To convert individual PDF documents, use `convert()`, for example:
@ -8,11 +10,10 @@ from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL
converter = DocumentConverter() converter = DocumentConverter()
result = converter.convert(source) result = converter.convert(source)
print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]" print(result.document.export_to_markdown()) # output: "### Docling Technical Report[...]"
print(result.document.export_to_document_tokens()) # output: "<document><title><page_1><loc_20>..."
``` ```
## CLI ### CLI
You can also use Docling directly from your command line to convert individual files —be it local or by URL— or whole directories. You can also use Docling directly from your command line to convert individual files —be it local or by URL— or whole directories.
@ -61,15 +62,15 @@ To see all available options (export formats etc.) run `docling --help`.
## Advanced options ### Advanced options
### Adjust pipeline features #### Adjust pipeline features
The example file [custom_convert.py](./examples/custom_convert.py) contains multiple ways The example file [custom_convert.py](./examples/custom_convert.py) contains multiple ways
one can adjust the conversion pipeline and features. one can adjust the conversion pipeline and features.
#### Control PDF table extraction options ##### Control PDF table extraction options
You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself. You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself.
This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one. This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
@ -107,7 +108,7 @@ doc_converter = DocumentConverter(
) )
``` ```
### Impose limits on the document size #### Impose limits on the document size
You can limit the file size and number of pages which should be allowed to process per document: You can limit the file size and number of pages which should be allowed to process per document:
@ -120,7 +121,7 @@ converter = DocumentConverter()
result = converter.convert(source, max_num_pages=100, max_file_size=20971520) result = converter.convert(source, max_num_pages=100, max_file_size=20971520)
``` ```
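The `max_file_size` value in the snippet above is given in bytes; 20971520 corresponds to 20 MiB. A small sketch of deriving the number instead of hard-coding it follows. The constant name `MAX_FILE_SIZE_MIB` is an illustrative assumption.

```python
# Express the limit in MiB and convert it to bytes.
MAX_FILE_SIZE_MIB = 20  # hypothetical name, chosen for illustration
max_file_size = MAX_FILE_SIZE_MIB * 1024 * 1024  # bytes

# The value could then be passed as in the snippet above:
# converter.convert(source, max_num_pages=100, max_file_size=max_file_size)
print(max_file_size)  # prints 20971520
```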
### Convert from binary PDF streams #### Convert from binary PDF streams
You can convert PDFs from a binary stream instead of from the filesystem as follows: You can convert PDFs from a binary stream instead of from the filesystem as follows:
@ -135,12 +136,12 @@ converter = DocumentConverter()
result = converter.convert(source) result = converter.convert(source)
``` ```
### Limit resource usage #### Limit resource usage
You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads. You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
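As a rough sketch, the variable can be set from Python as well as from the shell. This assumes it has to be in place before the libraries that read it are loaded, which may not hold in every setup; exporting it in the shell before starting the process avoids that question.

```python
import os

# Assumption: set the variable before importing Docling, so that any library
# reading it at import time sees the intended value.
os.environ["OMP_NUM_THREADS"] = "2"

# Import Docling only after the variable is set, e.g.:
# from docling.document_converter import DocumentConverter
```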
### Chunking ## Chunking
You can perform a hierarchy-aware chunking of a Docling document as follows: You can perform a hierarchy-aware chunking of a Docling document as follows:

View File

@ -167,7 +167,7 @@ for item, level in conv_result.document.iterate_items:
conv_result.legacy_document # provides the representation in previous ExportedCCSDocument type conv_result.legacy_document # provides the representation in previous ExportedCCSDocument type
``` ```
## Export into JSON, Markdown, Doctags ### Export into JSON, Markdown, Doctags
**Note**: All `render_...` methods in `ConversionResult` have been removed in Docling v2, **Note**: All `render_...` methods in `ConversionResult` have been removed in Docling v2,
and are now available on `DoclingDocument` as: and are now available on `DoclingDocument` as:
@ -193,7 +193,7 @@ print(conv_res.legacy_document.export_to_markdown())
print(conv_res.legacy_document.export_to_document_tokens()) print(conv_res.legacy_document.export_to_document_tokens())
``` ```
## Reload a `DoclingDocument` stored as JSON ### Reload a `DoclingDocument` stored as JSON
You can save and reload a `DoclingDocument` to disk in JSON format using the following code: You can save and reload a `DoclingDocument` to disk in JSON format using the following code:
@ -211,3 +211,19 @@ with Path("./doc.json").open("r") as fp:
``` ```
### Chunking
Docling v2 defines new base classes for chunking:
- `BaseMeta` for chunk metadata
- `BaseChunk` containing the chunk text and metadata, and
- `BaseChunker` for chunkers, producing chunks out of a `DoclingDocument`.
Additionally, it provides an updated `HierarchicalChunker` implementation, which
leverages the new `DoclingDocument` and provides a new, richer chunk output format, including:
- the respective doc items for grounding
- any applicable headings for context
- any applicable captions for context
For an example, check out [Chunking usage](../usage/#chunking).
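To make the relationship between the three roles concrete, here is a self-contained toy sketch. It does not use Docling's classes: `SketchMeta`, `SketchChunk`, `SketchChunker`, and `HeadingAwareChunker` are hypothetical stand-ins, and the document is reduced to a list of `(kind, text)` tuples. The real base classes may differ in fields and signatures.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class SketchMeta:
    """Stand-in for chunk metadata: headings and captions kept for context."""
    headings: list[str] = field(default_factory=list)
    captions: list[str] = field(default_factory=list)


@dataclass
class SketchChunk:
    """Stand-in for a chunk: the text plus its metadata."""
    text: str
    meta: SketchMeta


class SketchChunker(ABC):
    """Stand-in for a chunker: turns a document into a stream of chunks."""

    @abstractmethod
    def chunk(self, doc: list[tuple[str, str]]) -> Iterator[SketchChunk]:
        ...


class HeadingAwareChunker(SketchChunker):
    """Emit one chunk per paragraph, tagged with the most recent heading."""

    def chunk(self, doc: list[tuple[str, str]]) -> Iterator[SketchChunk]:
        current_heading: Optional[str] = None
        for kind, text in doc:
            if kind == "heading":
                current_heading = text
            else:
                headings = [current_heading] if current_heading else []
                yield SketchChunk(text=text, meta=SketchMeta(headings=headings))


# Toy document (hypothetical): one heading followed by one paragraph.
toy_doc = [("heading", "Let's swim"), ("paragraph", "Swimming is fun.")]
chunks = list(HeadingAwareChunker().chunk(toy_doc))
```

Separating metadata, chunk, and chunker in this way lets different chunking strategies share one output shape, which is the idea the list above describes.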

View File

@ -54,10 +54,10 @@ nav:
- Get started: - Get started:
- Home: index.md - Home: index.md
- Installation: installation.md - Installation: installation.md
- Usage: use_docling.md - Usage: usage.md
- Docling v2: v2.md - Docling v2: v2.md
- Concepts: - Concepts:
- The Docling Document format: concepts/docling_format.md - Docling Document: concepts/docling_document.md
# - Chunking: concepts/chunking.md # - Chunking: concepts/chunking.md
- Examples: - Examples:
- Conversion: - Conversion: