mirror of
https://github.com/DS4SD/docling.git
synced 2025-07-27 04:24:45 +00:00
doc refinements
[skip ci] Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
parent
640c7d0c9f
commit
3ec2d9479f
@@ -53,7 +53,6 @@ source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL
 converter = DocumentConverter()
 result = converter.convert(source)
 print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]"
-print(result.document.export_to_document_tokens()) # output: "<document><title><page_1><loc_20>..."
 ```

@@ -9,13 +9,13 @@ pydantic datatype, which can express several features common to documents, such

 It also brings a set of document construction APIs to build up a `DoclingDocument` from scratch.

-# Example document structures
+## Example document structures

 To illustrate the features of the `DoclingDocument` format, consider the following side-by-side comparison of a
 `DoclingDocument` converted from `test/data/word_sample.docx`. Left side shows snippets from the converted document
 serialized as YAML, right side shows the corresponding visual parts in MS Word.

-## Basic structure
+### Basic structure

 A `DoclingDocument` exposes top-level fields for the document content, organized in two categories.
 The first category is the _content items_, which are stored in these fields:
@@ -44,7 +44,7 @@ Below example shows how all items in the first page are nested below the `title`

 ![doc_hierarchy_1](../assets/docling_doc_hierarchy_1.png)

-## Grouping
+### Grouping

 Below example shows how all items under the heading "Let's swim" (`#/texts/5`) are nested as children. The children of
 "Let's swim" are both text items and groups, which contain the list elements. The group items are stored in the
@@ -52,14 +52,14 @@ top-level `groups` field.

 ![doc_hierarchy_2](../assets/docling_doc_hierarchy_2.png)

-## Tables
+### Tables

 TBD

-## Pictures
+### Pictures

 TBD

-## Provenance
+### Provenance

 TBD
@@ -1,4 +1,6 @@
-## Convert a single document
+## Conversion
+
+### Convert a single document

 To convert individual PDF documents, use `convert()`, for example:

@@ -8,11 +10,10 @@ from docling.document_converter import DocumentConverter
 source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL
 converter = DocumentConverter()
 result = converter.convert(source)
-print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]"
-print(result.document.export_to_document_tokens()) # output: "<document><title><page_1><loc_20>..."
+print(result.document.export_to_markdown()) # output: "### Docling Technical Report[...]"
 ```

-## CLI
+### CLI

 You can also use Docling directly from your command line to convert individual files —be it local or by URL— or whole directories.

@@ -61,15 +62,15 @@ To see all available options (export formats etc.) run `docling --help`.



-## Advanced options
+### Advanced options

-### Adjust pipeline features
+#### Adjust pipeline features

 The example file [custom_convert.py](./examples/custom_convert.py) contains multiple ways
 one can adjust the conversion pipeline and features.


-#### Control PDF table extraction options
+##### Control PDF table extraction options

 You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself.
 This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
@@ -107,7 +108,7 @@ doc_converter = DocumentConverter(
 )
 ```

-### Impose limits on the document size
+#### Impose limits on the document size

 You can limit the file size and number of pages which should be allowed to process per document:

@@ -120,7 +121,7 @@ converter = DocumentConverter()
 result = converter.convert(source, max_num_pages=100, max_file_size=20971520)
 ```
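The `max_file_size` literal in the call above equals 20 × 1024 × 1024, which suggests a 20 MiB limit expressed in bytes. A minimal sketch of that arithmetic, assuming the limit is indeed a byte count; the helper `mib()` and the constant names are illustrative and not part of Docling:

```python
def mib(n: int) -> int:
    """Convert mebibytes to bytes (hypothetical helper, not a Docling API)."""
    return n * 1024 * 1024


MAX_NUM_PAGES = 100
MAX_FILE_SIZE = mib(20)  # same value as the literal 20971520 used above

# With Docling installed, the limits would be passed as in the original example:
# result = converter.convert(source, max_num_pages=MAX_NUM_PAGES, max_file_size=MAX_FILE_SIZE)
print(MAX_FILE_SIZE)
```

Naming the limits this way mainly makes the unit visible at the call site.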

-### Convert from binary PDF streams
+#### Convert from binary PDF streams

 You can convert PDFs from a binary stream instead of from the filesystem as follows:

@@ -135,12 +136,12 @@ converter = DocumentConverter()
 result = converter.convert(source)
 ```

-### Limit resource usage
+#### Limit resource usage

 You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
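One way to apply this from Python is to set the variable before Docling is imported. This is a minimal sketch: the helper name `limit_cpu_threads` is hypothetical, and the import-order advice is an assumption based on how OpenMP runtimes typically read the variable at start-up, not something stated here.

```python
import os


def limit_cpu_threads(n: int) -> None:
    """Set OMP_NUM_THREADS for this process (hypothetical helper, not a Docling API)."""
    if n < 1:
        raise ValueError("thread count must be at least 1")
    os.environ["OMP_NUM_THREADS"] = str(n)


limit_cpu_threads(2)

# Assumption: import Docling only after the variable is set, so that the
# libraries it loads see the new value.
# from docling.document_converter import DocumentConverter
```

Exporting the variable in the shell before launching the process has the same effect and avoids any dependence on import order.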


-### Chunking
+## Chunking

 You can perform a hierarchy-aware chunking of a Docling document as follows:

20
docs/v2.md
@@ -167,7 +167,7 @@ for item, level in conv_result.document.iterate_items:
 conv_result.legacy_document # provides the representation in previous ExportedCCSDocument type
 ```

-## Export into JSON, Markdown, Doctags
+### Export into JSON, Markdown, Doctags
 **Note**: All `render_...` methods in `ConversionResult` have been removed in Docling v2,
 and are now available on `DoclingDocument` as:

@@ -193,7 +193,7 @@ print(conv_res.legacy_document.export_to_markdown())
 print(conv_res.legacy_document.export_to_document_tokens())
 ```

-## Reload a `DoclingDocument` stored as JSON
+### Reload a `DoclingDocument` stored as JSON

 You can save and reload a `DoclingDocument` to disk in JSON format using the following code:

@@ -211,3 +211,19 @@ with Path("./doc.json").open("r") as fp:

 ```

+### Chunking
+
+Docling v2 defines new base classes for chunking:
+
+- `BaseMeta` for chunk metadata
+- `BaseChunk` containing the chunk text and metadata, and
+- `BaseChunker` for chunkers, producing chunks out of a `DoclingDocument`.
+
+Additionally, it provides an updated `HierarchicalChunker` implementation, which
+leverages the new `DoclingDocument` and provides a new, richer chunk output format, including:
+
+- the respective doc items for grounding
+- any applicable headings for context
+- any applicable captions for context
+
+For an example, check out [Chunking usage](../usage/#chunking).
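To make the richer chunk format described above more concrete, here is a rough sketch of a chunk that carries its text plus grounding items, headings, and captions. The classes `SketchMeta` and `SketchChunk`, the function `with_context`, and the sample values are simplified stand-ins invented for illustration; they are not Docling's `BaseMeta` or `BaseChunk`, whose actual fields may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchMeta:
    """Illustrative chunk metadata (a stand-in, not Docling's `BaseMeta`)."""
    doc_items: List[str] = field(default_factory=list)  # references to source items
    headings: List[str] = field(default_factory=list)   # applicable headings
    captions: List[str] = field(default_factory=list)   # applicable captions


@dataclass
class SketchChunk:
    """Illustrative chunk: text plus metadata (a stand-in, not Docling's `BaseChunk`)."""
    text: str
    meta: SketchMeta


def with_context(chunk: SketchChunk) -> str:
    """Prefix the chunk text with its headings and captions, one per line."""
    return "\n".join([*chunk.meta.headings, *chunk.meta.captions, chunk.text])


# Made-up sample values for illustration only.
chunk = SketchChunk(
    text="Swimming is fun.",
    meta=SketchMeta(doc_items=["#/texts/6"], headings=["Let's swim"]),
)
print(with_context(chunk))
```

Keeping headings and captions in the metadata rather than in the text lets a consumer decide whether to prepend them, for example when building embedding input.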
@@ -54,10 +54,10 @@ nav:
   - Get started:
     - Home: index.md
     - Installation: installation.md
-    - Usage: use_docling.md
+    - Usage: usage.md
     - Docling v2: v2.md
   - Concepts:
-    - The Docling Document format: concepts/docling_format.md
+    - Docling Document: concepts/docling_format.md
     # - Chunking: concepts/chunking.md
   - Examples:
     - Conversion: