docs: add use docling (#150)

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Michele Dolfi
2024-10-17 18:14:48 +02:00
committed by GitHub
parent 24f949ada2
commit 61c092f445
10 changed files with 271 additions and 63 deletions


@@ -2,7 +2,7 @@
Docling v2 introduces several new features:
- Understands and converts PDF, MS Word, MS Powerpoint, HTML and several image formats
- Produces a new, universal document representation which can encapsulate document hierarchy
- Comes with a fresh new API and CLI
@@ -22,7 +22,7 @@ docling myfile.pdf --to json --to md --no-ocr
docling ./input/dir --from pdf
# Convert PDF and Word files in input directory to Markdown and JSON
docling ./input/dir --from pdf --from docx --to md --to json --output ./scratch
# Convert all supported files in input directory to Markdown, but abort on first error
docling ./input/dir --output ./scratch --abort-on-error
@@ -38,8 +38,8 @@ docling ./input/dir --output ./scratch --abort-on-error
### Setting up a `DocumentConverter`
To accommodate many input formats, we changed the way you need to set up your `DocumentConverter` object.
You can now define a list of allowed formats on the `DocumentConverter` initialization, and specify custom options
per-format if desired. By default, all supported formats are allowed. If you don't provide `format_options`, defaults
will be used for all `allowed_formats`.
Format options can include the pipeline class to use, the options to provide to the pipeline, and the document backend.
@@ -59,7 +59,7 @@ from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
## Default initialization still works as before:
# doc_converter = DocumentConverter()
# previous `PipelineOptions` is now `PdfPipelineOptions`
@@ -68,7 +68,7 @@ pipeline_options.do_ocr = False
pipeline_options.do_table_structure = True
#...
## Custom options are now defined per format.
doc_converter = (
DocumentConverter( # all of the below is optional, has internal defaults.
allowed_formats=[
@@ -100,8 +100,8 @@ More options are shown in the following example units:
### Converting documents
We have simplified the way you can feed input to the `DocumentConverter` and renamed the conversion methods for
better semantics. You can now call the conversion directly with a single file, or a list of input files,
or `DocumentStream` objects, without constructing a `DocumentConversionInput` object first.
* `DocumentConverter.convert` now converts a single file input (previously `DocumentConverter.convert_single`).
@@ -129,7 +129,7 @@ input_files = [
conv_results_iter = doc_converter.convert_all(input_files) # previously `convert_batch`
```
Through the `raises_on_error` argument, you can also control if the conversion should raise exceptions when first
encountering a problem, or resiliently convert all files first and reflect errors in each file's conversion status.
By default, any error is immediately raised and the conversion aborts (previously, exceptions were swallowed).
@@ -139,7 +139,7 @@ conv_results_iter = doc_converter.convert_all(input_files, raises_on_error=False
```
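The two error-handling modes described above can be illustrated with a small, library-independent sketch. Everything in it (`SketchResult`, `convert_all_sketch`, `fake_convert`) is a hypothetical stand-in written for this illustration, not Docling's implementation; only the `raises_on_error` name and its default of raising immediately are taken from the text.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class SketchResult:
    """Hypothetical stand-in for a per-file conversion result."""
    source: str
    status: str  # "success" or "failure"
    error: Optional[str] = None


def convert_all_sketch(
    sources: Iterable[str],
    convert_one: Callable[[str], str],
    raises_on_error: bool = True,
) -> Iterator[SketchResult]:
    """Yield one result per source, either failing fast or recording failures."""
    for source in sources:
        try:
            convert_one(source)
        except Exception as exc:
            if raises_on_error:
                raise  # fail fast: abort the batch on the first problem
            # resilient mode: record the failure and continue with the next file
            yield SketchResult(source, "failure", str(exc))
        else:
            yield SketchResult(source, "success")


def fake_convert(source: str) -> str:
    """Toy converter (assumption for this sketch): accepts only names ending in .pdf."""
    if not source.endswith(".pdf"):
        raise ValueError(f"unsupported input: {source}")
    return source.upper()


results = list(
    convert_all_sketch(["a.pdf", "b.xyz", "c.pdf"], fake_convert, raises_on_error=False)
)
failed = [r.source for r in results if r.status == "failure"]
```

With `raises_on_error=False`, the caller gets a result for every input and can inspect each status afterwards; with the default, the first failing input stops the iteration by raising.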
### Access document structures
We have simplified how you can access and export the converted document data, too. Our universal document representation
is now available in conversion results as a `DoclingDocument` object.
@@ -167,7 +167,7 @@ for item, level in conv_result.document.iterate_items:
conv_result.legacy_document # provides the representation in previous ExportedCCSDocument type
```
## Export into JSON, Markdown, Doctags
### Export into JSON, Markdown, Doctags
**Note**: All `render_...` methods in `ConversionResult` have been removed in Docling v2,
and are now available on `DoclingDocument` as:
@@ -184,7 +184,7 @@ print(conv_res.document.export_to_markdown())
print(conv_res.document.export_to_document_tokens())
```
**Note**: While it is deprecated, you can _still_ export Docling v1 JSON format. This is available through the same
methods as on the `DoclingDocument` type:
```python
## Export legacy document representation to desired format, for v1 compatibility:
@@ -193,7 +193,7 @@ print(conv_res.legacy_document.export_to_markdown())
print(conv_res.legacy_document.export_to_document_tokens())
```
## Reload a `DoclingDocument` stored as JSON
### Reload a `DoclingDocument` stored as JSON
You can save a `DoclingDocument` to disk in JSON format and reload it using the following code:
@@ -211,3 +211,19 @@ with Path("./doc.json").open("r") as fp:
```
### Chunking
Docling v2 defines new base classes for chunking:
- `BaseMeta` for chunk metadata
- `BaseChunk` containing the chunk text and metadata, and
- `BaseChunker` for chunkers, producing chunks out of a `DoclingDocument`.
Additionally, it provides an updated `HierarchicalChunker` implementation, which
leverages the new `DoclingDocument` and provides a new, richer chunk output format, including:
- the respective doc items for grounding
- any applicable headings for context
- any applicable captions for context
For an example, check out [Chunking usage](../usage/#chunking).
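To make the division of roles concrete, here is a minimal, self-contained sketch of a metadata class, a chunk class, and a chunker that attaches the applicable headings to each chunk. The classes `SketchMeta`, `SketchChunk`, `SketchChunker` and `HeadingAwareChunker`, and the `(kind, level, text)` input tuples, are hypothetical stand-ins invented for this illustration; they mirror the roles of `BaseMeta`, `BaseChunk` and `BaseChunker` but are not Docling's classes or signatures.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple

# Hypothetical flat input: (kind, heading level, text); kind is "heading" or "text".
Item = Tuple[str, int, str]


@dataclass
class SketchMeta:
    """Stand-in for chunk metadata (the role `BaseMeta` plays)."""
    headings: List[str] = field(default_factory=list)


@dataclass
class SketchChunk:
    """Stand-in for a chunk: text plus metadata (the role `BaseChunk` plays)."""
    text: str
    meta: SketchMeta


class SketchChunker(ABC):
    """Stand-in for the chunker base class (the role `BaseChunker` plays)."""

    @abstractmethod
    def chunk(self, items: List[Item]) -> Iterator[SketchChunk]:
        ...


class HeadingAwareChunker(SketchChunker):
    """Emit one chunk per text item, carrying the headings currently in scope."""

    def chunk(self, items: List[Item]) -> Iterator[SketchChunk]:
        headings: Dict[int, str] = {}
        for kind, level, text in items:
            if kind == "heading":
                # A new heading closes every heading at the same or a deeper level.
                for stale in [k for k in headings if k >= level]:
                    del headings[stale]
                headings[level] = text
            else:
                in_scope = [headings[k] for k in sorted(headings)]
                yield SketchChunk(text=text, meta=SketchMeta(headings=in_scope))


doc_items: List[Item] = [
    ("heading", 1, "Intro"),
    ("text", 0, "Docling converts documents."),
    ("heading", 2, "Formats"),
    ("text", 0, "PDF and DOCX are supported."),
    ("heading", 1, "Usage"),
    ("text", 0, "Run the CLI."),
]
chunks = list(HeadingAwareChunker().chunk(doc_items))
```

Each chunk carries its own heading path, so a downstream consumer can show the context a passage came from without re-reading the whole document.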