docling/docs/v2.md
Christoph Auer 74e0452b6a Add migration instructions to doc (wip)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-15 17:08:48 +02:00

What's new

Docling v2 introduces several new features:

  • Understands and converts PDF, MS Word, MS PowerPoint, HTML, and several image formats
  • Produces a new, universal document representation which can encapsulate document hierarchy
  • Comes with a fresh new API and CLI

Migration from v1

Setting up a DocumentConverter

To accommodate many input formats, we changed how you set up your DocumentConverter object. You can now define a list of allowed formats when initializing the DocumentConverter, and specify custom options per format if desired. By default, all supported formats are allowed. If you don't provide format_options, defaults are used for every format in allowed_formats.

Format options can include the pipeline class to use, the options to provide to the pipeline, and the document backend. They are provided as format-specific types, such as PdfFormatOption or WordFormatOption, as seen below.

from docling.datamodel.base_models import InputFormat
from docling.document_converter import (
    DocumentConverter,
    PdfFormatOption,
    WordFormatOption,
)
from docling.pipeline.simple_pipeline import SimplePipeline
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend

## Default initialization still works as before:
# doc_converter = DocumentConverter() 

## Custom options are now defined per format. 
doc_converter = (
    DocumentConverter(  # all of the below is optional, has internal defaults.
        allowed_formats=[
            InputFormat.PDF,
            InputFormat.IMAGE,
            InputFormat.DOCX,
            InputFormat.HTML,
            InputFormat.PPTX,
        ],  # whitelist formats, non-matching files are ignored.
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=StandardPdfPipeline, backend=PyPdfiumDocumentBackend
            ),
            InputFormat.DOCX: WordFormatOption(
                pipeline_cls=SimplePipeline  # , backend=MsWordDocumentBackend
            ),
        },
    )
)

Note: If you use only the defaults, everything works the same as in Docling v1.
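
The defaulting rule described above can be pictured with a toy model in plain Python. This is not Docling's implementation: `Fmt`, `DEFAULT_OPTIONS`, `resolve_options`, and the string option values are hypothetical names invented for illustration. The sketch only mirrors the stated behaviour: with no allow-list every supported format is allowed, and an allowed format without an explicit entry falls back to its default.

```python
from enum import Enum


class Fmt(Enum):  # hypothetical stand-in for docling's InputFormat
    PDF = "pdf"
    DOCX = "docx"
    HTML = "html"


# Hypothetical built-in defaults, one entry per supported format.
DEFAULT_OPTIONS = {
    Fmt.PDF: "default-pdf-options",
    Fmt.DOCX: "default-docx-options",
    Fmt.HTML: "default-html-options",
}


def resolve_options(allowed_formats=None, format_options=None):
    """Return the effective options for every allowed format."""
    # No allow-list given: every supported format is allowed.
    allowed = list(Fmt) if allowed_formats is None else allowed_formats
    overrides = format_options or {}
    # An explicit per-format option wins; otherwise use the default.
    return {fmt: overrides.get(fmt, DEFAULT_OPTIONS[fmt]) for fmt in allowed}


effective = resolve_options(
    allowed_formats=[Fmt.PDF, Fmt.DOCX],
    format_options={Fmt.PDF: "custom-pdf-options"},
)
print(effective[Fmt.PDF])   # custom option supplied by the caller
print(effective[Fmt.DOCX])  # falls back to the default
print(Fmt.HTML in effective)  # HTML was not in the allow-list
```

In this model, as in the description above, leaving out both arguments yields defaults for every supported format.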

Converting documents

We have simplified the way you feed input to the DocumentConverter and renamed the conversion methods for better semantics. You can now call the conversion directly with a single file, a list of input files, or DocumentStream objects, without constructing a DocumentConversionInput object first.

  • DocumentConverter.convert now converts a single file input (previously DocumentConverter.convert_single).
  • DocumentConverter.convert_all now converts many files at once (previously DocumentConverter.convert).
...
## Convert a single file (from URL or local path)
conv_result = doc_converter.convert("https://arxiv.org/pdf/2408.09869") # previously `convert_single`

## Convert several files at once:

input_files = [
    "tests/data/wiki_duck.html",
    "tests/data/word_sample.docx",
    "tests/data/lorem_ipsum.docx",
    "tests/data/powerpoint_sample.pptx",
    "tests/data/2305.03393v1-pg9-img.png",
    "tests/data/2206.01062.pdf",
]

conv_results_iter = doc_converter.convert_all(input_files) # previously `convert`

Through the raises_on_error argument, you can also control whether the conversion raises an exception as soon as it encounters a problem, or resiliently converts all files and reflects errors in each file's conversion status. By default, any error is raised immediately and the conversion aborts (previously, exceptions were swallowed).

...
conv_results_iter = doc_converter.convert_all(input_files, raises_on_error=False) # previously `convert`
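
The two error-handling modes can be pictured with a small stand-in, written in plain Python so it runs without Docling. Everything in it is hypothetical: `ToyResult`, `toy_convert_one`, `toy_convert_all`, the `"success"`/`"failure"` strings, and the rule that `.txt` inputs fail are all invented for illustration. It only mirrors the behaviour described above, and the real result objects and status values may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class ToyResult:  # hypothetical stand-in for a conversion result
    source: str
    status: str  # "success" or "failure"


def toy_convert_one(source: str) -> ToyResult:
    # Pretend that only .txt inputs cannot be converted.
    if source.endswith(".txt"):
        raise ValueError(f"unsupported input: {source}")
    return ToyResult(source, "success")


def toy_convert_all(
    sources: Iterable[str], raises_on_error: bool = True
) -> Iterator[ToyResult]:
    for source in sources:
        try:
            yield toy_convert_one(source)
        except ValueError:
            if raises_on_error:
                raise  # fail fast: abort on the first problem
            # Resilient mode: record the failure and keep going.
            yield ToyResult(source, "failure")


files = ["a.pdf", "notes.txt", "b.docx"]

# Resilient mode converts everything and reports a per-file status.
results = list(toy_convert_all(files, raises_on_error=False))
failed = [r.source for r in results if r.status == "failure"]
print(failed)

# Default mode stops at the first error, so "b.docx" is never reached.
converted = []
try:
    for r in toy_convert_all(files):
        converted.append(r.source)
except ValueError as exc:
    print(f"aborted: {exc}")
```

The practical consequence is that with `raises_on_error=False` the caller has to inspect each result's status, since a failed file no longer interrupts the loop.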

Exporting documents into JSON, Markdown, Doctags

We have simplified how you can access and export the converted document data, too.

TBD.

CLI

TBD.