What's new
Docling v2 introduces several new features:
- Understands and converts PDF, MS Word, MS PowerPoint, HTML, and several image formats
- Produces a new, universal document representation which can encapsulate document hierarchy
- Comes with a fresh new API and CLI
Migration from v1
Setting up a DocumentConverter
To accommodate many input formats, we changed how you set up your DocumentConverter object.
You can now define a list of allowed formats when initializing the DocumentConverter, and specify custom options per format if desired. By default, all supported formats are allowed. If you don't provide format_options, defaults will be used for all allowed_formats.
Format options can include the pipeline class to use, the options to provide to the pipeline, and the document backend.
They are provided as format-specific types, such as PdfFormatOption or WordFormatOption, as seen below.
from docling.datamodel.base_models import InputFormat
from docling.document_converter import (
    DocumentConverter,
    PdfFormatOption,
    WordFormatOption,
)
from docling.pipeline.simple_pipeline import SimplePipeline
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
## Default initialization still works as before:
# doc_converter = DocumentConverter()
## Custom options are now defined per format.
doc_converter = (
    DocumentConverter(  # all of the below is optional, has internal defaults.
        allowed_formats=[
            InputFormat.PDF,
            InputFormat.IMAGE,
            InputFormat.DOCX,
            InputFormat.HTML,
            InputFormat.PPTX,
        ],  # whitelist formats, non-matching files are ignored.
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=StandardPdfPipeline, backend=PyPdfiumDocumentBackend
            ),
            InputFormat.DOCX: WordFormatOption(
                pipeline_cls=SimplePipeline  # , backend=MsWordDocumentBackend
            ),
        },
    )
)
Note: If you work only with defaults, everything remains the same as in Docling v1.
Converting documents
We have simplified the way you can feed input to the DocumentConverter and renamed the conversion methods for better semantics. You can now call the conversion directly with a single file, a list of input files, or DocumentStream objects, without constructing a DocumentConversionInput object first.
- DocumentConverter.convert now converts a single file input (previously DocumentConverter.convert_single).
- DocumentConverter.convert_all now converts many files at once (previously DocumentConverter.convert).
...
## Convert a single file (from URL or local path)
conv_result = doc_converter.convert("https://arxiv.org/pdf/2408.09869") # previously `convert_single`
## Convert several files at once:
input_files = [
    "tests/data/wiki_duck.html",
    "tests/data/word_sample.docx",
    "tests/data/lorem_ipsum.docx",
    "tests/data/powerpoint_sample.pptx",
    "tests/data/2305.03393v1-pg9-img.png",
    "tests/data/2206.01062.pdf",
]
conv_results_iter = doc_converter.convert_all(input_files)  # previously `convert`
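When migrating call sites that used to build a DocumentConversionInput, a thin wrapper can hide the choice between the two renamed methods. The following is only a sketch: convert_any is a hypothetical helper, not part of Docling, and it assumes nothing about the converter beyond the convert and convert_all methods described above. It treats a str or Path as a single input and anything else as a collection; DocumentStream inputs are not handled.

```python
from pathlib import Path
from typing import Any, Iterable, List, Union

Source = Union[str, Path]


def convert_any(converter: Any, sources: Union[Source, Iterable[Source]]) -> List[Any]:
    """Hypothetical helper: always return a list of conversion results."""
    if isinstance(sources, (str, Path)):
        # A single path or URL goes through `convert` (v1: `convert_single`).
        return [converter.convert(sources)]
    # Several inputs go through `convert_all`; list() consumes the
    # returned iterable so the caller gets every result at once.
    return list(converter.convert_all(sources))
```

Collecting the results into a list gives up the lazy iteration of convert_all, so for large batches it may be preferable to call convert_all directly and consume its results one by one.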
Through the raises_on_error argument, you can also control whether the conversion raises an exception on the first problem it encounters, or resiliently converts all files first and reflects errors in each file's conversion status.
By default, any error is raised immediately and the conversion aborts (previously, exceptions were swallowed).
...
conv_results_iter = doc_converter.convert_all(input_files, raises_on_error=False)  # previously `convert`
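With raises_on_error=False, failures have to be read from the results afterwards. One possible way to summarize a batch is sketched below. group_by_status is a hypothetical helper, and it assumes each result exposes a status attribute and an input.file attribute; check these names against the result objects your Docling version returns. Stand-in objects are used here so the snippet runs without Docling installed.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def group_by_status(results: Iterable[Any]) -> Dict[str, List[str]]:
    """Map each status name to the input files that ended in that status."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for res in results:
        # Enum members expose `.name`; fall back to str() for plain values.
        status = getattr(res.status, "name", str(res.status))
        grouped[status].append(str(res.input.file))
    return dict(grouped)


# Stand-in results for illustration only; real ones would come from
# doc_converter.convert_all(input_files, raises_on_error=False).
fake_results = [
    SimpleNamespace(status="SUCCESS", input=SimpleNamespace(file="a.pdf")),
    SimpleNamespace(status="FAILURE", input=SimpleNamespace(file="b.docx")),
    SimpleNamespace(status="SUCCESS", input=SimpleNamespace(file="c.html")),
]
summary = group_by_status(fake_results)
```

Grouping by status name keeps the helper independent of any particular status enum, at the cost of comparing strings rather than enum members.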
Exporting documents into JSON, Markdown, Doctags
We have simplified how you can access and export the converted document data, too.
TBD.
CLI
TBD.