diff --git a/README.md b/README.md
index dbbbba45..8050365f 100644
--- a/README.md
+++ b/README.md
@@ -22,24 +22,21 @@
 [![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
 [![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
 
-Docling parses documents and exports them to the desired format with ease and speed.
+Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
 
 ## Features
 
-* 🗂️ Parsing of [multiple documents formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, & more
-* 📑 Advanced PDF understanding including page layout, reading order & table structure
+* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
+* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
 * 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
 * ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
 * 🔒 Local execution capabilities for sensitive data and air-gapped environments
 * 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
-* 🔍 OCR support for scanned PDFs and images
+* 🔍 Extensive OCR support for scanned PDFs and images
 * 💻 Simple and convenient CLI
 
-Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty examples and unlock the full power of Docling!
-
 ### Coming soon
 
-* ♾️ Equation & code extraction
 * 📝 Metadata extraction, including title, authors, references & language
 
 ## Installation
diff --git a/docs/index.md b/docs/index.md
index 9f782dc1..f44e6dba 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -14,22 +14,21 @@
 [![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
 [![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
 
-Docling parses documents and exports them to the desired format with ease and speed.
+Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
 
 ## Features
 
-* 🗂️ Parsing of [multiple documents formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, & more
-* 📑 Advanced PDF understanding including page layout, reading order & table structure
+* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
+* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
 * 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
 * ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
 * 🔒 Local execution capabilities for sensitive data and air-gapped environments
 * 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
-* 🔍 OCR support for scanned PDFs and images
+* 🔍 Extensive OCR support for scanned PDFs and images
 * 💻 Simple and convenient CLI
 
 ### Coming soon
 
-* ♾️ Equation & code extraction
 * 📝 Metadata extraction, including title, authors, references & language
 
 ## Get started
diff --git a/docs/supported_formats.md b/docs/supported_formats.md
index d3e66b05..e217bb19 100644
--- a/docs/supported_formats.md
+++ b/docs/supported_formats.md
@@ -1,5 +1,5 @@
 Docling can parse various documents formats into a unified representation (Docling
-document), which it can export to different formats too — check out
+Document), which it can export to different formats too — check out
 [Architecture](./concepts/architecture.md) for more details.
 
 Below you can find a listing of all supported input and output formats.
@@ -27,7 +27,8 @@ Schema-specific support:
 
 | Format | Description |
 |--------|-------------|
-| HTML | Docling supports both image embedding and referencing |
+| HTML | Both image embedding and referencing are supported |
 | Markdown | |
 | JSON | Lossless serialization of Docling Document |
+| Text | Plain text, i.e. without Markdown markers |
 | Doctags | |
diff --git a/docs/usage.md b/docs/usage.md
index 490f7cc5..a577a3e3 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -128,7 +128,14 @@ You can limit the CPU threads used by Docling by setting the environment variabl
 
 #### Use specific backend converters
 
-By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](#supported-formats)).
+!!! note
+
+    This section discusses directly invoking a [backend](./concepts/architecture.md),
+    i.e. using a low-level API. This should only be done when necessary. For most cases,
+    using a `DocumentConverter` (high-level API) as discussed in the sections above
+    should suffice — and is the recommended way.
+
+By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](./supported_formats.md)).
 You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](./examples/run_with_formats.py) example.
 Alternatively, you can also use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages:
 
@@ -148,8 +155,8 @@ in_doc = InputDocument(
     filename="duck.html",
 )
 backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text))
-result = backend.convert()
-print(result.export_to_markdown())
+dl_doc = backend.convert()
+print(dl_doc.export_to_markdown())
 ```
 
 ## Chunking