update feature list, minor fixes

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Panos Vagenas 2025-01-28 12:39:13 +01:00
parent 68272b987a
commit e7930b547c
4 changed files with 21 additions and 17 deletions

@@ -22,24 +22,21 @@
[![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT) [![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
[![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling) [![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
Docling parses documents and exports them to the desired format with ease and speed. Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
## Features ## Features
* 🗂️ Parsing of [multiple documents formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, & more * 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
* 📑 Advanced PDF understanding including page layout, reading order & table structure * 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format * 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON * ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
* 🔒 Local execution capabilities for sensitive data and air-gapped environments * 🔒 Local execution capabilities for sensitive data and air-gapped environments
* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI * 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 OCR support for scanned PDFs and images * 🔍 Extensive OCR support for scanned PDFs and images
* 💻 Simple and convenient CLI * 💻 Simple and convenient CLI
Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty examples and unlock the full power of Docling!
### Coming soon ### Coming soon
* ♾️ Equation & code extraction
* 📝 Metadata extraction, including title, authors, references & language * 📝 Metadata extraction, including title, authors, references & language
## Installation ## Installation

@@ -14,22 +14,21 @@
[![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT) [![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
[![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling) [![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
Docling parses documents and exports them to the desired format with ease and speed. Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
## Features ## Features
* 🗂️ Parsing of [multiple documents formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, & more * 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
* 📑 Advanced PDF understanding including page layout, reading order & table structure * 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format * 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON * ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
* 🔒 Local execution capabilities for sensitive data and air-gapped environments * 🔒 Local execution capabilities for sensitive data and air-gapped environments
* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI * 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 OCR support for scanned PDFs and images * 🔍 Extensive OCR support for scanned PDFs and images
* 💻 Simple and convenient CLI * 💻 Simple and convenient CLI
### Coming soon ### Coming soon
* ♾️ Equation & code extraction
* 📝 Metadata extraction, including title, authors, references & language * 📝 Metadata extraction, including title, authors, references & language
## Get started ## Get started

@@ -1,5 +1,5 @@
Docling can parse various documents formats into a unified representation (Docling Docling can parse various documents formats into a unified representation (Docling
document), which it can export to different formats too — check out Document), which it can export to different formats too — check out
[Architecture](./concepts/architecture.md) for more details. [Architecture](./concepts/architecture.md) for more details.
Below you can find a listing of all supported input and output formats. Below you can find a listing of all supported input and output formats.
@@ -27,7 +27,8 @@ Schema-specific support:
| Format | Description | | Format | Description |
|--------|-------------| |--------|-------------|
| HTML | Docling supports both image embedding and referencing | | HTML | Both image embedding and referencing are supported |
| Markdown | | | Markdown | |
| JSON | Lossless serialization of Docling Document | | JSON | Lossless serialization of Docling Document |
| Text | Plain text, i.e. without Markdown markers |
| Doctags | | | Doctags | |

@@ -128,7 +128,14 @@ You can limit the CPU threads used by Docling by setting the environment variabl
#### Use specific backend converters #### Use specific backend converters
By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](#supported-formats)). !!! note
This section discusses directly invoking a [backend](./concepts/architecture.md),
i.e. using a low-level API. This should only be done when necessary. For most cases,
using a `DocumentConverter` (high-level API) as discussed in the sections above
should suffice  and is the recommended way.
By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](./supported_formats.md)).
You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](./examples/run_with_formats.py) example. You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](./examples/run_with_formats.py) example.
Alternatively, you can also use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages: Alternatively, you can also use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages:
@@ -148,8 +155,8 @@ in_doc = InputDocument(
filename="duck.html", filename="duck.html",
) )
backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text)) backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text))
result = backend.convert() dl_doc = backend.convert()
print(result.export_to_markdown()) print(dl_doc.export_to_markdown())
``` ```
## Chunking ## Chunking