mirror of https://github.com/DS4SD/docling.git (synced 2025-08-02 07:22:14 +00:00)
update feature list, minor fixes
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
parent 68272b987a
commit e7930b547c
11
README.md
@@ -22,24 +22,21 @@
 [](https://opensource.org/licenses/MIT)
 [](https://pepy.tech/projects/docling)
 
-Docling parses documents and exports them to the desired format with ease and speed.
+Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
 
 ## Features
 
-* 🗂️ Parsing of [multiple documents formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, & more
-* 📑 Advanced PDF understanding including page layout, reading order & table structure
+* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
+* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
 * 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
 * ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
 * 🔒 Local execution capabilities for sensitive data and air-gapped environments
 * 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
-* 🔍 OCR support for scanned PDFs and images
+* 🔍 Extensive OCR support for scanned PDFs and images
 * 💻 Simple and convenient CLI
 
-Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty examples and unlock the full power of Docling!
-
 ### Coming soon
 
-* ♾️ Equation & code extraction
 * 📝 Metadata extraction, including title, authors, references & language
 
 ## Installation
@@ -14,22 +14,21 @@
 [](https://opensource.org/licenses/MIT)
 [](https://pepy.tech/projects/docling)
 
-Docling parses documents and exports them to the desired format with ease and speed.
+Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
 
 ## Features
 
-* 🗂️ Parsing of [multiple documents formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, & more
-* 📑 Advanced PDF understanding including page layout, reading order & table structure
+* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
+* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
 * 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
 * ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
 * 🔒 Local execution capabilities for sensitive data and air-gapped environments
 * 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
-* 🔍 OCR support for scanned PDFs and images
+* 🔍 Extensive OCR support for scanned PDFs and images
 * 💻 Simple and convenient CLI
 
 ### Coming soon
 
-* ♾️ Equation & code extraction
 * 📝 Metadata extraction, including title, authors, references & language
 
 ## Get started
@@ -1,5 +1,5 @@
 Docling can parse various documents formats into a unified representation (Docling
-document), which it can export to different formats too — check out
+Document), which it can export to different formats too — check out
 [Architecture](./concepts/architecture.md) for more details.
 
 Below you can find a listing of all supported input and output formats.
@@ -27,7 +27,8 @@ Schema-specific support:
 
 | Format | Description |
 |--------|-------------|
-| HTML | Docling supports both image embedding and referencing |
+| HTML | Both image embedding and referencing are supported |
 | Markdown | |
 | JSON | Lossless serialization of Docling Document |
+| Text | Plain text, i.e. without Markdown markers |
 | Doctags | |
@@ -128,7 +128,14 @@ You can limit the CPU threads used by Docling by setting the environment variabl
 
 #### Use specific backend converters
 
-By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](#supported-formats)).
+!!! note
+
+    This section discusses directly invoking a [backend](./concepts/architecture.md),
+    i.e. using a low-level API. This should only be done when necessary. For most cases,
+    using a `DocumentConverter` (high-level API) as discussed in the sections above
+    should suffice — and is the recommended way.
+
+By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](./supported_formats.md)).
 You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](./examples/run_with_formats.py) example.
 Alternatively, you can also use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages:
 
@@ -148,8 +155,8 @@ in_doc = InputDocument(
     filename="duck.html",
 )
 backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text))
-result = backend.convert()
-print(result.export_to_markdown())
+dl_doc = backend.convert()
+print(dl_doc.export_to_markdown())
 ```
 
 ## Chunking