Mirror of https://github.com/DS4SD/docling.git (synced 2025-12-08 20:58:11 +00:00)
docs: document Docling JSON parsing (#819)
* docs: document Docling JSON parsing

  Also:
  - factored out and expanded supported formats
  - reorged feature list

  Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* update feature list, minor fixes

  Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
@@ -14,20 +14,21 @@
[](https://opensource.org/licenses/MIT)
[](https://pepy.tech/projects/docling)
Docling parses documents and exports them to the desired format with ease and speed.
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
## Features
* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
* 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
* 🧩 Unified, expressive [DoclingDocument](./concepts/docling_document.md) representation format
* 🤖 Plug-and-play [integrations](https://ds4sd.github.io/docling/integrations/) incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 OCR support for scanned PDFs
* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
* 🔒 Local execution capabilities for sensitive data and air-gapped environments
* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 Extensive OCR support for scanned PDFs and images
* 💻 Simple and convenient CLI
### Coming soon
* ♾️ Equation & code extraction
* 📝 Metadata extraction, including title, authors, references & language
## Get started
@@ -42,3 +43,7 @@ Docling parses documents and exports them to the desired format with ease and sp
## IBM ❤️ Open Source AI
Docling has been brought to you by IBM.
[supported_formats]: ./supported_formats.md
[docling_document]: ./concepts/docling_document.md
[integrations]: ./integrations/index.md