mirror of
https://github.com/DS4SD/docling.git
synced 2025-08-02 15:32:30 +00:00
docs: document Docling JSON parsing
Also:
- factored out and expanded supported formats
- reorged feature list

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
parent b885b2fa3c
commit 68272b987a
16
README.md
@@ -26,11 +26,13 @@ Docling parses documents and exports them to the desired format with ease and sp
 
 ## Features
 
-* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
-* 📑 Advanced PDF document understanding including page layout, reading order & table structures
-* 🧩 Unified, expressive [DoclingDocument](https://ds4sd.github.io/docling/concepts/docling_document/) representation format
-* 🤖 Plug-and-play [integrations](https://ds4sd.github.io/docling/integrations/) incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
-* 🔍 OCR support for scanned PDFs
+* 🗂️ Parsing of [multiple documents formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, & more
+* 📑 Advanced PDF understanding including page layout, reading order & table structure
+* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
+* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
+* 🔒 Local execution capabilities for sensitive data and air-gapped environments
+* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
+* 🔍 OCR support for scanned PDFs and images
 * 💻 Simple and convenient CLI
 
 Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty examples and unlock the full power of Docling!
@@ -120,3 +122,7 @@ For individual model usage, please refer to the model licenses found in the orig
 ## IBM ❤️ Open Source AI
 
 Docling has been brought to you by IBM.
+
+[supported_formats]: https://ds4sd.github.io/docling/supported_formats/
+[docling_document]: https://ds4sd.github.io/docling/concepts/docling_document/
+[integrations]: https://ds4sd.github.io/docling/integrations/
@@ -18,11 +18,13 @@ Docling parses documents and exports them to the desired format with ease and sp
 
 ## Features
 
-* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
-* 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
-* 🧩 Unified, expressive [DoclingDocument](./concepts/docling_document.md) representation format
-* 🤖 Plug-and-play [integrations](https://ds4sd.github.io/docling/integrations/) incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
-* 🔍 OCR support for scanned PDFs
+* 🗂️ Parsing of [multiple documents formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, & more
+* 📑 Advanced PDF understanding including page layout, reading order & table structure
+* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
+* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
+* 🔒 Local execution capabilities for sensitive data and air-gapped environments
+* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
+* 🔍 OCR support for scanned PDFs and images
 * 💻 Simple and convenient CLI
 
 ### Coming soon
@@ -42,3 +44,7 @@ Docling parses documents and exports them to the desired format with ease and sp
 ## IBM ❤️ Open Source AI
 
 Docling has been brought to you by IBM.
+
+[supported_formats]: ./supported_formats.md
+[docling_document]: ./concepts/docling_document.md
+[integrations]: ./integrations/index.md
33
docs/supported_formats.md
Normal file
@@ -0,0 +1,33 @@
+Docling can parse various documents formats into a unified representation (Docling
+document), which it can export to different formats too — check out
+[Architecture](./concepts/architecture.md) for more details.
+
+Below you can find a listing of all supported input and output formats.
+
+## Supported input formats
+
+| Format | Description |
+|--------|-------------|
+| PDF | |
+| DOCX, XLSX, PPTX | Default formats in MS Office 2007+, based on Office Open XML |
+| Markdown | |
+| AsciiDoc | |
+| HTML, XHTML | |
+| PNG, JPEG, TIFF, BMP | Image formats |
+
+Schema-specific support:
+
+| Format | Description |
+|--------|-------------|
+| USPTO XML | XML format followed by [USPTO](https://www.uspto.gov/patents) patents |
+| PMC XML | XML format followed by [PubMed Central®](https://pmc.ncbi.nlm.nih.gov/) articles |
+| Docling JSON | JSON-serialized [Docling Document](./concepts/docling_document.md) |
+
+## Supported output formats
+
+| Format | Description |
+|--------|-------------|
+| HTML | Docling supports both image embedding and referencing |
+| Markdown | |
+| JSON | Lossless serialization of Docling Document |
+| Doctags | |
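The input-format table added in docs/supported_formats.md can be read as a lookup from file type to format. Below is a minimal sketch of that lookup, assuming the usual file extensions for each listed format. The names `EXTENSION_TO_FORMAT` and `guess_input_format` are hypothetical, the extension list is an assumption, and none of this is Docling's own detection logic. The schema-specific formats (USPTO XML, PMC XML, Docling JSON) are left out on purpose, since an extension alone does not identify the schema.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping, transcribed from the "Supported input formats" table.
# The extensions themselves are assumptions, not taken from the commit.
EXTENSION_TO_FORMAT = {
    ".pdf": "PDF",
    ".docx": "DOCX",
    ".xlsx": "XLSX",
    ".pptx": "PPTX",
    ".md": "Markdown",
    ".adoc": "AsciiDoc",
    ".asciidoc": "AsciiDoc",
    ".html": "HTML",
    ".htm": "HTML",
    ".xhtml": "XHTML",
    ".png": "PNG",
    ".jpg": "JPEG",
    ".jpeg": "JPEG",
    ".tif": "TIFF",
    ".tiff": "TIFF",
    ".bmp": "BMP",
}


def guess_input_format(path: str) -> Optional[str]:
    """Return the table entry matching the file extension, or None.

    Files such as .xml or .json return None here because the table lists
    them as schema-specific: their content has to be inspected.
    """
    return EXTENSION_TO_FORMAT.get(Path(path).suffix.lower())


if __name__ == "__main__":
    for name in ["report.PDF", "slides.pptx", "scan.tiff", "patent.xml"]:
        print(name, "->", guess_input_format(name))
```

Matching is case-insensitive on the extension, and unknown or schema-specific extensions return `None` rather than a guess.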
@@ -24,20 +24,6 @@ docling https://arxiv.org/pdf/2206.01062
 
 To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](./reference/cli.md).
 
-### Supported formats
-
-The document conversion in Docling supports several popular formats, including:
-
-- **PDF** (Portable Document Format): the format developed by Adobe to present documents compatible across application software, hardware, and operating systems.
-- **.docx**, **.xlsx**, **.pptx** (Word, Excel, and PowerPoint): the Open XML formats suppored by Microsof Office.
-- **Markdown**: a lightweight markup language to add formatting elements to plain text documents.
-- **AsciiDoc**: a plain text markup language for writing technical content.
-- **HTML** (Hypertext Markup Language): the standard markup language for creating web pages.
-- **XHTML** (Extensible Hypertext Markup Language): the XML-based version of HTML.
-- **XML** (Extensible Markup Language): a markup format for storing and transmitting data. Due to its flexibility, Docling requires custom implementations to identify the
-semantics of the data. Currently, Docling supports the parsing of [USPTO](https://www.uspto.gov/patents) patents and [PubMed Central® (PMC)](https://pmc.ncbi.nlm.nih.gov/) articles.
-
-
 ### Advanced options
 
 #### Adjust pipeline features
@@ -56,6 +56,7 @@ nav:
 - "Docling": index.md
 - Installation: installation.md
 - Usage: usage.md
+- Supported formats: supported_formats.md
 - FAQ: faq.md
 - Docling v2: v2.md
 - Concepts: