diff --git a/README.md b/README.md
index 9e20e86e..099a09e6 100644
--- a/README.md
+++ b/README.md
@@ -7,6 +7,7 @@
# Docling
[](https://arxiv.org/abs/2408.09869)
+[](https://ds4sd.github.io/docling/)
[](https://pypi.org/project/docling/)

[](https://python-poetry.org/)
@@ -16,15 +17,19 @@
[](https://github.com/pre-commit/pre-commit)
[](https://opensource.org/licenses/MIT)
-Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
+Docling parses documents and exports them to the desired format with ease and speed.
## Features
-* ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
-* 📑 Understands detailed page layout, reading order and recovers table structures
-* 📝 Extracts metadata from the document, such as title, authors, references and language
-* 🔍 Includes OCR support for scanned PDFs
-* 🤖 Integrates easily with LLM app / RAG frameworks like 🦙 LlamaIndex and 🦜🔗 LangChain
-* 💻 Provides a simple and convenient CLI
+
+* 🗂️ Multi-format support for input (PDF, DOCX etc.) & output (Markdown, JSON etc.)
+* 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
+* 📝 Metadata extraction, including title, authors, references & language
+* 🤖 Seamless LlamaIndex 🦙 & LangChain 🦜🔗 integration for powerful RAG / QA applications
+* 🔍 OCR support for scanned PDFs
+* 💻 Simple and convenient CLI
+
+Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty of examples and unlock the full power of Docling!
+
## Installation
@@ -35,271 +40,30 @@ pip install docling
Works on macOS, Linux and Windows environments. Both x86_64 and arm64 architectures.
-
- Alternative PyTorch distributions
-
- The Docling models depend on the [PyTorch](https://pytorch.org/) library.
- Depending on your architecture, you might want to use a different distribution of `torch`.
- For example, you might want support for different accelerator or for a cpu-only version.
- All the different ways for installing `torch` are listed on their website .
-
- One common situation is the installation on Linux systems with cpu-only support.
- In this case, we suggest the installation of Docling with the following options
-
- ```bash
- # Example for installing on the Linux cpu-only version
- pip install docling --extra-index-url https://download.pytorch.org/whl/cpu
- ```
-
-
-
- Alternative OCR engines
-
- Docling supports multiple OCR engines for processing scanned documents. The current version provides
- the following engines.
-
- | Engine | Installation | Usage |
- | ------ | ------------ | ----- |
- | [EasyOCR](https://github.com/JaidedAI/EasyOCR) | Default in Docling or via `pip install easyocr`. | `EasyOcrOptions` |
- | Tesseract | System dependency. See description for Tesseract and Tesserocr below. | `TesseractOcrOptions` |
- | Tesseract CLI | System dependency. See description below. | `TesseractCliOcrOptions` |
-
- The Docling `DocumentConverter` allows to choose the OCR engine with the `ocr_options` settings. For example
-
- ```python
- from docling.datamodel.base_models import ConversionStatus, PipelineOptions
- from docling.datamodel.pipeline_options import PipelineOptions, EasyOcrOptions, TesseractOcrOptions
- from docling.document_converter import DocumentConverter
-
- pipeline_options = PipelineOptions()
- pipeline_options.do_ocr = True
- pipeline_options.ocr_options = TesseractOcrOptions() # Use Tesseract
-
- doc_converter = DocumentConverter(
- pipeline_options=pipeline_options,
- )
- ```
-
- #### Tesseract installation
-
- [Tesseract](https://github.com/tesseract-ocr/tesseract) is a popular OCR engine which is available
- on most operating systems. For using this engine with Docling, Tesseract must be installed on your
- system, using the packaging tool of your choice. Below we provide example commands.
- After installing Tesseract you are expected to provide the path to its language files using the
- `TESSDATA_PREFIX` environment variable (note that it must terminate with a slash `/`).
-
- For macOS, we reccomend using [Homebrew](https://brew.sh/).
-
- ```console
- brew install tesseract leptonica pkg-config
- TESSDATA_PREFIX=/opt/homebrew/share/tessdata/
- echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}"
- ```
-
- For Debian-based systems.
-
- ```console
- apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config
- TESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$)
- echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}"
- ```
-
- For RHEL systems.
-
- ```console
- dnf install tesseract tesseract-devel tesseract-langpack-eng leptonica-devel
- TESSDATA_PREFIX=/usr/share/tesseract/tessdata/
- echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}"
- ```
-
- #### Linking to Tesseract
- The most efficient usage of the Tesseract library is via linking. Docling is using
- the [Tesserocr](https://github.com/sirfz/tesserocr) package for this.
-
- If you get into installation issues of Tesserocr, we suggest using the following
- installation options:
-
- ```console
- pip uninstall tesserocr
- pip install --no-binary :all: tesserocr
- ```
-
-
-
- Docling development setup
-
- To develop for Docling (features, bugfixes etc.), install as follows from your local clone's root dir:
- ```bash
- poetry install --all-extras
- ```
-
+More [detailed installation instructions](https://ds4sd.github.io/docling/installation/) are available in the docs.
## Getting started
-### Convert a single document
-
-To convert invidual PDF documents, use `convert_single()`, for example:
+To convert individual documents, use `convert()`, for example:
```python
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL
converter = DocumentConverter()
-result = converter.convert_single(source)
+result = converter.convert(source)
print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]"
print(result.document.export_to_document_tokens()) # output: "..."
```
-### Convert a batch of documents
-For an example of batch-converting documents, see [batch_convert.py](https://github.com/DS4SD/docling/blob/main/examples/batch_convert.py).
-
-From a local repo clone, you can run it with:
-
-```
-python examples/batch_convert.py
-```
-The output of the above command will be written to `./scratch`.
-
-### CLI
-
-You can also use Docling directly from your command line to convert individual files —be it local or by URL— or whole directories.
-
-A simple example would look like this:
-```console
-docling https://arxiv.org/pdf/2206.01062
-```
-
-To see all available options (export formats etc.) run `docling --help`.
-
-
- CLI reference
-
- Here are the available options as of this writing (for an up-to-date listing, run `docling --help`):
-
- ```console
- $ docling --help
-
- Usage: docling [OPTIONS] source
-
- ╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
- │ * input_sources source PDF files to convert. Can be local file / directory paths or URL. [default: None] [required] │
- ╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
- ╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
- │ --json --no-json If enabled the document is exported as JSON. [default: no-json] │
- │ --md --no-md If enabled the document is exported as Markdown. [default: md] │
- │ --txt --no-txt If enabled the document is exported as Text. [default: no-txt] │
- │ --doctags --no-doctags If enabled the document is exported as Doc Tags. [default: no-doctags] │
- │ --ocr --no-ocr If enabled, the bitmap content will be processed using OCR. [default: ocr] │
- │ --backend [pypdfium2|docling] The PDF backend to use. [default: docling] │
- │ --output PATH Output directory where results are saved. [default: .] │
- │ --version Show version information. │
- │ --help Show this message and exit. │
- ╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
- ```
-
-
-### RAG
-Check out the following examples showcasing RAG using Docling with standard LLM application frameworks:
-- [Basic RAG pipeline with LlamaIndex 🦙](https://github.com/DS4SD/docling/tree/main/docs/examples/rag_llamaindex.ipynb)
-- [Basic RAG pipeline with LangChain 🦜🔗](https://github.com/DS4SD/docling/tree/main/docs/examples/rag_langchain.ipynb)
-
-## Advanced features
-
-### Adjust pipeline features
-
-The example file [custom_convert.py](https://github.com/DS4SD/docling/blob/main/examples/custom_convert.py) contains multiple ways
-one can adjust the conversion pipeline and features.
+Check out [Getting started](https://ds4sd.github.io/docling/).
+There you will find many options for tuning Docling's advanced capabilities.
-#### Control pipeline options
+## Get help and support
-You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
-```python
-doc_converter = DocumentConverter(
- artifacts_path=artifacts_path,
- pipeline_options=PipelineOptions(
- do_table_structure=False, # controls if table structure is recovered
- do_ocr=True, # controls if OCR is applied (ignores programmatic content)
- ),
-)
-```
-
-#### Control table extraction options
-
-You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself.
-This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
-
-
-```python
-from docling.datamodel.pipeline_options import PipelineOptions
-
-pipeline_options = PipelineOptions(do_table_structure=True)
-pipeline_options.table_structure_options.do_cell_matching = False # uses text cells predicted from table structure model
-
-doc_converter = DocumentConverter(
- artifacts_path=artifacts_path,
- pipeline_options=pipeline_options,
-)
-```
-
-Since docling 1.16.0: You can control which TableFormer mode you want to use. Choose between `TableFormerMode.FAST` (default) and `TableFormerMode.ACCURATE` (better, but slower) to receive better quality with difficult table structures.
-
-```python
-from docling.datamodel.pipeline_options import PipelineOptions, TableFormerMode
-
-pipeline_options = PipelineOptions(do_table_structure=True)
-pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE # use more accurate TableFormer model
-
-doc_converter = DocumentConverter(
- artifacts_path=artifacts_path,
- pipeline_options=pipeline_options,
-)
-```
-
-### Impose limits on the document size
-
-You can limit the file size and number of pages which should be allowed to process per document:
-```python
-conv_input = DocumentConversionInput.from_paths(
- paths=[Path("./test/data/2206.01062.pdf")],
- limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
-)
-```
-
-### Convert from binary PDF streams
-
-You can convert PDFs from a binary stream instead of from the filesystem as follows:
-
-```python
-buf = BytesIO(your_binary_stream)
-docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
-conv_input = DocumentConversionInput.from_streams(docs)
-results = doc_converter.convert_batch(conv_input)
-```
-### Limit resource usage
-
-You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
-
-### Chunking
-
-You can perform a hierarchy-aware chunking of a Docling document as follows:
-
-```python
-from docling.document_converter import DocumentConverter
-from docling_core.transforms.chunker import HierarchicalChunker
-
-doc = DocumentConverter().convert_single("https://arxiv.org/pdf/2206.01062").legacy_document
-chunks = list(HierarchicalChunker().chunk(doc))
-print(chunks[0])
-# ChunkWithMetadata(
-# path='#/main-text/1',
-# text='DocLayNet: A Large Human-Annotated Dataset [...]',
-# page=1,
-# bbox=[107.30, 672.38, 505.19, 709.08],
-# [...]
-# )
-```
+Please feel free to connect with us using the [discussion section](https://github.com/DS4SD/docling/discussions).
## Technical report
diff --git a/docs/concepts/docling_format.md b/docs/concepts/docling_format.md
index ab7933b0..0e84e44f 100644
--- a/docs/concepts/docling_format.md
+++ b/docs/concepts/docling_format.md
@@ -1,5 +1,6 @@
With Docling v2, we introduce a unified document representation format called `DoclingDocument`. It is defined as a
pydantic datatype, which can express several features common to documents, such as:
+
* Text, Tables, Pictures, and more
* Document hierarchy with sections and groups
* Disambiguation between main body and headers, footers (furniture)
diff --git a/docs/index.md b/docs/index.md
index bc96c9a7..146a5d49 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -17,13 +17,13 @@
[](https://github.com/pre-commit/pre-commit)
[](https://opensource.org/licenses/MIT)
-Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
+Docling parses documents and exports them to the desired format with ease and speed.
## Features
-* ⚡ Converts PDF, Word, Powerpoint or HTML documents to JSON or Markdown format, stable and lightning fast
-* 📑 Understands detailed page layout, reading order and recovers table structures
-* 📝 Extracts metadata from the document, such as title, authors, references and language
-* 🔍 Includes OCR support for scanned PDFs or image formats
-* 🤖 Integrates easily with LLM app / RAG frameworks like LlamaIndex 🦙 & LangChain 🦜🔗
-* 💻 Provides a simple and convenient CLI
+* 🗂️ Multi-format support for input (PDF, DOCX etc.) & output (Markdown, JSON etc.)
+* 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
+* 📝 Metadata extraction, including title, authors, references & language
+* 🤖 Seamless LlamaIndex 🦙 & LangChain 🦜🔗 integration for powerful RAG / QA applications
+* 🔍 OCR support for scanned PDFs
+* 💻 Simple and convenient CLI
diff --git a/docs/v2.md b/docs/v2.md
index 6c11f6bc..0e513ecc 100644
--- a/docs/v2.md
+++ b/docs/v2.md
@@ -1,6 +1,7 @@
## What's new
Docling v2 introduces several new features:
+
- Understands and converts PDF, MS Word, MS Powerpoint, HTML and several image formats
- Produces a new, universal document representation which can encapsulate document hierarchy
- Comes with a fresh new API and CLI
@@ -29,6 +30,7 @@ docling ./input/dir --output ./scratch --abort-on-error
```
**Notable changes from Docling v1:**
+
- The standalone switches for different export formats are removed, and replaced with `--from` and `--to` arguments, to define input and output formats respectively.
- The new `--abort-on-error` will abort any batch conversion as soon as an error is encountered
- The `--backend` option for PDFs was removed
@@ -168,6 +170,7 @@ conv_result.legacy_document # provides the representation in previous ExportedCC
## Export into JSON, Markdown, Doctags
**Note**: All `render_...` methods in `ConversionResult` have been removed in Docling v2,
and are now available on `DoclingDocument` as:
+
- `DoclingDocument.export_to_dict`
- `DoclingDocument.export_to_markdown`
- `DoclingDocument.export_to_document_tokens`
diff --git a/mkdocs.yml b/mkdocs.yml
index 3835e250..5fd180a4 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -54,6 +54,7 @@ nav:
- Get started:
- Home: index.md
- Installation: installation.md
+ - Use Docling: use_docling.md
- Docling v2: v2.md
- Concepts:
- The Docling Document format: concepts/docling_format.md