mirror of
https://github.com/DS4SD/docling.git
synced 2025-07-27 04:24:45 +00:00
simplify README, move content to docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
parent
0858ed91c3
commit
b7d976149b
274
README.md
@ -7,6 +7,7 @@
|
|||||||
# Docling
|
# Docling
|
||||||
|
|
||||||
[](https://arxiv.org/abs/2408.09869)
|
[](https://arxiv.org/abs/2408.09869)
|
||||||
|
[](https://ds4sd.github.io/docling/)
|
||||||
[](https://pypi.org/project/docling/)
|
[](https://pypi.org/project/docling/)
|
||||||

|

|
||||||
[](https://python-poetry.org/)
|
[](https://python-poetry.org/)
|
||||||
@ -16,15 +17,19 @@
|
|||||||
[](https://github.com/pre-commit/pre-commit)
|
[](https://github.com/pre-commit/pre-commit)
|
||||||
[](https://opensource.org/licenses/MIT)
|
[](https://opensource.org/licenses/MIT)
|
||||||
|
|
||||||
Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
|
Docling parses documents and exports them to the desired format with ease and speed.
|
||||||
|
|
||||||
## Features
|
## Features
|
||||||
* ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
|
|
||||||
* 📑 Understands detailed page layout, reading order and recovers table structures
|
* 🗂️ Multi-format support for input (PDF, DOCX etc.) & output (Markdown, JSON etc.)
|
||||||
* 📝 Extracts metadata from the document, such as title, authors, references and language
|
* 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
|
||||||
* 🔍 Includes OCR support for scanned PDFs
|
* 📝 Metadata extraction, including title, authors, references & language
|
||||||
* 🤖 Integrates easily with LLM app / RAG frameworks like 🦙 LlamaIndex and 🦜🔗 LangChain
|
* 🤖 Seamless LlamaIndex 🦙 & LangChain 🦜🔗 integration for powerful RAG / QA applications
|
||||||
* 💻 Provides a simple and convenient CLI
|
* 🔍 OCR support for scanned PDFs
|
||||||
|
* 💻 Simple and convenient CLI
|
||||||
|
|
||||||
|
Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty of examples and unlock the full power of Docling!
|
||||||
|
|
||||||
|
|
||||||
## Installation
|
## Installation
|
||||||
|
|
||||||
@ -35,271 +40,30 @@ pip install docling
|
|||||||
|
|
||||||
Works on macOS, Linux and Windows environments. Both x86_64 and arm64 architectures.
|
Works on macOS, Linux and Windows environments. Both x86_64 and arm64 architectures.
|
||||||
|
|
||||||
<details>
|
More [detailed installation instructions](https://ds4sd.github.io/docling/installation/) are available in the docs.
|
||||||
<summary><b>Alternative PyTorch distributions</b></summary>
|
|
||||||
|
|
||||||
The Docling models depend on the [PyTorch](https://pytorch.org/) library.
|
|
||||||
Depending on your architecture, you might want to use a different distribution of `torch`.
|
|
||||||
For example, you might want support for different accelerator or for a cpu-only version.
|
|
||||||
All the different ways for installing `torch` are listed on their website <https://pytorch.org/>.
|
|
||||||
|
|
||||||
One common situation is the installation on Linux systems with cpu-only support.
|
|
||||||
In this case, we suggest the installation of Docling with the following options
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Example for installing on the Linux cpu-only version
|
|
||||||
pip install docling --extra-index-url https://download.pytorch.org/whl/cpu
|
|
||||||
```
|
|
||||||
</details>
|
|
||||||
|
|
||||||
<details>
|
|
||||||
<summary><b>Alternative OCR engines</b></summary>
|
|
||||||
|
|
||||||
Docling supports multiple OCR engines for processing scanned documents. The current version provides
|
|
||||||
the following engines.
|
|
||||||
|
|
||||||
| Engine | Installation | Usage |
|
|
||||||
| ------ | ------------ | ----- |
|
|
||||||
| [EasyOCR](https://github.com/JaidedAI/EasyOCR) | Default in Docling or via `pip install easyocr`. | `EasyOcrOptions` |
|
|
||||||
| Tesseract | System dependency. See description for Tesseract and Tesserocr below. | `TesseractOcrOptions` |
|
|
||||||
| Tesseract CLI | System dependency. See description below. | `TesseractCliOcrOptions` |
|
|
||||||
|
|
||||||
The Docling `DocumentConverter` lets you choose the OCR engine via the `ocr_options` setting. For example:
|
|
||||||
|
|
||||||
```python
|
|
||||||
from docling.datamodel.base_models import ConversionStatus, PipelineOptions
|
|
||||||
from docling.datamodel.pipeline_options import PipelineOptions, EasyOcrOptions, TesseractOcrOptions
|
|
||||||
from docling.document_converter import DocumentConverter
|
|
||||||
|
|
||||||
pipeline_options = PipelineOptions()
|
|
||||||
pipeline_options.do_ocr = True
|
|
||||||
pipeline_options.ocr_options = TesseractOcrOptions() # Use Tesseract
|
|
||||||
|
|
||||||
doc_converter = DocumentConverter(
|
|
||||||
pipeline_options=pipeline_options,
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Tesseract installation
|
|
||||||
|
|
||||||
[Tesseract](https://github.com/tesseract-ocr/tesseract) is a popular OCR engine which is available
|
|
||||||
on most operating systems. For using this engine with Docling, Tesseract must be installed on your
|
|
||||||
system, using the packaging tool of your choice. Below we provide example commands.
|
|
||||||
After installing Tesseract you are expected to provide the path to its language files using the
|
|
||||||
`TESSDATA_PREFIX` environment variable (note that it must terminate with a slash `/`).
|
|
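The trailing-slash requirement on `TESSDATA_PREFIX` is easy to miss. As a minimal sketch, assuming you set the variable from Python rather than from the shell, a small helper can normalize the path first. The helper name `normalized_tessdata_prefix` is hypothetical and not part of Docling or Tesseract; the path is the RHEL location quoted in this README.

```python
import os


def normalized_tessdata_prefix(path: str) -> str:
    """Return `path` with exactly one trailing slash (hypothetical helper)."""
    return path.rstrip("/") + "/"


# Illustrative value only: use the tessdata directory of your own installation.
os.environ["TESSDATA_PREFIX"] = normalized_tessdata_prefix(
    "/usr/share/tesseract/tessdata"
)
```

Set the variable before the OCR engine is initialized, since it is usually read at that point.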
||||||
|
|
||||||
For macOS, we recommend using [Homebrew](https://brew.sh/).
|
|
||||||
|
|
||||||
```console
|
|
||||||
brew install tesseract leptonica pkg-config
|
|
||||||
TESSDATA_PREFIX=/opt/homebrew/share/tessdata/
|
|
||||||
echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}"
|
|
||||||
```
|
|
||||||
|
|
||||||
For Debian-based systems.
|
|
||||||
|
|
||||||
```console
|
|
||||||
apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config
|
|
||||||
TESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$)
|
|
||||||
echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}"
|
|
||||||
```
|
|
||||||
|
|
||||||
For RHEL systems.
|
|
||||||
|
|
||||||
```console
|
|
||||||
dnf install tesseract tesseract-devel tesseract-langpack-eng leptonica-devel
|
|
||||||
TESSDATA_PREFIX=/usr/share/tesseract/tessdata/
|
|
||||||
echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}"
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Linking to Tesseract
|
|
||||||
The most efficient usage of the Tesseract library is via linking. Docling is using
|
|
||||||
the [Tesserocr](https://github.com/sirfz/tesserocr) package for this.
|
|
||||||
|
|
||||||
If you run into issues installing Tesserocr, we suggest using the following
|
|
||||||
installation options:
|
|
||||||
|
|
||||||
```console
|
|
||||||
pip uninstall tesserocr
|
|
||||||
pip install --no-binary :all: tesserocr
|
|
||||||
```
|
|
||||||
</details>
|
|
||||||
|
|
||||||
<details>
|
|
||||||
<summary><b>Docling development setup</b></summary>
|
|
||||||
|
|
||||||
To develop for Docling (features, bugfixes etc.), install as follows from your local clone's root dir:
|
|
||||||
```bash
|
|
||||||
poetry install --all-extras
|
|
||||||
```
|
|
||||||
</details>
|
|
||||||
|
|
||||||
## Getting started
|
## Getting started
|
||||||
|
|
||||||
### Convert a single document
|
To convert individual documents, use `convert()`, for example:
|
||||||
|
|
||||||
To convert individual PDF documents, use `convert_single()`, for example:
|
|
||||||
|
|
||||||
```python
|
```python
|
||||||
from docling.document_converter import DocumentConverter
|
from docling.document_converter import DocumentConverter
|
||||||
|
|
||||||
source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL
|
source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL
|
||||||
converter = DocumentConverter()
|
converter = DocumentConverter()
|
||||||
result = converter.convert_single(source)
|
result = converter.convert(source)
|
||||||
print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]"
|
print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]"
|
||||||
print(result.document.export_to_document_tokens()) # output: "<document><title><page_1><loc_20>..."
|
print(result.document.export_to_document_tokens()) # output: "<document><title><page_1><loc_20>..."
|
||||||
```
|
```
|
||||||
|
|
||||||
### Convert a batch of documents
|
|
||||||
|
|
||||||
For an example of batch-converting documents, see [batch_convert.py](https://github.com/DS4SD/docling/blob/main/examples/batch_convert.py).
|
Check out [Getting started](https://ds4sd.github.io/docling/).
|
||||||
|
You will find lots of tuning options to leverage all the advanced capabilities.
|
||||||
From a local repo clone, you can run it with:
|
|
||||||
|
|
||||||
```
|
|
||||||
python examples/batch_convert.py
|
|
||||||
```
|
|
||||||
The output of the above command will be written to `./scratch`.
|
|
||||||
|
|
||||||
### CLI
|
|
||||||
|
|
||||||
You can also use Docling directly from your command line to convert individual files —be it local or by URL— or whole directories.
|
|
||||||
|
|
||||||
A simple example would look like this:
|
|
||||||
```console
|
|
||||||
docling https://arxiv.org/pdf/2206.01062
|
|
||||||
```
|
|
||||||
|
|
||||||
To see all available options (export formats etc.) run `docling --help`.
|
|
||||||
|
|
||||||
<details>
|
|
||||||
<summary><b>CLI reference</b></summary>
|
|
||||||
|
|
||||||
Here are the available options as of this writing (for an up-to-date listing, run `docling --help`):
|
|
||||||
|
|
||||||
```console
|
|
||||||
$ docling --help
|
|
||||||
|
|
||||||
Usage: docling [OPTIONS] source
|
|
||||||
|
|
||||||
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
|
|
||||||
│ * input_sources source PDF files to convert. Can be local file / directory paths or URL. [default: None] [required] │
|
|
||||||
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
|
|
||||||
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
|
|
||||||
│ --json --no-json If enabled the document is exported as JSON. [default: no-json] │
|
|
||||||
│ --md --no-md If enabled the document is exported as Markdown. [default: md] │
|
|
||||||
│ --txt --no-txt If enabled the document is exported as Text. [default: no-txt] │
|
|
||||||
│ --doctags --no-doctags If enabled the document is exported as Doc Tags. [default: no-doctags] │
|
|
||||||
│ --ocr --no-ocr If enabled, the bitmap content will be processed using OCR. [default: ocr] │
|
|
||||||
│ --backend [pypdfium2|docling] The PDF backend to use. [default: docling] │
|
|
||||||
│ --output PATH Output directory where results are saved. [default: .] │
|
|
||||||
│ --version Show version information. │
|
|
||||||
│ --help Show this message and exit. │
|
|
||||||
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
|
|
||||||
```
|
|
||||||
</details>
|
|
||||||
|
|
||||||
### RAG
|
|
||||||
Check out the following examples showcasing RAG using Docling with standard LLM application frameworks:
|
|
||||||
- [Basic RAG pipeline with LlamaIndex 🦙](https://github.com/DS4SD/docling/tree/main/docs/examples/rag_llamaindex.ipynb)
|
|
||||||
- [Basic RAG pipeline with LangChain 🦜🔗](https://github.com/DS4SD/docling/tree/main/docs/examples/rag_langchain.ipynb)
|
|
||||||
|
|
||||||
## Advanced features
|
|
||||||
|
|
||||||
### Adjust pipeline features
|
|
||||||
|
|
||||||
The example file [custom_convert.py](https://github.com/DS4SD/docling/blob/main/examples/custom_convert.py) contains multiple ways
|
|
||||||
one can adjust the conversion pipeline and features.
|
|
||||||
|
|
||||||
|
|
||||||
#### Control pipeline options
|
## Get help and support
|
||||||
|
|
||||||
You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
|
Please feel free to connect with us using the [discussion section](https://github.com/DS4SD/docling/discussions).
|
||||||
```python
|
|
||||||
doc_converter = DocumentConverter(
|
|
||||||
artifacts_path=artifacts_path,
|
|
||||||
pipeline_options=PipelineOptions(
|
|
||||||
do_table_structure=False, # controls if table structure is recovered
|
|
||||||
do_ocr=True, # controls if OCR is applied (ignores programmatic content)
|
|
||||||
),
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Control table extraction options
|
|
||||||
|
|
||||||
You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself.
|
|
||||||
This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
|
|
||||||
|
|
||||||
|
|
||||||
```python
|
|
||||||
from docling.datamodel.pipeline_options import PipelineOptions
|
|
||||||
|
|
||||||
pipeline_options = PipelineOptions(do_table_structure=True)
|
|
||||||
pipeline_options.table_structure_options.do_cell_matching = False # uses text cells predicted from table structure model
|
|
||||||
|
|
||||||
doc_converter = DocumentConverter(
|
|
||||||
artifacts_path=artifacts_path,
|
|
||||||
pipeline_options=pipeline_options,
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
Since docling 1.16.0: You can control which TableFormer mode you want to use. Choose between `TableFormerMode.FAST` (default) and `TableFormerMode.ACCURATE` (better, but slower) to receive better quality with difficult table structures.
|
|
||||||
|
|
||||||
```python
|
|
||||||
from docling.datamodel.pipeline_options import PipelineOptions, TableFormerMode
|
|
||||||
|
|
||||||
pipeline_options = PipelineOptions(do_table_structure=True)
|
|
||||||
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE # use more accurate TableFormer model
|
|
||||||
|
|
||||||
doc_converter = DocumentConverter(
|
|
||||||
artifacts_path=artifacts_path,
|
|
||||||
pipeline_options=pipeline_options,
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
### Impose limits on the document size
|
|
||||||
|
|
||||||
You can limit the file size and the number of pages allowed per document:
|
|
||||||
```python
|
|
||||||
conv_input = DocumentConversionInput.from_paths(
|
|
||||||
paths=[Path("./test/data/2206.01062.pdf")],
|
|
||||||
limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
|
|
||||||
)
|
|
||||||
```
|
|
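The `max_file_size=20971520` value in the snippet above is 20 MiB expressed in bytes. The following sketch only illustrates that arithmetic and what a size check of this kind amounts to; `MAX_FILE_SIZE` and `within_size_limit` are hypothetical names, and this is not how Docling itself enforces `DocumentLimits`.

```python
from pathlib import Path

# 20 MiB in bytes: 20 * 1024 * 1024 == 20971520
MAX_FILE_SIZE = 20 * 1024 * 1024


def within_size_limit(path: Path, max_bytes: int = MAX_FILE_SIZE) -> bool:
    """Hypothetical pre-check: True if the file is no larger than `max_bytes`."""
    return path.stat().st_size <= max_bytes
```

A pre-check like this could be used to skip oversized files before handing them to a converter.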
||||||
|
|
||||||
### Convert from binary PDF streams
|
|
||||||
|
|
||||||
You can convert PDFs from a binary stream instead of from the filesystem as follows:
|
|
||||||
|
|
||||||
```python
|
|
||||||
buf = BytesIO(your_binary_stream)
|
|
||||||
docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
|
|
||||||
conv_input = DocumentConversionInput.from_streams(docs)
|
|
||||||
results = doc_converter.convert_batch(conv_input)
|
|
||||||
```
|
|
||||||
### Limit resource usage
|
|
||||||
|
|
||||||
You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
|
|
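As a minimal sketch, assuming you prefer to set the thread limit from Python rather than from the shell: the variable should be set before Docling and its numerical dependencies are imported, because OpenMP runtimes typically read it once at start-up. The value `"2"` is arbitrary and only for illustration.

```python
import os

# Set the limit first; later imports of OpenMP-backed libraries pick it up.
os.environ["OMP_NUM_THREADS"] = "2"  # illustrative value, not a recommendation

# Import Docling only after the variable is set, for example:
# from docling.document_converter import DocumentConverter
```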
||||||
|
|
||||||
### Chunking
|
|
||||||
|
|
||||||
You can perform a hierarchy-aware chunking of a Docling document as follows:
|
|
||||||
|
|
||||||
```python
|
|
||||||
from docling.document_converter import DocumentConverter
|
|
||||||
from docling_core.transforms.chunker import HierarchicalChunker
|
|
||||||
|
|
||||||
doc = DocumentConverter().convert_single("https://arxiv.org/pdf/2206.01062").legacy_document
|
|
||||||
chunks = list(HierarchicalChunker().chunk(doc))
|
|
||||||
print(chunks[0])
|
|
||||||
# ChunkWithMetadata(
|
|
||||||
# path='#/main-text/1',
|
|
||||||
# text='DocLayNet: A Large Human-Annotated Dataset [...]',
|
|
||||||
# page=1,
|
|
||||||
# bbox=[107.30, 672.38, 505.19, 709.08],
|
|
||||||
# [...]
|
|
||||||
# )
|
|
||||||
```
|
|
||||||
|
|
||||||
|
|
||||||
## Technical report
|
## Technical report
|
||||||
|
@ -1,5 +1,6 @@
|
|||||||
With Docling v2, we introduce a unified document representation format called `DoclingDocument`. It is defined as a
|
With Docling v2, we introduce a unified document representation format called `DoclingDocument`. It is defined as a
|
||||||
pydantic datatype, which can express several features common to documents, such as:
|
pydantic datatype, which can express several features common to documents, such as:
|
||||||
|
|
||||||
* Text, Tables, Pictures, and more
|
* Text, Tables, Pictures, and more
|
||||||
* Document hierarchy with sections and groups
|
* Document hierarchy with sections and groups
|
||||||
* Disambiguation between main body and headers, footers (furniture)
|
* Disambiguation between main body and headers, footers (furniture)
|
||||||
|
@ -17,13 +17,13 @@
|
|||||||
[](https://github.com/pre-commit/pre-commit)
|
[](https://github.com/pre-commit/pre-commit)
|
||||||
[](https://opensource.org/licenses/MIT)
|
[](https://opensource.org/licenses/MIT)
|
||||||
|
|
||||||
Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
|
Docling parses documents and exports them to the desired format with ease and speed.
|
||||||
|
|
||||||
## Features
|
## Features
|
||||||
|
|
||||||
* ⚡ Converts PDF, Word, Powerpoint or HTML documents to JSON or Markdown format, stable and lightning fast
|
* 🗂️ Multi-format support for input (PDF, DOCX etc.) & output (Markdown, JSON etc.)
|
||||||
* 📑 Understands detailed page layout, reading order and recovers table structures
|
* 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
|
||||||
* 📝 Extracts metadata from the document, such as title, authors, references and language
|
* 📝 Metadata extraction, including title, authors, references & language
|
||||||
* 🔍 Includes OCR support for scanned PDFs or image formats
|
* 🤖 Seamless LlamaIndex 🦙 & LangChain 🦜🔗 integration for powerful RAG / QA applications
|
||||||
* 🤖 Integrates easily with LLM app / RAG frameworks like LlamaIndex 🦙 & LangChain 🦜🔗
|
* 🔍 OCR support for scanned PDFs
|
||||||
* 💻 Provides a simple and convenient CLI
|
* 💻 Simple and convenient CLI
|
||||||
|
@ -1,6 +1,7 @@
|
|||||||
## What's new
|
## What's new
|
||||||
|
|
||||||
Docling v2 introduces several new features:
|
Docling v2 introduces several new features:
|
||||||
|
|
||||||
- Understands and converts PDF, MS Word, MS Powerpoint, HTML and several image formats
|
- Understands and converts PDF, MS Word, MS Powerpoint, HTML and several image formats
|
||||||
- Produces a new, universal document representation which can encapsulate document hierarchy
|
- Produces a new, universal document representation which can encapsulate document hierarchy
|
||||||
- Comes with a fresh new API and CLI
|
- Comes with a fresh new API and CLI
|
||||||
@ -29,6 +30,7 @@ docling ./input/dir --output ./scratch --abort-on-error
|
|||||||
```
|
```
|
||||||
|
|
||||||
**Notable changes from Docling v1:**
|
**Notable changes from Docling v1:**
|
||||||
|
|
||||||
- The standalone switches for different export formats are removed, and replaced with `--from` and `--to` arguments, to define input and output formats respectively.
|
- The standalone switches for different export formats are removed, and replaced with `--from` and `--to` arguments, to define input and output formats respectively.
|
||||||
- The new `--abort-on-error` will abort any batch conversion as soon as an error is encountered
|
- The new `--abort-on-error` will abort any batch conversion as soon as an error is encountered
|
||||||
- The `--backend` option for PDFs was removed
|
- The `--backend` option for PDFs was removed
|
||||||
@ -168,6 +170,7 @@ conv_result.legacy_document # provides the representation in previous ExportedCC
|
|||||||
## Export into JSON, Markdown, Doctags
|
## Export into JSON, Markdown, Doctags
|
||||||
**Note**: All `render_...` methods in `ConversionResult` have been removed in Docling v2,
|
**Note**: All `render_...` methods in `ConversionResult` have been removed in Docling v2,
|
||||||
and are now available on `DoclingDocument` as:
|
and are now available on `DoclingDocument` as:
|
||||||
|
|
||||||
- `DoclingDocument.export_to_dict`
|
- `DoclingDocument.export_to_dict`
|
||||||
- `DoclingDocument.export_to_markdown`
|
- `DoclingDocument.export_to_markdown`
|
||||||
- `DoclingDocument.export_to_document_tokens`
|
- `DoclingDocument.export_to_document_tokens`
|
||||||
|
@ -54,6 +54,7 @@ nav:
|
|||||||
- Get started:
|
- Get started:
|
||||||
- Home: index.md
|
- Home: index.md
|
||||||
- Installation: installation.md
|
- Installation: installation.md
|
||||||
|
- Use Docling: use_docling.md
|
||||||
- Docling v2: v2.md
|
- Docling v2: v2.md
|
||||||
- Concepts:
|
- Concepts:
|
||||||
- The Docling Document format: concepts/docling_format.md
|
- The Docling Document format: concepts/docling_format.md
|
||||||
|