mirror of
https://github.com/DS4SD/docling.git
synced 2025-07-27 04:24:45 +00:00
simplify README, move content to docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
parent
0858ed91c3
commit
b7d976149b
274
README.md
@ -7,6 +7,7 @@
|
||||
# Docling
|
||||
|
||||
[](https://arxiv.org/abs/2408.09869)
|
||||
[](https://ds4sd.github.io/docling/)
|
||||
[](https://pypi.org/project/docling/)
|
||||

|
||||
[](https://python-poetry.org/)
|
||||
@ -16,15 +17,19 @@
|
||||
[](https://github.com/pre-commit/pre-commit)
|
||||
[](https://opensource.org/licenses/MIT)
|
||||
|
||||
Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
|
||||
Docling parses documents and exports them to the desired format with ease and speed.
|
||||
|
||||
## Features
|
||||
* ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
|
||||
* 📑 Understands detailed page layout, reading order and recovers table structures
|
||||
* 📝 Extracts metadata from the document, such as title, authors, references and language
|
||||
* 🔍 Includes OCR support for scanned PDFs
|
||||
* 🤖 Integrates easily with LLM app / RAG frameworks like 🦙 LlamaIndex and 🦜🔗 LangChain
|
||||
* 💻 Provides a simple and convenient CLI
|
||||
|
||||
* 🗂️ Multi-format support for input (PDF, DOCX etc.) & output (Markdown, JSON etc.)
|
||||
* 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
|
||||
* 📝 Metadata extraction, including title, authors, references & language
|
||||
* 🤖 Seamless LlamaIndex 🦙 & LangChain 🦜🔗 integration for powerful RAG / QA applications
|
||||
* 🔍 OCR support for scanned PDFs
|
||||
* 💻 Simple and convenient CLI
|
||||
|
||||
Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty of examples and unlock the full power of Docling!
|
||||
|
||||
|
||||
## Installation
|
||||
|
||||
@ -35,271 +40,30 @@ pip install docling
|
||||
|
||||
Works on macOS, Linux and Windows environments, on both x86_64 and arm64 architectures.
|
||||
|
||||
<details>
|
||||
<summary><b>Alternative PyTorch distributions</b></summary>
|
||||
|
||||
The Docling models depend on the [PyTorch](https://pytorch.org/) library.
|
||||
Depending on your architecture, you might want to use a different distribution of `torch`.
|
||||
For example, you might want support for a different accelerator or a CPU-only version.
|
||||
All the ways to install `torch` are listed on the PyTorch website <https://pytorch.org/>.
|
||||
|
||||
One common situation is the installation on Linux systems with cpu-only support.
|
||||
In this case, we suggest installing Docling with the following options:
|
||||
|
||||
```bash
|
||||
# Example: install with the CPU-only PyTorch build on Linux
|
||||
pip install docling --extra-index-url https://download.pytorch.org/whl/cpu
|
||||
```
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary><b>Alternative OCR engines</b></summary>
|
||||
|
||||
Docling supports multiple OCR engines for processing scanned documents. The current version provides
|
||||
the following engines.
|
||||
|
||||
| Engine | Installation | Usage |
|
||||
| ------ | ------------ | ----- |
|
||||
| [EasyOCR](https://github.com/JaidedAI/EasyOCR) | Default in Docling or via `pip install easyocr`. | `EasyOcrOptions` |
|
||||
| Tesseract | System dependency. See description for Tesseract and Tesserocr below. | `TesseractOcrOptions` |
|
||||
| Tesseract CLI | System dependency. See description below. | `TesseractCliOcrOptions` |
|
||||
|
||||
The Docling `DocumentConverter` lets you choose the OCR engine through the `ocr_options` setting. For example:
|
||||
|
||||
```python
|
||||
from docling.datamodel.base_models import ConversionStatus, PipelineOptions
|
||||
from docling.datamodel.pipeline_options import PipelineOptions, EasyOcrOptions, TesseractOcrOptions
|
||||
from docling.document_converter import DocumentConverter
|
||||
|
||||
pipeline_options = PipelineOptions()
|
||||
pipeline_options.do_ocr = True
|
||||
pipeline_options.ocr_options = TesseractOcrOptions() # Use Tesseract
|
||||
|
||||
doc_converter = DocumentConverter(
|
||||
pipeline_options=pipeline_options,
|
||||
)
|
||||
```
|
||||
|
||||
#### Tesseract installation
|
||||
|
||||
[Tesseract](https://github.com/tesseract-ocr/tesseract) is a popular OCR engine which is available
|
||||
on most operating systems. To use this engine with Docling, Tesseract must be installed on your
|
||||
system, using the packaging tool of your choice. Below we provide example commands.
|
||||
After installing Tesseract you are expected to provide the path to its language files using the
|
||||
`TESSDATA_PREFIX` environment variable (note that it must terminate with a slash `/`).
|
||||
|
||||
For macOS, we recommend using [Homebrew](https://brew.sh/).
|
||||
|
||||
```console
|
||||
brew install tesseract leptonica pkg-config
|
||||
TESSDATA_PREFIX=/opt/homebrew/share/tessdata/
|
||||
echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}"
|
||||
```
|
||||
|
||||
For Debian-based systems:
|
||||
|
||||
```console
|
||||
apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config
|
||||
TESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$)
|
||||
echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}"
|
||||
```
|
||||
|
||||
For RHEL systems:
|
||||
|
||||
```console
|
||||
dnf install tesseract tesseract-devel tesseract-langpack-eng leptonica-devel
|
||||
TESSDATA_PREFIX=/usr/share/tesseract/tessdata/
|
||||
echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}"
|
||||
```
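
The note above says `TESSDATA_PREFIX` must end with a slash, and a path produced by a shell lookup may not include one. The following is a minimal sketch of a check you could run before starting a conversion. The helper name `with_trailing_slash` is hypothetical and not part of Docling, and it assumes POSIX-style paths.

```python
import os


def with_trailing_slash(path: str) -> str:
    """Return `path` ending in exactly the form the note above asks for.

    Hypothetical helper, not part of Docling. Assumes POSIX-style paths.
    """
    if not path:
        raise ValueError("TESSDATA_PREFIX is empty")
    return path if path.endswith("/") else path + "/"


# Normalize the variable for the current process, if it is set at all.
prefix = os.environ.get("TESSDATA_PREFIX")
if prefix:
    os.environ["TESSDATA_PREFIX"] = with_trailing_slash(prefix)
```

This only affects the running Python process; the variable still has to be exported in your shell for other tools to see it.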
|
||||
|
||||
#### Linking to Tesseract
|
||||
The most efficient way to use the Tesseract library is via linking. Docling uses
|
||||
the [Tesserocr](https://github.com/sirfz/tesserocr) package for this.
|
||||
|
||||
If you run into issues installing Tesserocr, we suggest using the following
|
||||
installation options:
|
||||
|
||||
```console
|
||||
pip uninstall tesserocr
|
||||
pip install --no-binary :all: tesserocr
|
||||
```
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary><b>Docling development setup</b></summary>
|
||||
|
||||
To develop for Docling (features, bugfixes etc.), install as follows from your local clone's root dir:
|
||||
```bash
|
||||
poetry install --all-extras
|
||||
```
|
||||
</details>
|
||||
More [detailed installation instructions](https://ds4sd.github.io/docling/installation/) are available in the docs.
|
||||
|
||||
## Getting started
|
||||
|
||||
### Convert a single document
|
||||
|
||||
To convert individual PDF documents, use `convert_single()`, for example:
|
||||
To convert individual documents, use `convert()`, for example:
|
||||
|
||||
```python
|
||||
from docling.document_converter import DocumentConverter
|
||||
|
||||
source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL
|
||||
converter = DocumentConverter()
|
||||
result = converter.convert_single(source)
|
||||
result = converter.convert(source)
|
||||
print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]"
|
||||
print(result.document.export_to_document_tokens()) # output: "<document><title><page_1><loc_20>..."
|
||||
```
|
||||
|
||||
### Convert a batch of documents
|
||||
|
||||
For an example of batch-converting documents, see [batch_convert.py](https://github.com/DS4SD/docling/blob/main/examples/batch_convert.py).
|
||||
|
||||
From a local repo clone, you can run it with:
|
||||
|
||||
```
|
||||
python examples/batch_convert.py
|
||||
```
|
||||
The output of the above command will be written to `./scratch`.
|
||||
|
||||
### CLI
|
||||
|
||||
You can also use Docling directly from your command line to convert individual files (local or by URL) or whole directories.
|
||||
|
||||
A simple example would look like this:
|
||||
```console
|
||||
docling https://arxiv.org/pdf/2206.01062
|
||||
```
|
||||
|
||||
To see all available options (export formats etc.) run `docling --help`.
|
||||
|
||||
<details>
|
||||
<summary><b>CLI reference</b></summary>
|
||||
|
||||
Here are the available options as of this writing (for an up-to-date listing, run `docling --help`):
|
||||
|
||||
```console
|
||||
$ docling --help
|
||||
|
||||
Usage: docling [OPTIONS] source
|
||||
|
||||
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
|
||||
│ * input_sources source PDF files to convert. Can be local file / directory paths or URL. [default: None] [required] │
|
||||
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
|
||||
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
|
||||
│ --json --no-json If enabled the document is exported as JSON. [default: no-json] │
|
||||
│ --md --no-md If enabled the document is exported as Markdown. [default: md] │
|
||||
│ --txt --no-txt If enabled the document is exported as Text. [default: no-txt] │
|
||||
│ --doctags --no-doctags If enabled the document is exported as Doc Tags. [default: no-doctags] │
|
||||
│ --ocr --no-ocr If enabled, the bitmap content will be processed using OCR. [default: ocr] │
|
||||
│ --backend [pypdfium2|docling] The PDF backend to use. [default: docling] │
|
||||
│ --output PATH Output directory where results are saved. [default: .] │
|
||||
│ --version Show version information. │
|
||||
│ --help Show this message and exit. │
|
||||
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
|
||||
```
|
||||
</details>
|
||||
|
||||
### RAG
|
||||
Check out the following examples showcasing RAG using Docling with standard LLM application frameworks:
|
||||
- [Basic RAG pipeline with LlamaIndex 🦙](https://github.com/DS4SD/docling/tree/main/docs/examples/rag_llamaindex.ipynb)
|
||||
- [Basic RAG pipeline with LangChain 🦜🔗](https://github.com/DS4SD/docling/tree/main/docs/examples/rag_langchain.ipynb)
|
||||
|
||||
## Advanced features
|
||||
|
||||
### Adjust pipeline features
|
||||
|
||||
The example file [custom_convert.py](https://github.com/DS4SD/docling/blob/main/examples/custom_convert.py) contains multiple ways
|
||||
one can adjust the conversion pipeline and features.
|
||||
Check out [Getting started](https://ds4sd.github.io/docling/).
|
||||
You will find lots of tuning options to leverage all the advanced capabilities.
|
||||
|
||||
|
||||
#### Control pipeline options
|
||||
## Get help and support
|
||||
|
||||
You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
|
||||
```python
|
||||
doc_converter = DocumentConverter(
|
||||
artifacts_path=artifacts_path,
|
||||
pipeline_options=PipelineOptions(
|
||||
do_table_structure=False, # controls if table structure is recovered
|
||||
do_ocr=True, # controls if OCR is applied (ignores programmatic content)
|
||||
),
|
||||
)
|
||||
```
|
||||
|
||||
#### Control table extraction options
|
||||
|
||||
You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself.
|
||||
This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
|
||||
|
||||
|
||||
```python
|
||||
from docling.datamodel.pipeline_options import PipelineOptions
|
||||
|
||||
pipeline_options = PipelineOptions(do_table_structure=True)
|
||||
pipeline_options.table_structure_options.do_cell_matching = False # uses text cells predicted from table structure model
|
||||
|
||||
doc_converter = DocumentConverter(
|
||||
artifacts_path=artifacts_path,
|
||||
pipeline_options=pipeline_options,
|
||||
)
|
||||
```
|
||||
|
||||
Since docling 1.16.0, you can choose which TableFormer mode to use: `TableFormerMode.FAST` (default) or `TableFormerMode.ACCURATE` (slower, but gives better quality on difficult table structures).
|
||||
|
||||
```python
|
||||
from docling.datamodel.pipeline_options import PipelineOptions, TableFormerMode
|
||||
|
||||
pipeline_options = PipelineOptions(do_table_structure=True)
|
||||
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE # use more accurate TableFormer model
|
||||
|
||||
doc_converter = DocumentConverter(
|
||||
artifacts_path=artifacts_path,
|
||||
pipeline_options=pipeline_options,
|
||||
)
|
||||
```
|
||||
|
||||
### Impose limits on the document size
|
||||
|
||||
You can limit the file size and the number of pages allowed per document:
|
||||
```python
|
||||
conv_input = DocumentConversionInput.from_paths(
|
||||
paths=[Path("./test/data/2206.01062.pdf")],
|
||||
limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
|
||||
)
|
||||
```
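
The literal `20971520` in the example above is easier to read when written as a product. A minimal sketch, assuming `max_file_size` is given in bytes (the value equals 20 MiB, which suggests this, but the text does not state the unit); the helper `mib_to_bytes` is hypothetical:

```python
def mib_to_bytes(mebibytes: int) -> int:
    """Convert mebibytes to bytes (1 MiB = 1024 * 1024 bytes). Hypothetical helper."""
    return mebibytes * 1024 * 1024


# Same value as the literal used in the example above.
MAX_FILE_SIZE = mib_to_bytes(20)
```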
|
||||
|
||||
### Convert from binary PDF streams
|
||||
|
||||
You can convert PDFs from a binary stream instead of from the filesystem as follows:
|
||||
|
||||
```python
|
||||
buf = BytesIO(your_binary_stream)
|
||||
docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
|
||||
conv_input = DocumentConversionInput.from_streams(docs)
|
||||
results = doc_converter.convert_batch(conv_input)
|
||||
```
|
||||
### Limit resource usage
|
||||
|
||||
You can limit the number of CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. By default, Docling uses 4 CPU threads.
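
If you prefer to set the variable from Python rather than from the shell, a minimal sketch could look like the following. The helper `limit_cpu_threads` is hypothetical, and the sketch assumes the variable is read when the numerical libraries initialize, so it is set before Docling is imported.

```python
import os


def limit_cpu_threads(num_threads: int) -> None:
    """Set OMP_NUM_THREADS for the current process. Hypothetical helper."""
    if num_threads < 1:
        raise ValueError("num_threads must be a positive integer")
    os.environ["OMP_NUM_THREADS"] = str(num_threads)


limit_cpu_threads(2)

# Import Docling only after the variable is set, for example:
# from docling.document_converter import DocumentConverter
```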
|
||||
|
||||
### Chunking
|
||||
|
||||
You can perform a hierarchy-aware chunking of a Docling document as follows:
|
||||
|
||||
```python
|
||||
from docling.document_converter import DocumentConverter
|
||||
from docling_core.transforms.chunker import HierarchicalChunker
|
||||
|
||||
doc = DocumentConverter().convert_single("https://arxiv.org/pdf/2206.01062").legacy_document
|
||||
chunks = list(HierarchicalChunker().chunk(doc))
|
||||
print(chunks[0])
|
||||
# ChunkWithMetadata(
|
||||
# path='#/main-text/1',
|
||||
# text='DocLayNet: A Large Human-Annotated Dataset [...]',
|
||||
# page=1,
|
||||
# bbox=[107.30, 672.38, 505.19, 709.08],
|
||||
# [...]
|
||||
# )
|
||||
```
|
||||
Please feel free to connect with us using the [discussion section](https://github.com/DS4SD/docling/discussions).
|
||||
|
||||
|
||||
## Technical report
|
||||
|
@ -1,5 +1,6 @@
|
||||
With Docling v2, we introduce a unified document representation format called `DoclingDocument`. It is defined as a
|
||||
pydantic datatype, which can express several features common to documents, such as:
|
||||
|
||||
* Text, Tables, Pictures, and more
|
||||
* Document hierarchy with sections and groups
|
||||
* Disambiguation between main body and headers, footers (furniture)
|
||||
|
@ -17,13 +17,13 @@
|
||||
[](https://github.com/pre-commit/pre-commit)
|
||||
[](https://opensource.org/licenses/MIT)
|
||||
|
||||
Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
|
||||
Docling parses documents and exports them to the desired format with ease and speed.
|
||||
|
||||
## Features
|
||||
|
||||
* ⚡ Converts PDF, Word, PowerPoint or HTML documents to JSON or Markdown format, stable and lightning fast
|
||||
* 📑 Understands detailed page layout, reading order and recovers table structures
|
||||
* 📝 Extracts metadata from the document, such as title, authors, references and language
|
||||
* 🔍 Includes OCR support for scanned PDFs or image formats
|
||||
* 🤖 Integrates easily with LLM app / RAG frameworks like LlamaIndex 🦙 & LangChain 🦜🔗
|
||||
* 💻 Provides a simple and convenient CLI
|
||||
* 🗂️ Multi-format support for input (PDF, DOCX etc.) & output (Markdown, JSON etc.)
|
||||
* 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
|
||||
* 📝 Metadata extraction, including title, authors, references & language
|
||||
* 🤖 Seamless LlamaIndex 🦙 & LangChain 🦜🔗 integration for powerful RAG / QA applications
|
||||
* 🔍 OCR support for scanned PDFs
|
||||
* 💻 Simple and convenient CLI
|
||||
|
@ -1,6 +1,7 @@
|
||||
## What's new
|
||||
|
||||
Docling v2 introduces several new features:
|
||||
|
||||
- Understands and converts PDF, MS Word, MS PowerPoint, HTML and several image formats
|
||||
- Produces a new, universal document representation which can encapsulate document hierarchy
|
||||
- Comes with a fresh new API and CLI
|
||||
@ -29,6 +30,7 @@ docling ./input/dir --output ./scratch --abort-on-error
|
||||
```
|
||||
|
||||
**Notable changes from Docling v1:**
|
||||
|
||||
- The standalone switches for different export formats are removed, and replaced with `--from` and `--to` arguments, to define input and output formats respectively.
|
||||
- The new `--abort-on-error` will abort any batch conversion as soon as an error is encountered
|
||||
- The `--backend` option for PDFs was removed
|
||||
@ -168,6 +170,7 @@ conv_result.legacy_document # provides the representation in previous ExportedCC
|
||||
## Export into JSON, Markdown, Doctags
|
||||
**Note**: All `render_...` methods in `ConversionResult` have been removed in Docling v2,
|
||||
and are now available on `DoclingDocument` as:
|
||||
|
||||
- `DoclingDocument.export_to_dict`
|
||||
- `DoclingDocument.export_to_markdown`
|
||||
- `DoclingDocument.export_to_document_tokens`
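
As a rough illustration of the three methods listed above, the sketch below collects all exports from one document object. The function `export_all` and its dictionary keys are hypothetical; it only assumes the object exposes the three methods named in the list, called without arguments.

```python
from typing import Any, Dict


def export_all(doc: Any) -> Dict[str, Any]:
    """Call the three exporters listed above on a DoclingDocument-like object.

    Hypothetical helper; the keys of the returned dict are illustrative only.
    """
    return {
        "dict": doc.export_to_dict(),
        "markdown": doc.export_to_markdown(),
        "doctags": doc.export_to_document_tokens(),
    }
```

With a converted document this would be called as `export_all(result.document)`, where `result` is the value returned by `DocumentConverter().convert(source)`.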
|
||||
|
@ -54,6 +54,7 @@ nav:
|
||||
- Get started:
|
||||
- Home: index.md
|
||||
- Installation: installation.md
|
||||
- Use Docling: use_docling.md
|
||||
- Docling v2: v2.md
|
||||
- Concepts:
|
||||
- The Docling Document format: concepts/docling_format.md
|
||||
|