diff --git a/README.md b/README.md index 9e20e86e..099a09e6 100644 --- a/README.md +++ b/README.md @@ -7,6 +7,7 @@ # Docling [![arXiv](https://img.shields.io/badge/arXiv-2408.09869-b31b1b.svg)](https://arxiv.org/abs/2408.09869) +[![Docs](https://img.shields.io/badge/docs-live-brightgreen)](https://ds4sd.github.io/docling/) [![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/) ![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue) [![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/) @@ -16,15 +17,19 @@ [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit) [![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT) -Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package. +Docling parses documents and exports them to the desired format with ease and speed. ## Features -* ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast -* 📑 Understands detailed page layout, reading order and recovers table structures -* 📝 Extracts metadata from the document, such as title, authors, references and language -* 🔍 Includes OCR support for scanned PDFs -* 🤖 Integrates easily with LLM app / RAG frameworks like 🦙 LlamaIndex and 🦜🔗 LangChain -* 💻 Provides a simple and convenient CLI + +* 🗂️ Multi-format support for input (PDF, DOCX etc.) & output (Markdown, JSON etc.) +* 📑 Advanced PDF document understanding incl. 
page layout, reading order & table structures +* 📝 Metadata extraction, including title, authors, references & language +* 🤖 Seamless LlamaIndex 🦙 & LangChain 🦜🔗 integration for powerful RAG / QA applications +* 🔍 OCR support for scanned PDFs +* 💻 Simple and convenient CLI + +Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty of examples and unlock the full power of Docling! + ## Installation @@ -35,271 +40,30 @@ pip install docling Works on macOS, Linux and Windows environments. Both x86_64 and arm64 architectures. -
- Alternative PyTorch distributions - - The Docling models depend on the [PyTorch](https://pytorch.org/) library. - Depending on your architecture, you might want to use a different distribution of `torch`. - For example, you might want support for different accelerator or for a cpu-only version. - All the different ways for installing `torch` are listed on their website . - - One common situation is the installation on Linux systems with cpu-only support. - In this case, we suggest the installation of Docling with the following options - - ```bash - # Example for installing on the Linux cpu-only version - pip install docling --extra-index-url https://download.pytorch.org/whl/cpu - ``` -
- -
- Alternative OCR engines - - Docling supports multiple OCR engines for processing scanned documents. The current version provides - the following engines. - - | Engine | Installation | Usage | - | ------ | ------------ | ----- | - | [EasyOCR](https://github.com/JaidedAI/EasyOCR) | Default in Docling or via `pip install easyocr`. | `EasyOcrOptions` | - | Tesseract | System dependency. See description for Tesseract and Tesserocr below. | `TesseractOcrOptions` | - | Tesseract CLI | System dependency. See description below. | `TesseractCliOcrOptions` | - - The Docling `DocumentConverter` allows to choose the OCR engine with the `ocr_options` settings. For example - - ```python - from docling.datamodel.base_models import ConversionStatus, PipelineOptions - from docling.datamodel.pipeline_options import PipelineOptions, EasyOcrOptions, TesseractOcrOptions - from docling.document_converter import DocumentConverter - - pipeline_options = PipelineOptions() - pipeline_options.do_ocr = True - pipeline_options.ocr_options = TesseractOcrOptions() # Use Tesseract - - doc_converter = DocumentConverter( - pipeline_options=pipeline_options, - ) - ``` - - #### Tesseract installation - - [Tesseract](https://github.com/tesseract-ocr/tesseract) is a popular OCR engine which is available - on most operating systems. For using this engine with Docling, Tesseract must be installed on your - system, using the packaging tool of your choice. Below we provide example commands. - After installing Tesseract you are expected to provide the path to its language files using the - `TESSDATA_PREFIX` environment variable (note that it must terminate with a slash `/`). - - For macOS, we reccomend using [Homebrew](https://brew.sh/). - - ```console - brew install tesseract leptonica pkg-config - TESSDATA_PREFIX=/opt/homebrew/share/tessdata/ - echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}" - ``` - - For Debian-based systems. 
- - ```console - apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config - TESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$) - echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}" - ``` - - For RHEL systems. - - ```console - dnf install tesseract tesseract-devel tesseract-langpack-eng leptonica-devel - TESSDATA_PREFIX=/usr/share/tesseract/tessdata/ - echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}" - ``` - - #### Linking to Tesseract - The most efficient usage of the Tesseract library is via linking. Docling is using - the [Tesserocr](https://github.com/sirfz/tesserocr) package for this. - - If you get into installation issues of Tesserocr, we suggest using the following - installation options: - - ```console - pip uninstall tesserocr - pip install --no-binary :all: tesserocr - ``` -
- -
- Docling development setup - - To develop for Docling (features, bugfixes etc.), install as follows from your local clone's root dir: - ```bash - poetry install --all-extras - ``` -
+More [detailed installation instructions](https://ds4sd.github.io/docling/installation/) are available in the docs. ## Getting started -### Convert a single document - -To convert invidual PDF documents, use `convert_single()`, for example: +To convert individual documents, use `convert()`, for example: ```python from docling.document_converter import DocumentConverter source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL converter = DocumentConverter() -result = converter.convert_single(source) +result = converter.convert(source) print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]" print(result.document.export_to_document_tokens()) # output: "<page_1><loc_20>..." ``` -### Convert a batch of documents -For an example of batch-converting documents, see [batch_convert.py](https://github.com/DS4SD/docling/blob/main/examples/batch_convert.py). - -From a local repo clone, you can run it with: - -``` -python examples/batch_convert.py -``` -The output of the above command will be written to `./scratch`. - -### CLI - -You can also use Docling directly from your command line to convert individual files —be it local or by URL— or whole directories. - -A simple example would look like this: -```console -docling https://arxiv.org/pdf/2206.01062 -``` - -To see all available options (export formats etc.) run `docling --help`. - -<details> - <summary><b>CLI reference</b></summary> - - Here are the available options as of this writing (for an up-to-date listing, run `docling --help`): - - ```console - $ docling --help - - Usage: docling [OPTIONS] source - - ╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ - │ * input_sources source PDF files to convert. Can be local file / directory paths or URL. 
[default: None] [required] │ - ╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ - ╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ - │ --json --no-json If enabled the document is exported as JSON. [default: no-json] │ - │ --md --no-md If enabled the document is exported as Markdown. [default: md] │ - │ --txt --no-txt If enabled the document is exported as Text. [default: no-txt] │ - │ --doctags --no-doctags If enabled the document is exported as Doc Tags. [default: no-doctags] │ - │ --ocr --no-ocr If enabled, the bitmap content will be processed using OCR. [default: ocr] │ - │ --backend [pypdfium2|docling] The PDF backend to use. [default: docling] │ - │ --output PATH Output directory where results are saved. [default: .] │ - │ --version Show version information. │ - │ --help Show this message and exit. │ - ╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ - ``` -</details> - -### RAG -Check out the following examples showcasing RAG using Docling with standard LLM application frameworks: -- [Basic RAG pipeline with LlamaIndex 🦙](https://github.com/DS4SD/docling/tree/main/docs/examples/rag_llamaindex.ipynb) -- [Basic RAG pipeline with LangChain 🦜🔗](https://github.com/DS4SD/docling/tree/main/docs/examples/rag_langchain.ipynb) - -## Advanced features - -### Adjust pipeline features - -The example file [custom_convert.py](https://github.com/DS4SD/docling/blob/main/examples/custom_convert.py) contains multiple ways -one can adjust the conversion pipeline and features. +Check out [Getting started](https://ds4sd.github.io/docling/). +You will find lots of tuning options to leverage all the advanced capabilities. 
-#### Control pipeline options +## Get help and support -You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`: -```python -doc_converter = DocumentConverter( - artifacts_path=artifacts_path, - pipeline_options=PipelineOptions( - do_table_structure=False, # controls if table structure is recovered - do_ocr=True, # controls if OCR is applied (ignores programmatic content) - ), -) -``` - -#### Control table extraction options - -You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself. -This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one. - - -```python -from docling.datamodel.pipeline_options import PipelineOptions - -pipeline_options = PipelineOptions(do_table_structure=True) -pipeline_options.table_structure_options.do_cell_matching = False # uses text cells predicted from table structure model - -doc_converter = DocumentConverter( - artifacts_path=artifacts_path, - pipeline_options=pipeline_options, -) -``` - -Since docling 1.16.0: You can control which TableFormer mode you want to use. Choose between `TableFormerMode.FAST` (default) and `TableFormerMode.ACCURATE` (better, but slower) to receive better quality with difficult table structures. 
- -```python -from docling.datamodel.pipeline_options import PipelineOptions, TableFormerMode - -pipeline_options = PipelineOptions(do_table_structure=True) -pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE # use more accurate TableFormer model - -doc_converter = DocumentConverter( - artifacts_path=artifacts_path, - pipeline_options=pipeline_options, -) -``` - -### Impose limits on the document size - -You can limit the file size and number of pages which should be allowed to process per document: -```python -conv_input = DocumentConversionInput.from_paths( - paths=[Path("./test/data/2206.01062.pdf")], - limits=DocumentLimits(max_num_pages=100, max_file_size=20971520) -) -``` - -### Convert from binary PDF streams - -You can convert PDFs from a binary stream instead of from the filesystem as follows: - -```python -buf = BytesIO(your_binary_stream) -docs = [DocumentStream(filename="my_doc.pdf", stream=buf)] -conv_input = DocumentConversionInput.from_streams(docs) -results = doc_converter.convert_batch(conv_input) -``` -### Limit resource usage - -You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads. - -### Chunking - -You can perform a hierarchy-aware chunking of a Docling document as follows: - -```python -from docling.document_converter import DocumentConverter -from docling_core.transforms.chunker import HierarchicalChunker - -doc = DocumentConverter().convert_single("https://arxiv.org/pdf/2206.01062").legacy_document -chunks = list(HierarchicalChunker().chunk(doc)) -print(chunks[0]) -# ChunkWithMetadata( -# path='#/main-text/1', -# text='DocLayNet: A Large Human-Annotated Dataset [...]', -# page=1, -# bbox=[107.30, 672.38, 505.19, 709.08], -# [...] -# ) -``` +Please feel free to connect with us using the [discussion section](https://github.com/DS4SD/docling/discussions). 
## Technical report diff --git a/docs/concepts/docling_format.md b/docs/concepts/docling_format.md index ab7933b0..0e84e44f 100644 --- a/docs/concepts/docling_format.md +++ b/docs/concepts/docling_format.md @@ -1,5 +1,6 @@ With Docling v2, we introduce a unified document representation format called `DoclingDocument`. It is defined as a pydantic datatype, which can express several features common to documents, such as: + * Text, Tables, Pictures, and more * Document hierarchy with sections and groups * Disambiguation between main body and headers, footers (furniture) diff --git a/docs/index.md b/docs/index.md index bc96c9a7..146a5d49 100644 --- a/docs/index.md +++ b/docs/index.md @@ -17,13 +17,13 @@ [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit) [![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT) -Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package. +Docling parses documents and exports them to the desired format with ease and speed. ## Features -* ⚡ Converts PDF, Word, Powerpoint or HTML documents to JSON or Markdown format, stable and lightning fast -* 📑 Understands detailed page layout, reading order and recovers table structures -* 📝 Extracts metadata from the document, such as title, authors, references and language -* 🔍 Includes OCR support for scanned PDFs or image formats -* 🤖 Integrates easily with LLM app / RAG frameworks like LlamaIndex 🦙 & LangChain 🦜🔗 -* 💻 Provides a simple and convenient CLI +* 🗂️ Multi-format support for input (PDF, DOCX etc.) & output (Markdown, JSON etc.) +* 📑 Advanced PDF document understanding incl. 
page layout, reading order & table structures +* 📝 Metadata extraction, including title, authors, references & language +* 🤖 Seamless LlamaIndex 🦙 & LangChain 🦜🔗 integration for powerful RAG / QA applications +* 🔍 OCR support for scanned PDFs +* 💻 Simple and convenient CLI diff --git a/docs/v2.md b/docs/v2.md index 6c11f6bc..0e513ecc 100644 --- a/docs/v2.md +++ b/docs/v2.md @@ -1,6 +1,7 @@ ## What's new Docling v2 introduces several new features: + - Understands and converts PDF, MS Word, MS Powerpoint, HTML and several image formats - Produces a new, universal document representation which can encapsulate document hierarchy - Comes with a fresh new API and CLI @@ -29,6 +30,7 @@ docling ./input/dir --output ./scratch --abort-on-error ``` **Notable changes from Docling v1:** + - The standalone switches for different export formats are removed, and replaced with `--from` and `--to` arguments, to define input and output formats respectively. - The new `--abort-on-error` will abort any batch conversion as soon an error is encountered - The `--backend` option for PDFs was removed @@ -168,6 +170,7 @@ conv_result.legacy_document # provides the representation in previous ExportedCC ## Export into JSON, Markdown, Doctags **Note**: All `render_...` methods in `ConversionResult` have been removed in Docling v2, and are now available on `DoclingDocument` as: + - `DoclingDocument.export_to_dict` - `DoclingDocument.export_to_markdown` - `DoclingDocument.export_to_document_tokens` diff --git a/mkdocs.yml b/mkdocs.yml index 3835e250..5fd180a4 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -54,6 +54,7 @@ nav: - Get started: - Home: index.md - Installation: installation.md + - Use Docling: use_docling.md - Docling v2: v2.md - Concepts: - The Docling Document format: concepts/docling_format.md