diff --git a/docs/getting_started/index.md b/docs/getting_started/index.md new file mode 100644 index 00000000..2a03212b --- /dev/null +++ b/docs/getting_started/index.md @@ -0,0 +1,17 @@ +🐣 Ready to kick off your Docling journey? Let's dive right into it! + +
+ ⬇️ Installation
Quickly install Docling in your environment
+ ▶️ Usage
Get a jumpstart on basic Docling usage
+ 🧩 Concepts
Learn Docling fundamentals and get a glimpse under the hood
+ 🧑🏽‍🍳 Examples
Try out recipes for various use cases, including conversion, RAG, and more
+ 🤖 Integrations
Check out integrations with popular AI tools and frameworks
+ 📖 Reference
See more API details
+
+ +## What's next + +🚀 The journey has just begun! Join us and become a part of the growing Docling community! + +- :fontawesome-brands-github: GitHub +- :fontawesome-brands-linkedin: LinkedIn diff --git a/docs/index.md b/docs/index.md index 768612ad..279d084c 100644 --- a/docs/index.md +++ b/docs/index.md @@ -40,16 +40,11 @@ Docling simplifies document processing, parsing diverse formats — including ad ## Get started -
- Concepts
Learn Docling fundamentals
- Examples
Try out recipes for various use cases, including conversion, RAG, and more
- Integrations
Check out integrations with popular frameworks and tools
- Reference
See more API details
-
+Check out our [getting started](./getting_started/index.md) page to get the ball rolling! ## Live assistant -Do you want to leverage the power of AI and get a live support on Docling? +Do you want to leverage the power of AI and get live support on Docling? Try out the [Chat with Dosu](https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/ask?utm_source=github) functionalities provided by our friends at [Dosu](https://dosu.dev/). [![Chat with Dosu](https://dosu.dev/dosu-chat-badge.svg)](https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/ask?utm_source=github) diff --git a/docs/usage/advanced_options.md b/docs/usage/advanced_options.md new file mode 100644 index 00000000..fbe9362a --- /dev/null +++ b/docs/usage/advanced_options.md @@ -0,0 +1,192 @@ +## Model prefetching and offline usage + +By default, models are downloaded automatically upon first usage. If you would prefer +to explicitly prefetch them for offline use (e.g. in air-gapped environments) you can do +that as follows: + +**Step 1: Prefetch the models** + +Use the `docling-tools models download` utility: + +```sh +$ docling-tools models download +Downloading layout model... +Downloading tableformer model... +Downloading picture classifier model... +Downloading code formula model... +Downloading easyocr models... +Models downloaded into $HOME/.cache/docling/models. +``` + +Alternatively, models can be programmatically downloaded using `docling.utils.model_downloader.download_models()`. 
+ +**Step 2: Use the prefetched models** + +```python +from docling.datamodel.base_models import InputFormat +from docling.datamodel.pipeline_options import PdfPipelineOptions +from docling.document_converter import DocumentConverter, PdfFormatOption + +artifacts_path = "/local/path/to/models" + +pipeline_options = PdfPipelineOptions(artifacts_path=artifacts_path) +doc_converter = DocumentConverter( + format_options={ + InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) + } +) +``` + +Or using the CLI: + +```sh +docling --artifacts-path="/local/path/to/models" FILE +``` + +Or using the `DOCLING_ARTIFACTS_PATH` environment variable: + +```sh +export DOCLING_ARTIFACTS_PATH="/local/path/to/models" +python my_docling_script.py +``` + +## Using remote services + +The main purpose of Docling is to run local models that do not share any user data with remote services. +However, there are valid use cases for processing part of the pipeline using remote services, for example invoking OCR engines from cloud vendors or using hosted LLMs. + +Docling allows such models, but requires the user to explicitly opt in to communicating with external services. + +```py +from docling.datamodel.base_models import InputFormat +from docling.datamodel.pipeline_options import PdfPipelineOptions +from docling.document_converter import DocumentConverter, PdfFormatOption + +pipeline_options = PdfPipelineOptions(enable_remote_services=True) +doc_converter = DocumentConverter( + format_options={ + InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) + } +) +``` + +When the value `enable_remote_services=True` is not set, the system will raise an exception `OperationNotAllowed()`. + +_Note: This option is only related to the system sending user data to remote services. Control of pulling data (e.g. 
model weights) follows the logic described in [Model prefetching and offline usage](#model-prefetching-and-offline-usage)._ + +### List of remote model services + +The options in this list require explicitly setting `enable_remote_services=True` when processing documents. + +- `PictureDescriptionApiOptions`: Using vision models via API calls. + + +## Adjust pipeline features + +The example file [custom_convert.py](../examples/custom_convert.py) shows multiple ways +to adjust the conversion pipeline and its features. + +### Control PDF table extraction options + +You can control whether table structure recognition maps the recognized structure back to PDF cells (default) or uses text cells from the structure prediction itself. +The latter can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one. + + +```python +from docling.datamodel.base_models import InputFormat +from docling.document_converter import DocumentConverter, PdfFormatOption +from docling.datamodel.pipeline_options import PdfPipelineOptions + +pipeline_options = PdfPipelineOptions(do_table_structure=True) +pipeline_options.table_structure_options.do_cell_matching = False # uses text cells predicted from table structure model + +doc_converter = DocumentConverter( + format_options={ + InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) + } +) +``` + +Since Docling 1.16.0, you can choose which TableFormer mode to use: `TableFormerMode.FAST` (faster but less accurate) or `TableFormerMode.ACCURATE` (default), which gives better quality on difficult table structures. 
+ +```python +from docling.datamodel.base_models import InputFormat +from docling.document_converter import DocumentConverter, PdfFormatOption +from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode + +pipeline_options = PdfPipelineOptions(do_table_structure=True) +pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE # use more accurate TableFormer model + +doc_converter = DocumentConverter( + format_options={ + InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) + } +) +``` + + +## Impose limits on the document size + +You can limit the file size and the number of pages allowed per document: + +```python +from pathlib import Path +from docling.document_converter import DocumentConverter + +source = "https://arxiv.org/pdf/2408.09869" +converter = DocumentConverter() +result = converter.convert(source, max_num_pages=100, max_file_size=20971520) # 20971520 bytes = 20 MiB +``` + +## Convert from binary PDF streams + +You can convert PDFs from a binary stream instead of from the filesystem as follows: + +```python +from io import BytesIO +from docling.datamodel.base_models import DocumentStream +from docling.document_converter import DocumentConverter + +buf = BytesIO(your_binary_stream) # your_binary_stream: the PDF content as bytes +source = DocumentStream(name="my_doc.pdf", stream=buf) +converter = DocumentConverter() +result = converter.convert(source) +``` + +## Limit resource usage + +You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. By default, Docling uses 4 CPU threads. + + +## Use specific backend converters + +!!! note + + This section discusses directly invoking a [backend](../concepts/architecture.md), + i.e. using a low-level API. This should only be done when necessary. For most cases, + using a `DocumentConverter` (high-level API) as discussed in the sections above + should suffice and is the recommended way. 
+ +By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](supported_formats.md)). +You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](../examples/run_with_formats.py) example. +Alternatively, you can also use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages: + +```python +import urllib.request +from io import BytesIO +from docling.backend.html_backend import HTMLDocumentBackend +from docling.datamodel.base_models import InputFormat +from docling.datamodel.document import InputDocument + +url = "https://en.wikipedia.org/wiki/Duck" +text = urllib.request.urlopen(url).read() +in_doc = InputDocument( + path_or_stream=BytesIO(text), + format=InputFormat.HTML, + backend=HTMLDocumentBackend, + filename="duck.html", +) +backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text)) +dl_doc = backend.convert() +print(dl_doc.export_to_markdown()) +``` diff --git a/docs/usage/index.md b/docs/usage/index.md index 37972e35..0eeb94a9 100644 --- a/docs/usage/index.md +++ b/docs/usage/index.md @@ -1,262 +1,45 @@ -## Conversion +## Basic usage -### Convert a single document +### Python -To convert individual PDF documents, use `convert()`, for example: +In Docling, working with documents is as simple as: + +1. converting your source file to a Docling document +2. 
using that Docling document for your workflow + +For example, the snippet below shows conversion with export to Markdown: ```python from docling.document_converter import DocumentConverter -source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL +source = "https://arxiv.org/pdf/2408.09869" # file path or URL converter = DocumentConverter() -result = converter.convert(source) -print(result.document.export_to_markdown()) # output: "### Docling Technical Report[...]" +doc = converter.convert(source).document + +print(doc.export_to_markdown()) # output: "### Docling Technical Report[...]" ``` +Docling supports a wide array of [file formats](./supported_formats.md) and, as outlined in the +[architecture](../concepts/architecture.md) guide, provides a versatile document model along with a full suite of +supported operations. + ### CLI -You can also use Docling directly from your command line to convert individual files —be it local or by URL— or whole directories. +You can additionally use Docling directly from your terminal, for instance: ```console docling https://arxiv.org/pdf/2206.01062 ``` -You can also use 🥚[SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview) and other VLMs via Docling CLI: + +The CLI provides various options, such as 🥚[SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview) (incl. MLX acceleration) & other VLMs: ```bash docling --pipeline vlm --vlm-model smoldocling https://arxiv.org/pdf/2206.01062 ``` -This will use MLX acceleration on supported Apple Silicon hardware. +For all available options, run `docling --help` or check the [CLI reference](../reference/cli.md). -To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](../reference/cli.md). +## What's next -### Advanced options - -#### Model prefetching and offline usage - -By default, models are downloaded automatically upon first usage. 
If you would prefer -to explicitly prefetch them for offline use (e.g. in air-gapped environments) you can do -that as follows: - -**Step 1: Prefetch the models** - -Use the `docling-tools models download` utility: - -```sh -$ docling-tools models download -Downloading layout model... -Downloading tableformer model... -Downloading picture classifier model... -Downloading code formula model... -Downloading easyocr models... -Models downloaded into $HOME/.cache/docling/models. -``` - -Alternatively, models can be programmatically downloaded using `docling.utils.model_downloader.download_models()`. - -**Step 2: Use the prefetched models** - -```python -from docling.datamodel.base_models import InputFormat -from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions -from docling.document_converter import DocumentConverter, PdfFormatOption - -artifacts_path = "/local/path/to/models" - -pipeline_options = PdfPipelineOptions(artifacts_path=artifacts_path) -doc_converter = DocumentConverter( - format_options={ - InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) - } -) -``` - -Or using the CLI: - -```sh -docling --artifacts-path="/local/path/to/models" FILE -``` - -Or using the `DOCLING_ARTIFACTS_PATH` environment variable: - -```sh -export DOCLING_ARTIFACTS_PATH="/local/path/to/models" -python my_docling_script.py -``` - -#### Using remote services - -The main purpose of Docling is to run local models which are not sharing any user data with remote services. -Anyhow, there are valid use cases for processing part of the pipeline using remote services, for example invoking OCR engines from cloud vendors or the usage of hosted LLMs. - -In Docling we decided to allow such models, but we require the user to explicitly opt-in in communicating with external services. 
- -```py -from docling.datamodel.base_models import InputFormat -from docling.datamodel.pipeline_options import PdfPipelineOptions -from docling.document_converter import DocumentConverter, PdfFormatOption - -pipeline_options = PdfPipelineOptions(enable_remote_services=True) -doc_converter = DocumentConverter( - format_options={ - InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) - } -) -``` - -When the value `enable_remote_services=True` is not set, the system will raise an exception `OperationNotAllowed()`. - -_Note: This option is only related to the system sending user data to remote services. Control of pulling data (e.g. model weights) follows the logic described in [Model prefetching and offline usage](#model-prefetching-and-offline-usage)._ - -##### List of remote model services - -The options in this list require the explicit `enable_remote_services=True` when processing the documents. - -- `PictureDescriptionApiOptions`: Using vision models via API calls. - - -#### Adjust pipeline features - -The example file [custom_convert.py](../examples/custom_convert.py) contains multiple ways -one can adjust the conversion pipeline and features. - -##### Control PDF table extraction options - -You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself. -This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one. 
- - -```python -from docling.datamodel.base_models import InputFormat -from docling.document_converter import DocumentConverter, PdfFormatOption -from docling.datamodel.pipeline_options import PdfPipelineOptions - -pipeline_options = PdfPipelineOptions(do_table_structure=True) -pipeline_options.table_structure_options.do_cell_matching = False # uses text cells predicted from table structure model - -doc_converter = DocumentConverter( - format_options={ - InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) - } -) -``` - -Since docling 1.16.0: You can control which TableFormer mode you want to use. Choose between `TableFormerMode.FAST` (faster but less accurate) and `TableFormerMode.ACCURATE` (default) to receive better quality with difficult table structures. - -```python -from docling.datamodel.base_models import InputFormat -from docling.document_converter import DocumentConverter, PdfFormatOption -from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode - -pipeline_options = PdfPipelineOptions(do_table_structure=True) -pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE # use more accurate TableFormer model - -doc_converter = DocumentConverter( - format_options={ - InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) - } -) -``` - - -#### Impose limits on the document size - -You can limit the file size and number of pages which should be allowed to process per document: - -```python -from pathlib import Path -from docling.document_converter import DocumentConverter - -source = "https://arxiv.org/pdf/2408.09869" -converter = DocumentConverter() -result = converter.convert(source, max_num_pages=100, max_file_size=20971520) -``` - -#### Convert from binary PDF streams - -You can convert PDFs from a binary stream instead of from the filesystem as follows: - -```python -from io import BytesIO -from docling.datamodel.base_models import DocumentStream -from docling.document_converter import 
DocumentConverter - -buf = BytesIO(your_binary_stream) -source = DocumentStream(name="my_doc.pdf", stream=buf) -converter = DocumentConverter() -result = converter.convert(source) -``` - -#### Limit resource usage - -You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads. - - -#### Use specific backend converters - -!!! note - - This section discusses directly invoking a [backend](../concepts/architecture.md), - i.e. using a low-level API. This should only be done when necessary. For most cases, - using a `DocumentConverter` (high-level API) as discussed in the sections above - should suffice — and is the recommended way. - -By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](supported_formats.md)). -You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](../examples/run_with_formats.py) example. -Alternatively, you can also use the specific backend that matches your document content. 
For instance, you can use `HTMLDocumentBackend` for HTML pages: - -```python -import urllib.request -from io import BytesIO -from docling.backend.html_backend import HTMLDocumentBackend -from docling.datamodel.base_models import InputFormat -from docling.datamodel.document import InputDocument - -url = "https://en.wikipedia.org/wiki/Duck" -text = urllib.request.urlopen(url).read() -in_doc = InputDocument( - path_or_stream=BytesIO(text), - format=InputFormat.HTML, - backend=HTMLDocumentBackend, - filename="duck.html", -) -backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text)) -dl_doc = backend.convert() -print(dl_doc.export_to_markdown()) -``` - -## Chunking - -You can chunk a Docling document using a [chunker](../concepts/chunking.md), such as a -`HybridChunker`, as shown below (for more details check out -[this example](../examples/hybrid_chunking.ipynb)): - -```python -from docling.document_converter import DocumentConverter -from docling.chunking import HybridChunker - -conv_res = DocumentConverter().convert("https://arxiv.org/pdf/2206.01062") -doc = conv_res.document - -chunker = HybridChunker(tokenizer="BAAI/bge-small-en-v1.5") # set tokenizer as needed -chunk_iter = chunker.chunk(doc) -``` - -An example chunk would look like this: - -```python -print(list(chunk_iter)[11]) -# { -# "text": "In this paper, we present the DocLayNet dataset. [...]", -# "meta": { -# "doc_items": [{ -# "self_ref": "#/texts/28", -# "label": "text", -# "prov": [{ -# "page_no": 2, -# "bbox": {"l": 53.29, "t": 287.14, "r": 295.56, "b": 212.37, ...}, -# }], ..., -# }, ...], -# "headings": ["1 INTRODUCTION"], -# } -# } -``` +Check out the Usage subpages (navigation menu on the left) as well as our [featured examples](../examples/index.md) for +additional usage workflows, including conversion customization, RAG, framework integrations, chunking, serialization, +enrichments, and much more! 
diff --git a/mkdocs.yml b/mkdocs.yml index 036af90c..5f03c157 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -54,10 +54,13 @@ theme: nav: - Home: - "Docling": index.md + - Getting started: + - Getting started: getting_started/index.md - Installation: - Installation: installation/index.md - Usage: - Usage: usage/index.md + - Advanced options: usage/advanced_options.md - Supported formats: usage/supported_formats.md - Enrichment features: usage/enrichments.md - Vision models: usage/vision_models.md @@ -156,6 +159,9 @@ markdown_extensions: slugify: !!python/object/apply:pymdownx.slugs.slugify kwds: case: lower + - pymdownx.emoji: + emoji_index: !!python/name:material.extensions.emoji.twemoji + emoji_generator: !!python/name:material.extensions.emoji.to_svg - admonition - pymdownx.details - attr_list @@ -175,3 +181,10 @@ plugins: extra_css: - stylesheets/extra.css + +extra: + social: + - icon: fontawesome/brands/github + link: https://github.com/docling-project/docling + - icon: fontawesome/brands/linkedin + link: https://linkedin.com/company/docling/