diff --git a/docs/getting_started/index.md b/docs/getting_started/index.md
new file mode 100644
index 00000000..2a03212b
--- /dev/null
+++ b/docs/getting_started/index.md
@@ -0,0 +1,17 @@
+🐣 Ready to kick off your Docling journey? Let's dive right into it!
+
+
+
+## What's next
+
+🚀 The journey has just begun! Join us and become a part of the growing Docling community!
+
+- :fontawesome-brands-github: [GitHub](https://github.com/docling-project/docling)
+- :fontawesome-brands-linkedin: [LinkedIn](https://linkedin.com/company/docling/)
diff --git a/docs/index.md b/docs/index.md
index 768612ad..279d084c 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -40,16 +40,11 @@ Docling simplifies document processing, parsing diverse formats — including ad
## Get started
-
+Check out our [getting started](./getting_started/index.md) page to get the ball rolling!
## Live assistant
-Do you want to leverage the power of AI and get a live support on Docling?
+Do you want to leverage the power of AI and get live support on Docling?
Try out the [Chat with Dosu](https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/ask?utm_source=github) functionalities provided by our friends at [Dosu](https://dosu.dev/).
[](https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/ask?utm_source=github)
diff --git a/docs/usage/advanced_options.md b/docs/usage/advanced_options.md
new file mode 100644
index 00000000..fbe9362a
--- /dev/null
+++ b/docs/usage/advanced_options.md
@@ -0,0 +1,192 @@
+## Model prefetching and offline usage
+
+By default, models are downloaded automatically on first use. To prefetch them
+explicitly for offline use (e.g. in air-gapped environments), follow the steps
+below:
+
+**Step 1: Prefetch the models**
+
+Use the `docling-tools models download` utility:
+
+```sh
+$ docling-tools models download
+Downloading layout model...
+Downloading tableformer model...
+Downloading picture classifier model...
+Downloading code formula model...
+Downloading easyocr models...
+Models downloaded into $HOME/.cache/docling/models.
+```
+
+Alternatively, models can be programmatically downloaded using `docling.utils.model_downloader.download_models()`.
+
+**Step 2: Use the prefetched models**
+
+```python
+from docling.datamodel.base_models import InputFormat
+from docling.datamodel.pipeline_options import PdfPipelineOptions
+from docling.document_converter import DocumentConverter, PdfFormatOption
+
+artifacts_path = "/local/path/to/models"
+
+pipeline_options = PdfPipelineOptions(artifacts_path=artifacts_path)
+doc_converter = DocumentConverter(
+    format_options={
+        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+    }
+)
+```
+
+Or using the CLI:
+
+```sh
+docling --artifacts-path="/local/path/to/models" FILE
+```
+
+Or using the `DOCLING_ARTIFACTS_PATH` environment variable:
+
+```sh
+export DOCLING_ARTIFACTS_PATH="/local/path/to/models"
+python my_docling_script.py
+```
+
+## Using remote services
+
+Docling is designed primarily to run models locally, so that no user data is shared with remote services.
+However, there are valid use cases for running part of the pipeline on remote services, such as invoking OCR engines from cloud vendors or using hosted LLMs.
+
+Docling allows such models, but requires the user to explicitly opt in to communicating with external services.
+
+```py
+from docling.datamodel.base_models import InputFormat
+from docling.datamodel.pipeline_options import PdfPipelineOptions
+from docling.document_converter import DocumentConverter, PdfFormatOption
+
+pipeline_options = PdfPipelineOptions(enable_remote_services=True)
+doc_converter = DocumentConverter(
+    format_options={
+        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+    }
+)
+```
+
+If `enable_remote_services=True` is not set and the pipeline needs a remote service, the system raises an `OperationNotAllowed()` exception.
+
+_Note: This option only governs whether the system sends user data to remote services. Pulling data (e.g. model weights) is controlled as described in [Model prefetching and offline usage](#model-prefetching-and-offline-usage)._
+
+### List of remote model services
+
+The options in this list require `enable_remote_services=True` to be set explicitly when processing documents.
+
+- `PictureDescriptionApiOptions`: Using vision models via API calls.
+
+
+## Adjust pipeline features
+
+The example file [custom_convert.py](../examples/custom_convert.py) shows several ways
+to adjust the conversion pipeline and its features.
+
+### Control PDF table extraction options
+
+You can control whether table structure recognition maps the recognized structure back to PDF cells (default) or uses the text cells from the structure prediction itself.
+The latter can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
+
+
+```python
+from docling.datamodel.base_models import InputFormat
+from docling.document_converter import DocumentConverter, PdfFormatOption
+from docling.datamodel.pipeline_options import PdfPipelineOptions
+
+pipeline_options = PdfPipelineOptions(do_table_structure=True)
+pipeline_options.table_structure_options.do_cell_matching = False # uses text cells predicted from table structure model
+
+doc_converter = DocumentConverter(
+    format_options={
+        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+    }
+)
+```
+
+Since docling 1.16.0, you can choose which TableFormer mode to use: `TableFormerMode.FAST` (faster but less accurate) or `TableFormerMode.ACCURATE` (default), which gives better quality on difficult table structures.
+
+```python
+from docling.datamodel.base_models import InputFormat
+from docling.document_converter import DocumentConverter, PdfFormatOption
+from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
+
+pipeline_options = PdfPipelineOptions(do_table_structure=True)
+pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE # use more accurate TableFormer model
+
+doc_converter = DocumentConverter(
+    format_options={
+        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+    }
+)
+```
+
+
+## Impose limits on the document size
+
+You can limit the file size and the number of pages allowed per document:
+
+```python
+from docling.document_converter import DocumentConverter
+
+source = "https://arxiv.org/pdf/2408.09869"
+converter = DocumentConverter()
+result = converter.convert(source, max_num_pages=100, max_file_size=20971520)  # 20971520 bytes = 20 MiB
+```
+
+## Convert from binary PDF streams
+
+You can convert PDFs from a binary stream instead of from the filesystem as follows:
+
+```python
+from io import BytesIO
+from docling.datamodel.base_models import DocumentStream
+from docling.document_converter import DocumentConverter
+
+buf = BytesIO(your_binary_stream)  # placeholder: the raw bytes of your PDF
+source = DocumentStream(name="my_doc.pdf", stream=buf)
+converter = DocumentConverter()
+result = converter.convert(source)
+```
+
+## Limit resource usage
+
+You can limit the number of CPU threads used by Docling by setting the `OMP_NUM_THREADS` environment variable. By default, Docling uses 4 CPU threads.
+
+
+## Use specific backend converters
+
+!!! note
+
+    This section discusses directly invoking a [backend](../concepts/architecture.md),
+    i.e. using a low-level API. This should only be done when necessary. For most cases,
+    using a `DocumentConverter` (high-level API) as discussed in the sections above
+    should suffice and is the recommended way.
+
+By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](supported_formats.md)).
+You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](../examples/run_with_formats.py) example.
+Alternatively, you can also use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages:
+
+```python
+import urllib.request
+from io import BytesIO
+from docling.backend.html_backend import HTMLDocumentBackend
+from docling.datamodel.base_models import InputFormat
+from docling.datamodel.document import InputDocument
+
+url = "https://en.wikipedia.org/wiki/Duck"
+text = urllib.request.urlopen(url).read()
+in_doc = InputDocument(
+    path_or_stream=BytesIO(text),
+    format=InputFormat.HTML,
+    backend=HTMLDocumentBackend,
+    filename="duck.html",
+)
+backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text))
+dl_doc = backend.convert()
+print(dl_doc.export_to_markdown())
+```
diff --git a/docs/usage/index.md b/docs/usage/index.md
index 37972e35..0eeb94a9 100644
--- a/docs/usage/index.md
+++ b/docs/usage/index.md
@@ -1,262 +1,45 @@
-## Conversion
+## Basic usage
-### Convert a single document
+### Python
-To convert individual PDF documents, use `convert()`, for example:
+In Docling, working with documents is as simple as:
+
+1. converting your source file to a Docling document
+2. using that Docling document for your workflow
+
+For example, the snippet below shows conversion with export to Markdown:
```python
from docling.document_converter import DocumentConverter
-source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL
+source = "https://arxiv.org/pdf/2408.09869" # file path or URL
converter = DocumentConverter()
-result = converter.convert(source)
-print(result.document.export_to_markdown()) # output: "### Docling Technical Report[...]"
+doc = converter.convert(source).document
+
+print(doc.export_to_markdown()) # output: "### Docling Technical Report[...]"
```
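
Once a document has been converted, its Markdown export is an ordinary string and can be persisted like any other text. The sketch below is a minimal illustration that uses only the standard library; `save_markdown` and the output path are hypothetical names chosen for this example and are not part of Docling.

```python
from pathlib import Path


def save_markdown(markdown: str, out_path: Path) -> Path:
    """Write Markdown text to out_path (hypothetical helper, not a Docling API)."""
    # Create missing parent directories so nested output paths work.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(markdown, encoding="utf-8")
    return out_path


# With Docling installed, the text would come from the snippet above, e.g.:
#     save_markdown(doc.export_to_markdown(), Path("out/report.md"))
```

Writing with an explicit UTF-8 encoding avoids platform-dependent defaults when the converted document contains non-ASCII characters.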
+Docling supports a wide array of [file formats](./supported_formats.md) and, as outlined in the
+[architecture](../concepts/architecture.md) guide, provides a versatile document model along with a full suite of
+supported operations.
+
### CLI
-You can also use Docling directly from your command line to convert individual files —be it local or by URL— or whole directories.
+You can additionally use Docling directly from your terminal, for instance:
```console
docling https://arxiv.org/pdf/2206.01062
```
-You can also use 🥚[SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview) and other VLMs via Docling CLI:
+
+The CLI provides various options, for instance running 🥚[SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview) (incl. MLX acceleration) and other VLMs:
```bash
docling --pipeline vlm --vlm-model smoldocling https://arxiv.org/pdf/2206.01062
```
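
When the CLI needs to be driven from a script, the same invocation can be assembled as an argument list. The following is a rough sketch rather than an official wrapper: `build_docling_command` is a hypothetical helper, and it only uses the `--pipeline` and `--vlm-model` flags shown above.

```python
import shlex
from typing import List, Optional


def build_docling_command(
    source: str,
    pipeline: Optional[str] = None,
    vlm_model: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for the `docling` CLI (hypothetical helper)."""
    cmd = ["docling"]
    if pipeline is not None:
        cmd += ["--pipeline", pipeline]
    if vlm_model is not None:
        cmd += ["--vlm-model", vlm_model]
    cmd.append(source)  # the source (file path or URL) goes last
    return cmd


cmd = build_docling_command(
    "https://arxiv.org/pdf/2206.01062", pipeline="vlm", vlm_model="smoldocling"
)
print(shlex.join(cmd))  # shell-safe rendering of the command

# To actually run it (requires Docling to be installed):
#     import subprocess
#     subprocess.run(cmd, check=True)
```

Passing the arguments as a list, instead of a single shell string, avoids quoting problems with paths or URLs that contain special characters.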
-This will use MLX acceleration on supported Apple Silicon hardware.
+For all available options, run `docling --help` or check the [CLI reference](../reference/cli.md).
-To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](../reference/cli.md).
+## What's next
-### Advanced options
-
-#### Model prefetching and offline usage
-
-By default, models are downloaded automatically upon first usage. If you would prefer
-to explicitly prefetch them for offline use (e.g. in air-gapped environments) you can do
-that as follows:
-
-**Step 1: Prefetch the models**
-
-Use the `docling-tools models download` utility:
-
-```sh
-$ docling-tools models download
-Downloading layout model...
-Downloading tableformer model...
-Downloading picture classifier model...
-Downloading code formula model...
-Downloading easyocr models...
-Models downloaded into $HOME/.cache/docling/models.
-```
-
-Alternatively, models can be programmatically downloaded using `docling.utils.model_downloader.download_models()`.
-
-**Step 2: Use the prefetched models**
-
-```python
-from docling.datamodel.base_models import InputFormat
-from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
-from docling.document_converter import DocumentConverter, PdfFormatOption
-
-artifacts_path = "/local/path/to/models"
-
-pipeline_options = PdfPipelineOptions(artifacts_path=artifacts_path)
-doc_converter = DocumentConverter(
- format_options={
- InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
- }
-)
-```
-
-Or using the CLI:
-
-```sh
-docling --artifacts-path="/local/path/to/models" FILE
-```
-
-Or using the `DOCLING_ARTIFACTS_PATH` environment variable:
-
-```sh
-export DOCLING_ARTIFACTS_PATH="/local/path/to/models"
-python my_docling_script.py
-```
-
-#### Using remote services
-
-The main purpose of Docling is to run local models which are not sharing any user data with remote services.
-Anyhow, there are valid use cases for processing part of the pipeline using remote services, for example invoking OCR engines from cloud vendors or the usage of hosted LLMs.
-
-In Docling we decided to allow such models, but we require the user to explicitly opt-in in communicating with external services.
-
-```py
-from docling.datamodel.base_models import InputFormat
-from docling.datamodel.pipeline_options import PdfPipelineOptions
-from docling.document_converter import DocumentConverter, PdfFormatOption
-
-pipeline_options = PdfPipelineOptions(enable_remote_services=True)
-doc_converter = DocumentConverter(
- format_options={
- InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
- }
-)
-```
-
-When the value `enable_remote_services=True` is not set, the system will raise an exception `OperationNotAllowed()`.
-
-_Note: This option is only related to the system sending user data to remote services. Control of pulling data (e.g. model weights) follows the logic described in [Model prefetching and offline usage](#model-prefetching-and-offline-usage)._
-
-##### List of remote model services
-
-The options in this list require the explicit `enable_remote_services=True` when processing the documents.
-
-- `PictureDescriptionApiOptions`: Using vision models via API calls.
-
-
-#### Adjust pipeline features
-
-The example file [custom_convert.py](../examples/custom_convert.py) contains multiple ways
-one can adjust the conversion pipeline and features.
-
-##### Control PDF table extraction options
-
-You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself.
-This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
-
-
-```python
-from docling.datamodel.base_models import InputFormat
-from docling.document_converter import DocumentConverter, PdfFormatOption
-from docling.datamodel.pipeline_options import PdfPipelineOptions
-
-pipeline_options = PdfPipelineOptions(do_table_structure=True)
-pipeline_options.table_structure_options.do_cell_matching = False # uses text cells predicted from table structure model
-
-doc_converter = DocumentConverter(
- format_options={
- InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
- }
-)
-```
-
-Since docling 1.16.0: You can control which TableFormer mode you want to use. Choose between `TableFormerMode.FAST` (faster but less accurate) and `TableFormerMode.ACCURATE` (default) to receive better quality with difficult table structures.
-
-```python
-from docling.datamodel.base_models import InputFormat
-from docling.document_converter import DocumentConverter, PdfFormatOption
-from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
-
-pipeline_options = PdfPipelineOptions(do_table_structure=True)
-pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE # use more accurate TableFormer model
-
-doc_converter = DocumentConverter(
- format_options={
- InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
- }
-)
-```
-
-
-#### Impose limits on the document size
-
-You can limit the file size and number of pages which should be allowed to process per document:
-
-```python
-from pathlib import Path
-from docling.document_converter import DocumentConverter
-
-source = "https://arxiv.org/pdf/2408.09869"
-converter = DocumentConverter()
-result = converter.convert(source, max_num_pages=100, max_file_size=20971520)
-```
-
-#### Convert from binary PDF streams
-
-You can convert PDFs from a binary stream instead of from the filesystem as follows:
-
-```python
-from io import BytesIO
-from docling.datamodel.base_models import DocumentStream
-from docling.document_converter import DocumentConverter
-
-buf = BytesIO(your_binary_stream)
-source = DocumentStream(name="my_doc.pdf", stream=buf)
-converter = DocumentConverter()
-result = converter.convert(source)
-```
-
-#### Limit resource usage
-
-You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
-
-
-#### Use specific backend converters
-
-!!! note
-
- This section discusses directly invoking a [backend](../concepts/architecture.md),
- i.e. using a low-level API. This should only be done when necessary. For most cases,
- using a `DocumentConverter` (high-level API) as discussed in the sections above
- should suffice — and is the recommended way.
-
-By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](supported_formats.md)).
-You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](../examples/run_with_formats.py) example.
-Alternatively, you can also use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages:
-
-```python
-import urllib.request
-from io import BytesIO
-from docling.backend.html_backend import HTMLDocumentBackend
-from docling.datamodel.base_models import InputFormat
-from docling.datamodel.document import InputDocument
-
-url = "https://en.wikipedia.org/wiki/Duck"
-text = urllib.request.urlopen(url).read()
-in_doc = InputDocument(
- path_or_stream=BytesIO(text),
- format=InputFormat.HTML,
- backend=HTMLDocumentBackend,
- filename="duck.html",
-)
-backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text))
-dl_doc = backend.convert()
-print(dl_doc.export_to_markdown())
-```
-
-## Chunking
-
-You can chunk a Docling document using a [chunker](../concepts/chunking.md), such as a
-`HybridChunker`, as shown below (for more details check out
-[this example](../examples/hybrid_chunking.ipynb)):
-
-```python
-from docling.document_converter import DocumentConverter
-from docling.chunking import HybridChunker
-
-conv_res = DocumentConverter().convert("https://arxiv.org/pdf/2206.01062")
-doc = conv_res.document
-
-chunker = HybridChunker(tokenizer="BAAI/bge-small-en-v1.5") # set tokenizer as needed
-chunk_iter = chunker.chunk(doc)
-```
-
-An example chunk would look like this:
-
-```python
-print(list(chunk_iter)[11])
-# {
-# "text": "In this paper, we present the DocLayNet dataset. [...]",
-# "meta": {
-# "doc_items": [{
-# "self_ref": "#/texts/28",
-# "label": "text",
-# "prov": [{
-# "page_no": 2,
-# "bbox": {"l": 53.29, "t": 287.14, "r": 295.56, "b": 212.37, ...},
-# }], ...,
-# }, ...],
-# "headings": ["1 INTRODUCTION"],
-# }
-# }
-```
+Check out the Usage subpages (navigation menu on the left) as well as our [featured examples](../examples/index.md) for
+additional usage workflows, including conversion customization, RAG, framework integrations, chunking, serialization,
+enrichments, and much more!
diff --git a/mkdocs.yml b/mkdocs.yml
index 036af90c..5f03c157 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -54,10 +54,13 @@ theme:
nav:
- Home:
- "Docling": index.md
+ - Getting started:
+ - Getting started: getting_started/index.md
- Installation:
- Installation: installation/index.md
- Usage:
- Usage: usage/index.md
+ - Advanced options: usage/advanced_options.md
- Supported formats: usage/supported_formats.md
- Enrichment features: usage/enrichments.md
- Vision models: usage/vision_models.md
@@ -156,6 +159,9 @@ markdown_extensions:
slugify: !!python/object/apply:pymdownx.slugs.slugify
kwds:
case: lower
+ - pymdownx.emoji:
+ emoji_index: !!python/name:material.extensions.emoji.twemoji
+ emoji_generator: !!python/name:material.extensions.emoji.to_svg
- admonition
- pymdownx.details
- attr_list
@@ -175,3 +181,10 @@ plugins:
extra_css:
- stylesheets/extra.css
+
+extra:
+  social:
+    - icon: fontawesome/brands/github
+      link: https://github.com/docling-project/docling
+    - icon: fontawesome/brands/linkedin
+      link: https://linkedin.com/company/docling/