mirror of
https://github.com/DS4SD/docling.git
synced 2025-12-08 12:48:28 +00:00
docs: add Getting Started page (#2113)
* docs: add Getting Started page
  Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* refactor usage
  Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* minor renaming
  Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
17
docs/getting_started/index.md
vendored
Normal file
@@ -0,0 +1,17 @@
🐣 Ready to kick off your Docling journey? Let's dive right into it!

<div class="grid">
<a href="../installation/" class="card"><b>⬇️ Installation</b><br />Quickly install Docling in your environment</a>
<a href="../usage/" class="card"><b>▶️ Usage</b><br />Get a jumpstart on basic Docling usage</a>
<a href="../concepts/" class="card"><b>🧩 Concepts</b><br />Learn Docling fundamentals and get a glimpse under the hood</a>
<a href="../examples/" class="card"><b>🧑🏽‍🍳 Examples</b><br />Try out recipes for various use cases, including conversion, RAG, and more</a>
<a href="../integrations/" class="card"><b>🤖 Integrations</b><br />Check out integrations with popular AI tools and frameworks</a>
<a href="../reference/document_converter/" class="card"><b>📖 Reference</b><br />See more API details</a>
</div>

## What's next

🚀 The journey has just begun! Join us and become a part of the growing Docling community!

- <a href="https://github.com/docling-project/docling">:fontawesome-brands-github: GitHub</a>
- <a href="https://linkedin.com/company/docling/">:fontawesome-brands-linkedin: LinkedIn</a>
9
docs/index.md
vendored
@@ -40,16 +40,11 @@ Docling simplifies document processing, parsing diverse formats — including ad
|
|||||||
|
|
||||||
## Get started
|
## Get started
|
||||||
|
|
||||||
<div class="grid">
|
Check out our [getting started](./getting_started/index.md) page to get the ball rolling!
|
||||||
<a href="concepts/" class="card"><b>Concepts</b><br />Learn Docling fundamentals</a>
|
|
||||||
<a href="examples/" class="card"><b>Examples</b><br />Try out recipes for various use cases, including conversion, RAG, and more</a>
|
|
||||||
<a href="integrations/" class="card"><b>Integrations</b><br />Check out integrations with popular frameworks and tools</a>
|
|
||||||
<a href="reference/document_converter/" class="card"><b>Reference</b><br />See more API details</a>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
## Live assistant
|
## Live assistant
|
||||||
|
|
||||||
Do you want to leverage the power of AI and get a live support on Docling?
|
Do you want to leverage the power of AI and get live support on Docling?
|
||||||
Try out the [Chat with Dosu](https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/ask?utm_source=github) functionalities provided by our friends at [Dosu](https://dosu.dev/).
|
Try out the [Chat with Dosu](https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/ask?utm_source=github) functionalities provided by our friends at [Dosu](https://dosu.dev/).
|
||||||
|
|
||||||
[](https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/ask?utm_source=github)
|
[](https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/ask?utm_source=github)
|
||||||
|
|||||||
192
docs/usage/advanced_options.md
vendored
Normal file
@@ -0,0 +1,192 @@
## Model prefetching and offline usage

By default, models are downloaded automatically upon first usage. If you would prefer
to explicitly prefetch them for offline use (e.g. in air-gapped environments), you can do
that as follows:

**Step 1: Prefetch the models**

Use the `docling-tools models download` utility:

```sh
$ docling-tools models download
Downloading layout model...
Downloading tableformer model...
Downloading picture classifier model...
Downloading code formula model...
Downloading easyocr models...
Models downloaded into $HOME/.cache/docling/models.
```

Alternatively, models can be programmatically downloaded using `docling.utils.model_downloader.download_models()`.

**Step 2: Use the prefetched models**

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Directory that holds the prefetched models
artifacts_path = "/local/path/to/models"

pipeline_options = PdfPipelineOptions(artifacts_path=artifacts_path)
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
```

Or using the CLI:

```sh
docling --artifacts-path="/local/path/to/models" FILE
```

Or using the `DOCLING_ARTIFACTS_PATH` environment variable:

```sh
export DOCLING_ARTIFACTS_PATH="/local/path/to/models"
python my_docling_script.py
```

|
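
The three options above all point Docling at the same directory. As a minimal sketch of how a script might pick that directory itself, the helper below prefers an explicit argument, then the `DOCLING_ARTIFACTS_PATH` environment variable, then the default cache location shown in the download output. The name `resolve_artifacts_path` and the precedence order are assumptions for illustration, not Docling's own resolution logic.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_artifacts_path(explicit: Optional[str] = None) -> Path:
    """Pick a model directory (hypothetical helper, not part of Docling)."""
    if explicit:
        return Path(explicit)
    env_value = os.environ.get("DOCLING_ARTIFACTS_PATH")
    if env_value:
        return Path(env_value)
    # Default location reported by `docling-tools models download`.
    return Path.home() / ".cache" / "docling" / "models"


artifacts_path = resolve_artifacts_path()
if not artifacts_path.is_dir():
    print(f"No prefetched models found at {artifacts_path}")
```

The resulting path could then be passed as `artifacts_path` to `PdfPipelineOptions`, as in Step 2.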
## Using remote services

Docling's main purpose is to run local models that do not share any user data with remote services.
However, there are valid use cases for processing part of the pipeline with remote services, for example invoking OCR engines from cloud vendors or using hosted LLMs.

In Docling we decided to allow such models, but we require the user to explicitly opt in to communicating with external services.

```py
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(enable_remote_services=True)
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
```

If a remote service is used without `enable_remote_services=True`, the system raises an `OperationNotAllowed()` exception.

_Note: This option is only related to the system sending user data to remote services. Control of pulling data (e.g. model weights) follows the logic described in [Model prefetching and offline usage](#model-prefetching-and-offline-usage)._
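
To illustrate the opt-in behaviour described above, here is a self-contained toy sketch. It does not call Docling: the `OperationNotAllowed` class defined here is only a stand-in for Docling's exception of the same name, and `call_remote_service` is a hypothetical function that mimics the guard, not the library's implementation.

```python
class OperationNotAllowed(Exception):
    """Stand-in for Docling's exception of the same name (illustration only)."""


def call_remote_service(enable_remote_services: bool, payload: str) -> str:
    """Hypothetical guard: refuse to send user data unless the caller opted in."""
    if not enable_remote_services:
        raise OperationNotAllowed(
            "Remote services are disabled; set enable_remote_services=True to opt in."
        )
    # A real pipeline would perform the API call here; this sketch only reports it.
    return f"would send {len(payload)} characters to a remote service"


try:
    call_remote_service(enable_remote_services=False, payload="page text")
except OperationNotAllowed as exc:
    print(f"Blocked: {exc}")
```

The point of the pattern is that the default path fails loudly, so user data never leaves the machine by accident.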

### List of remote model services

The options in this list require explicitly setting `enable_remote_services=True` when processing documents.

- `PictureDescriptionApiOptions`: Using vision models via API calls.

## Adjust pipeline features

The example file [custom_convert.py](../examples/custom_convert.py) shows multiple ways
to adjust the conversion pipeline and its features.

### Control PDF table extraction options

You can control whether table structure recognition maps the recognized structure back to PDF cells (default) or uses text cells from the structure prediction itself.
Using the predicted text cells can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.

```python
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions(do_table_structure=True)
pipeline_options.table_structure_options.do_cell_matching = False  # uses text cells predicted from table structure model

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
```

Since docling 1.16.0: You can control which TableFormer mode you want to use. Choose between `TableFormerMode.FAST` (faster but less accurate) and `TableFormerMode.ACCURATE` (default) to receive better quality with difficult table structures.

```python
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode

pipeline_options = PdfPipelineOptions(do_table_structure=True)
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # use more accurate TableFormer model

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
```

## Impose limits on the document size

You can limit the file size and the number of pages allowed per document:

```python
from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"
converter = DocumentConverter()
result = converter.convert(source, max_num_pages=100, max_file_size=20971520)
```

|
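
The `max_file_size` value in the snippet above reads as a byte count: 20971520 is 20 MiB (20 × 1024 × 1024). As a small sketch, a hypothetical helper named `mebibytes` (not part of Docling) can make such limits easier to read and change:

```python
def mebibytes(n: int) -> int:
    """Convert a size in MiB to bytes (hypothetical convenience helper)."""
    return n * 1024 * 1024


max_file_size = mebibytes(20)
print(max_file_size)  # 20971520
```

The result can be passed as `max_file_size=max_file_size` to `converter.convert(...)`.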
## Convert from binary PDF streams

You can convert PDFs from a binary stream instead of from the filesystem as follows:

```python
from io import BytesIO
from docling.datamodel.base_models import DocumentStream
from docling.document_converter import DocumentConverter

buf = BytesIO(your_binary_stream)  # `your_binary_stream` holds the raw PDF bytes
source = DocumentStream(name="my_doc.pdf", stream=buf)
converter = DocumentConverter()
result = converter.convert(source)
```

|
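
The name `your_binary_stream` in the snippet above is a placeholder for PDF content you already hold in memory as `bytes`. The following sketch shows one way such bytes might be obtained and wrapped. The temporary file and its placeholder content are assumptions made only so the example runs on its own; in practice the bytes would come from a real PDF, an upload, or a network response.

```python
import tempfile
from io import BytesIO
from pathlib import Path

# Write a few placeholder bytes so the sketch is self-contained;
# this is NOT a valid PDF, just a stand-in for real file content.
with tempfile.TemporaryDirectory() as tmp_dir:
    pdf_path = Path(tmp_dir) / "my_doc.pdf"
    pdf_path.write_bytes(b"%PDF-1.4 placeholder")

    # Read the raw bytes back, as you would for an existing file.
    your_binary_stream = pdf_path.read_bytes()

# Wrap the bytes in a seekable in-memory stream.
buf = BytesIO(your_binary_stream)
print(f"Stream holds {buf.getbuffer().nbytes} bytes")
```

The resulting `buf` is what would be handed to `DocumentStream(name="my_doc.pdf", stream=buf)`.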
## Limit resource usage

You can limit the number of CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. By default, Docling uses 4 CPU threads.

|
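
Besides exporting `OMP_NUM_THREADS` in the shell, the variable can be set from within a Python script. The sketch below assumes that setting it before Docling is imported is the safest ordering; the value `"2"` is an arbitrary example.

```python
import os

# Set the thread limit early, before importing Docling (assumed safest ordering).
os.environ["OMP_NUM_THREADS"] = "2"

# Import Docling only after the variable is in place, for example:
# from docling.document_converter import DocumentConverter

print(f"OMP_NUM_THREADS={os.environ['OMP_NUM_THREADS']}")
```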
## Use specific backend converters

!!! note

    This section discusses directly invoking a [backend](../concepts/architecture.md),
    i.e. using a low-level API. This should only be done when necessary. For most cases,
    using a `DocumentConverter` (high-level API) as discussed in the sections above
    should suffice, and it is the recommended way.

By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](supported_formats.md)).
You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](../examples/run_with_formats.py) example.
Alternatively, you can use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages:

```python
import urllib.request
from io import BytesIO
from docling.backend.html_backend import HTMLDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import InputDocument

url = "https://en.wikipedia.org/wiki/Duck"
text = urllib.request.urlopen(url).read()
in_doc = InputDocument(
    path_or_stream=BytesIO(text),
    format=InputFormat.HTML,
    backend=HTMLDocumentBackend,
    filename="duck.html",
)
backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text))
dl_doc = backend.convert()
print(dl_doc.export_to_markdown())
```
265
docs/usage/index.md
vendored
@@ -1,262 +1,45 @@
|
|||||||
## Conversion
|
## Basic usage
|
||||||
|
|
||||||
### Convert a single document
|
### Python
|
||||||
|
|
||||||
To convert individual PDF documents, use `convert()`, for example:
|
In Docling, working with documents is as simple as:
|
||||||
|
|
||||||
|
1. converting your source file to a Docling document
|
||||||
|
2. using that Docling document for your workflow
|
||||||
|
|
||||||
|
For example, the snippet below shows conversion with export to Markdown:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
from docling.document_converter import DocumentConverter
|
from docling.document_converter import DocumentConverter
|
||||||
|
|
||||||
source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL
|
source = "https://arxiv.org/pdf/2408.09869" # file path or URL
|
||||||
converter = DocumentConverter()
|
converter = DocumentConverter()
|
||||||
result = converter.convert(source)
|
doc = converter.convert(source).document
|
||||||
print(result.document.export_to_markdown()) # output: "### Docling Technical Report[...]"
|
|
||||||
|
print(doc.export_to_markdown()) # output: "### Docling Technical Report[...]"
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Docling supports a wide array of [file formats](./supported_formats.md) and, as outlined in the
|
||||||
|
[architecture](../concepts/architecture.md) guide, provides a versatile document model along with a full suite of
|
||||||
|
supported operations.
|
||||||
|
|
||||||
### CLI
|
### CLI
|
||||||
|
|
||||||
You can also use Docling directly from your command line to convert individual files —be it local or by URL— or whole directories.
|
You can additionally use Docling directly from your terminal, for instance:
|
||||||
|
|
||||||
```console
|
```console
|
||||||
docling https://arxiv.org/pdf/2206.01062
|
docling https://arxiv.org/pdf/2206.01062
|
||||||
```
|
```
|
||||||
You can also use 🥚[SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview) and other VLMs via Docling CLI:
|
|
||||||
|
The CLI provides various options, such as 🥚[SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview) (incl. MLX acceleration) & other VLMs:
|
||||||
```bash
|
```bash
|
||||||
docling --pipeline vlm --vlm-model smoldocling https://arxiv.org/pdf/2206.01062
|
docling --pipeline vlm --vlm-model smoldocling https://arxiv.org/pdf/2206.01062
|
||||||
```
|
```
|
||||||
This will use MLX acceleration on supported Apple Silicon hardware.
|
|
||||||
|
|
||||||
|
For all available options, run `docling --help` or check the [CLI reference](../reference/cli.md).
|
||||||
|
|
||||||
To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](../reference/cli.md).
|
## What's next
|
||||||
|
|
||||||
### Advanced options
|
Check out the Usage subpages (navigation menu on the left) as well as our [featured examples](../examples/index.md) for
|
||||||
|
additional usage workflows, including conversion customization, RAG, framework integrations, chunking, serialization,
|
||||||
#### Model prefetching and offline usage
|
enrichments, and much more!
|
||||||
|
|
||||||
By default, models are downloaded automatically upon first usage. If you would prefer
|
|
||||||
to explicitly prefetch them for offline use (e.g. in air-gapped environments) you can do
|
|
||||||
that as follows:
|
|
||||||
|
|
||||||
**Step 1: Prefetch the models**
|
|
||||||
|
|
||||||
Use the `docling-tools models download` utility:
|
|
||||||
|
|
||||||
```sh
|
|
||||||
$ docling-tools models download
|
|
||||||
Downloading layout model...
|
|
||||||
Downloading tableformer model...
|
|
||||||
Downloading picture classifier model...
|
|
||||||
Downloading code formula model...
|
|
||||||
Downloading easyocr models...
|
|
||||||
Models downloaded into $HOME/.cache/docling/models.
|
|
||||||
```
|
|
||||||
|
|
||||||
Alternatively, models can be programmatically downloaded using `docling.utils.model_downloader.download_models()`.
|
|
||||||
|
|
||||||
**Step 2: Use the prefetched models**
|
|
||||||
|
|
||||||
```python
|
|
||||||
from docling.datamodel.base_models import InputFormat
|
|
||||||
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
|
|
||||||
from docling.document_converter import DocumentConverter, PdfFormatOption
|
|
||||||
|
|
||||||
artifacts_path = "/local/path/to/models"
|
|
||||||
|
|
||||||
pipeline_options = PdfPipelineOptions(artifacts_path=artifacts_path)
|
|
||||||
doc_converter = DocumentConverter(
|
|
||||||
format_options={
|
|
||||||
InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
|
|
||||||
}
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
Or using the CLI:
|
|
||||||
|
|
||||||
```sh
|
|
||||||
docling --artifacts-path="/local/path/to/models" FILE
|
|
||||||
```
|
|
||||||
|
|
||||||
Or using the `DOCLING_ARTIFACTS_PATH` environment variable:
|
|
||||||
|
|
||||||
```sh
|
|
||||||
export DOCLING_ARTIFACTS_PATH="/local/path/to/models"
|
|
||||||
python my_docling_script.py
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Using remote services
|
|
||||||
|
|
||||||
The main purpose of Docling is to run local models which are not sharing any user data with remote services.
|
|
||||||
Anyhow, there are valid use cases for processing part of the pipeline using remote services, for example invoking OCR engines from cloud vendors or the usage of hosted LLMs.
|
|
||||||
|
|
||||||
In Docling we decided to allow such models, but we require the user to explicitly opt-in in communicating with external services.
|
|
||||||
|
|
||||||
```py
|
|
||||||
from docling.datamodel.base_models import InputFormat
|
|
||||||
from docling.datamodel.pipeline_options import PdfPipelineOptions
|
|
||||||
from docling.document_converter import DocumentConverter, PdfFormatOption
|
|
||||||
|
|
||||||
pipeline_options = PdfPipelineOptions(enable_remote_services=True)
|
|
||||||
doc_converter = DocumentConverter(
|
|
||||||
format_options={
|
|
||||||
InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
|
|
||||||
}
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
When the value `enable_remote_services=True` is not set, the system will raise an exception `OperationNotAllowed()`.
|
|
||||||
|
|
||||||
_Note: This option is only related to the system sending user data to remote services. Control of pulling data (e.g. model weights) follows the logic described in [Model prefetching and offline usage](#model-prefetching-and-offline-usage)._
|
|
||||||
|
|
||||||
##### List of remote model services
|
|
||||||
|
|
||||||
The options in this list require the explicit `enable_remote_services=True` when processing the documents.
|
|
||||||
|
|
||||||
- `PictureDescriptionApiOptions`: Using vision models via API calls.
|
|
||||||
|
|
||||||
|
|
||||||
#### Adjust pipeline features
|
|
||||||
|
|
||||||
The example file [custom_convert.py](../examples/custom_convert.py) contains multiple ways
|
|
||||||
one can adjust the conversion pipeline and features.
|
|
||||||
|
|
||||||
##### Control PDF table extraction options
|
|
||||||
|
|
||||||
You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself.
|
|
||||||
This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
|
|
||||||
|
|
||||||
|
|
||||||
```python
|
|
||||||
from docling.datamodel.base_models import InputFormat
|
|
||||||
from docling.document_converter import DocumentConverter, PdfFormatOption
|
|
||||||
from docling.datamodel.pipeline_options import PdfPipelineOptions
|
|
||||||
|
|
||||||
pipeline_options = PdfPipelineOptions(do_table_structure=True)
|
|
||||||
pipeline_options.table_structure_options.do_cell_matching = False # uses text cells predicted from table structure model
|
|
||||||
|
|
||||||
doc_converter = DocumentConverter(
|
|
||||||
format_options={
|
|
||||||
InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
|
|
||||||
}
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
Since docling 1.16.0: You can control which TableFormer mode you want to use. Choose between `TableFormerMode.FAST` (faster but less accurate) and `TableFormerMode.ACCURATE` (default) to receive better quality with difficult table structures.
|
|
||||||
|
|
||||||
```python
|
|
||||||
from docling.datamodel.base_models import InputFormat
|
|
||||||
from docling.document_converter import DocumentConverter, PdfFormatOption
|
|
||||||
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
|
|
||||||
|
|
||||||
pipeline_options = PdfPipelineOptions(do_table_structure=True)
|
|
||||||
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE # use more accurate TableFormer model
|
|
||||||
|
|
||||||
doc_converter = DocumentConverter(
|
|
||||||
format_options={
|
|
||||||
InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
|
|
||||||
}
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
|
|
||||||
#### Impose limits on the document size
|
|
||||||
|
|
||||||
You can limit the file size and number of pages which should be allowed to process per document:
|
|
||||||
|
|
||||||
```python
|
|
||||||
from pathlib import Path
|
|
||||||
from docling.document_converter import DocumentConverter
|
|
||||||
|
|
||||||
source = "https://arxiv.org/pdf/2408.09869"
|
|
||||||
converter = DocumentConverter()
|
|
||||||
result = converter.convert(source, max_num_pages=100, max_file_size=20971520)
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Convert from binary PDF streams
|
|
||||||
|
|
||||||
You can convert PDFs from a binary stream instead of from the filesystem as follows:
|
|
||||||
|
|
||||||
```python
|
|
||||||
from io import BytesIO
|
|
||||||
from docling.datamodel.base_models import DocumentStream
|
|
||||||
from docling.document_converter import DocumentConverter
|
|
||||||
|
|
||||||
buf = BytesIO(your_binary_stream)
|
|
||||||
source = DocumentStream(name="my_doc.pdf", stream=buf)
|
|
||||||
converter = DocumentConverter()
|
|
||||||
result = converter.convert(source)
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Limit resource usage
|
|
||||||
|
|
||||||
You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
|
|
||||||
|
|
||||||
|
|
||||||
#### Use specific backend converters
|
|
||||||
|
|
||||||
!!! note
|
|
||||||
|
|
||||||
This section discusses directly invoking a [backend](../concepts/architecture.md),
|
|
||||||
i.e. using a low-level API. This should only be done when necessary. For most cases,
|
|
||||||
using a `DocumentConverter` (high-level API) as discussed in the sections above
|
|
||||||
should suffice — and is the recommended way.
|
|
||||||
|
|
||||||
By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](supported_formats.md)).
|
|
||||||
You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](../examples/run_with_formats.py) example.
|
|
||||||
Alternatively, you can also use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages:
|
|
||||||
|
|
||||||
```python
|
|
||||||
import urllib.request
|
|
||||||
from io import BytesIO
|
|
||||||
from docling.backend.html_backend import HTMLDocumentBackend
|
|
||||||
from docling.datamodel.base_models import InputFormat
|
|
||||||
from docling.datamodel.document import InputDocument
|
|
||||||
|
|
||||||
url = "https://en.wikipedia.org/wiki/Duck"
|
|
||||||
text = urllib.request.urlopen(url).read()
|
|
||||||
in_doc = InputDocument(
|
|
||||||
path_or_stream=BytesIO(text),
|
|
||||||
format=InputFormat.HTML,
|
|
||||||
backend=HTMLDocumentBackend,
|
|
||||||
filename="duck.html",
|
|
||||||
)
|
|
||||||
backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text))
|
|
||||||
dl_doc = backend.convert()
|
|
||||||
print(dl_doc.export_to_markdown())
|
|
||||||
```
|
|
||||||
|
|
||||||
## Chunking
|
|
||||||
|
|
||||||
You can chunk a Docling document using a [chunker](../concepts/chunking.md), such as a
|
|
||||||
`HybridChunker`, as shown below (for more details check out
|
|
||||||
[this example](../examples/hybrid_chunking.ipynb)):
|
|
||||||
|
|
||||||
```python
|
|
||||||
from docling.document_converter import DocumentConverter
|
|
||||||
from docling.chunking import HybridChunker
|
|
||||||
|
|
||||||
conv_res = DocumentConverter().convert("https://arxiv.org/pdf/2206.01062")
|
|
||||||
doc = conv_res.document
|
|
||||||
|
|
||||||
chunker = HybridChunker(tokenizer="BAAI/bge-small-en-v1.5") # set tokenizer as needed
|
|
||||||
chunk_iter = chunker.chunk(doc)
|
|
||||||
```
|
|
||||||
|
|
||||||
An example chunk would look like this:
|
|
||||||
|
|
||||||
```python
|
|
||||||
print(list(chunk_iter)[11])
|
|
||||||
# {
|
|
||||||
# "text": "In this paper, we present the DocLayNet dataset. [...]",
|
|
||||||
# "meta": {
|
|
||||||
# "doc_items": [{
|
|
||||||
# "self_ref": "#/texts/28",
|
|
||||||
# "label": "text",
|
|
||||||
# "prov": [{
|
|
||||||
# "page_no": 2,
|
|
||||||
# "bbox": {"l": 53.29, "t": 287.14, "r": 295.56, "b": 212.37, ...},
|
|
||||||
# }], ...,
|
|
||||||
# }, ...],
|
|
||||||
# "headings": ["1 INTRODUCTION"],
|
|
||||||
# }
|
|
||||||
# }
|
|
||||||
```
|
|
||||||
|
|||||||
13
mkdocs.yml
@@ -54,10 +54,13 @@ theme:
|
|||||||
nav:
|
nav:
|
||||||
- Home:
|
- Home:
|
||||||
- "Docling": index.md
|
- "Docling": index.md
|
||||||
|
- Getting started:
|
||||||
|
- Getting started: getting_started/index.md
|
||||||
- Installation:
|
- Installation:
|
||||||
- Installation: installation/index.md
|
- Installation: installation/index.md
|
||||||
- Usage:
|
- Usage:
|
||||||
- Usage: usage/index.md
|
- Usage: usage/index.md
|
||||||
|
- Advanced options: usage/advanced_options.md
|
||||||
- Supported formats: usage/supported_formats.md
|
- Supported formats: usage/supported_formats.md
|
||||||
- Enrichment features: usage/enrichments.md
|
- Enrichment features: usage/enrichments.md
|
||||||
- Vision models: usage/vision_models.md
|
- Vision models: usage/vision_models.md
|
||||||
@@ -156,6 +159,9 @@ markdown_extensions:
|
|||||||
slugify: !!python/object/apply:pymdownx.slugs.slugify
|
slugify: !!python/object/apply:pymdownx.slugs.slugify
|
||||||
kwds:
|
kwds:
|
||||||
case: lower
|
case: lower
|
||||||
|
- pymdownx.emoji:
|
||||||
|
emoji_index: !!python/name:material.extensions.emoji.twemoji
|
||||||
|
emoji_generator: !!python/name:material.extensions.emoji.to_svg
|
||||||
- admonition
|
- admonition
|
||||||
- pymdownx.details
|
- pymdownx.details
|
||||||
- attr_list
|
- attr_list
|
||||||
@@ -175,3 +181,10 @@ plugins:
|
|||||||
|
|
||||||
extra_css:
|
extra_css:
|
||||||
- stylesheets/extra.css
|
- stylesheets/extra.css
|
||||||
|
|
||||||
|
extra:
|
||||||
|
social:
|
||||||
|
- icon: fontawesome/brands/github
|
||||||
|
link: https://github.com/docling-project/docling
|
||||||
|
- icon: fontawesome/brands/linkedin
|
||||||
|
link: https://linkedin.com/company/docling/
|
||||||
|
|||||||