mirror of https://github.com/DS4SD/docling.git (synced 2025-07-31 14:34:40 +00:00)
minor reorg of top-level docs
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
parent 712fab8235
commit f587e3cee8
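The link changes in the hunks below follow from pages moving one directory deeper (for example, the `mkdocs.yml` hunk replaces `faq.md` with `faq/index.md`). As a minimal sketch, not part of the commit, the following Python shows why a relative link then needs a leading `../`. The helper `resolve_link` is hypothetical, and the page paths are assumed to be relative to the docs root.

```python
import posixpath


def resolve_link(page: str, link: str) -> str:
    """Resolve a relative Markdown link against the page that contains it.

    Hypothetical helper for illustration; both paths are assumed to be
    relative to the docs root and to use forward slashes.
    """
    # Drop any "#anchor" suffix; only the file part affects resolution.
    target = link.split("#", 1)[0]
    return posixpath.normpath(posixpath.join(posixpath.dirname(page), target))


# Before the reorg: faq.md sits at the docs root.
before = resolve_link("faq.md", "./concepts/chunking.md#hybrid-chunker")

# After the move, the unchanged link would resolve inside faq/ ...
stale = resolve_link("faq/index.md", "./concepts/chunking.md#hybrid-chunker")

# ... so the link gains a leading "../" to reach the same file again.
after = resolve_link("faq/index.md", "../concepts/chunking.md#hybrid-chunker")

print(before, stale, after)
```

With these assumed paths, `before` and `after` resolve to the same file, while `stale` points to a location under `faq/`.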
@ -123,6 +123,6 @@ For individual model usage, please refer to the model licenses found in the orig
|
|||||||
|
|
||||||
Docling has been brought to you by IBM.
|
Docling has been brought to you by IBM.
|
||||||
|
|
||||||
[supported_formats]: https://ds4sd.github.io/docling/supported_formats/
|
[supported_formats]: https://ds4sd.github.io/docling/usage/supported_formats/
|
||||||
[docling_document]: https://ds4sd.github.io/docling/concepts/docling_document/
|
[docling_document]: https://ds4sd.github.io/docling/concepts/docling_document/
|
||||||
[integrations]: https://ds4sd.github.io/docling/integrations/
|
[integrations]: https://ds4sd.github.io/docling/integrations/
|
||||||
|
@ -1,6 +1,6 @@
|
|||||||
# WARNING
|
# WARNING
|
||||||
# This example demonstrates only how to develop a new enrichment model.
|
# This example demonstrates only how to develop a new enrichment model.
|
||||||
# It does not run thr actual formula understanding model.
|
# It does not run the actual formula understanding model.
|
||||||
|
|
||||||
import logging
|
import logging
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
@ -1,6 +1,6 @@
|
|||||||
# WARNING
|
# WARNING
|
||||||
# This example demonstrates only how to develop a new enrichment model.
|
# This example demonstrates only how to develop a new enrichment model.
|
||||||
# It does not run thr actual picture classifier model.
|
# It does not run the actual picture classifier model.
|
||||||
|
|
||||||
import logging
|
import logging
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
@ -149,7 +149,7 @@ This is a collection of FAQ collected from the user questions on <https://github
|
|||||||
|
|
||||||
**Details**:
|
**Details**:
|
||||||
|
|
||||||
Using the [`HybridChunker`](./concepts/chunking.md#hybrid-chunker) often triggers a warning like this:
|
Using the [`HybridChunker`](../concepts/chunking.md#hybrid-chunker) often triggers a warning like this:
|
||||||
> Token indices sequence length is longer than the specified maximum sequence length for this model (531 > 512). Running this sequence through the model will result in indexing errors
|
> Token indices sequence length is longer than the specified maximum sequence length for this model (531 > 512). Running this sequence through the model will result in indexing errors
|
||||||
|
|
||||||
This is a warning that is emitted by transformers, saying that actually *running this sequence through the model* will result in indexing errors, i.e. the problematic case is only if one indeed passes the particular sequence through the (embedding) model.
|
This is a warning that is emitted by transformers, saying that actually *running this sequence through the model* will result in indexing errors, i.e. the problematic case is only if one indeed passes the particular sequence through the (embedding) model.
|
@ -47,6 +47,6 @@ Docling simplifies document processing, parsing diverse formats — including ad
|
|||||||
|
|
||||||
Docling has been brought to you by IBM.
|
Docling has been brought to you by IBM.
|
||||||
|
|
||||||
[supported_formats]: ./supported_formats.md
|
[supported_formats]: ./usage/supported_formats.md
|
||||||
[docling_document]: ./concepts/docling_document.md
|
[docling_document]: ./concepts/docling_document.md
|
||||||
[integrations]: ./integrations/index.md
|
[integrations]: ./integrations/index.md
|
||||||
|
@ -6,10 +6,10 @@ The following table provides an overview of the default enrichment models availa
|
|||||||
|
|
||||||
| Feature | Parameter | Processed item | Description |
|
| Feature | Parameter | Processed item | Description |
|
||||||
| ------- | --------- | ---------------| ----------- |
|
| ------- | --------- | ---------------| ----------- |
|
||||||
| Code understanding | `do_code_enrichment` | `CodeItem` | See [docs below](#code-understanding). |
|
| Code understanding | `do_code_enrichment` | `CodeItem` | See [docs below](#code-understanding). |
|
||||||
| Formula understanding | `do_formula_enrichment` | `TextItem` with label `FORMULA` | See [docs below](#formula-understanding). |
|
| Formula understanding | `do_formula_enrichment` | `TextItem` with label `FORMULA` | See [docs below](#formula-understanding). |
|
||||||
| Picture classification | `do_picture_classification` | `PictureItem` | See [docs below](#picture-classification). |
|
| Picture classification | `do_picture_classification` | `PictureItem` | See [docs below](#picture-classification). |
|
||||||
| Picture description | `do_picture_description` | `PictureItem` | See [docs below](#picture-description). |
|
| Picture description | `do_picture_description` | `PictureItem` | See [docs below](#picture-description). |
|
||||||
|
|
||||||
|
|
||||||
## Enrichments details
|
## Enrichments details
|
||||||
@ -204,7 +204,7 @@ pipeline_options.picture_description_options = PictureDescriptionApiOptions(
|
|||||||
|
|
||||||
End-to-end code snippets for cloud providers are available in the examples section:
|
End-to-end code snippets for cloud providers are available in the examples section:
|
||||||
|
|
||||||
- [IBM watsonx.ai](./examples/pictures_description_api.py)
|
- [IBM watsonx.ai](../examples/pictures_description_api.py)
|
||||||
|
|
||||||
|
|
||||||
## Develop new enrichment models
|
## Develop new enrichment models
|
||||||
@ -212,5 +212,5 @@ End-to-end code snippets for cloud providers are available in the examples secti
|
|||||||
Beside looking at the implementation of all the models listed above, the Docling documentation has a few examples
|
Beside looking at the implementation of all the models listed above, the Docling documentation has a few examples
|
||||||
dedicated to the implementation of enrichment models.
|
dedicated to the implementation of enrichment models.
|
||||||
|
|
||||||
- [Develop picture enrichment](./examples/develop_picture_enrichment.py)
|
- [Develop picture enrichment](../examples/develop_picture_enrichment.py)
|
||||||
- [Develop formula enrichment](./examples/develop_formula_understanding.py)
|
- [Develop formula enrichment](../examples/develop_formula_understanding.py)
|
@ -1,3 +1,4 @@
|
|||||||
|
|
||||||
## Conversion
|
## Conversion
|
||||||
|
|
||||||
### Convert a single document
|
### Convert a single document
|
||||||
@ -22,7 +23,7 @@ A simple example would look like this:
|
|||||||
docling https://arxiv.org/pdf/2206.01062
|
docling https://arxiv.org/pdf/2206.01062
|
||||||
```
|
```
|
||||||
|
|
||||||
To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](./reference/cli.md).
|
To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](../reference/cli.md).
|
||||||
|
|
||||||
### Advanced options
|
### Advanced options
|
||||||
|
|
||||||
@ -104,7 +105,7 @@ The options in this list require the explicit `enable_remote_services=True` when
|
|||||||
|
|
||||||
#### Adjust pipeline features
|
#### Adjust pipeline features
|
||||||
|
|
||||||
The example file [custom_convert.py](./examples/custom_convert.py) contains multiple ways
|
The example file [custom_convert.py](../examples/custom_convert.py) contains multiple ways
|
||||||
one can adjust the conversion pipeline and features.
|
one can adjust the conversion pipeline and features.
|
||||||
|
|
||||||
##### Control PDF table extraction options
|
##### Control PDF table extraction options
|
||||||
@ -183,13 +184,13 @@ You can limit the CPU threads used by Docling by setting the environment variabl
|
|||||||
|
|
||||||
!!! note
|
!!! note
|
||||||
|
|
||||||
This section discusses directly invoking a [backend](./concepts/architecture.md),
|
This section discusses directly invoking a [backend](../concepts/architecture.md),
|
||||||
i.e. using a low-level API. This should only be done when necessary. For most cases,
|
i.e. using a low-level API. This should only be done when necessary. For most cases,
|
||||||
using a `DocumentConverter` (high-level API) as discussed in the sections above
|
using a `DocumentConverter` (high-level API) as discussed in the sections above
|
||||||
should suffice — and is the recommended way.
|
should suffice — and is the recommended way.
|
||||||
|
|
||||||
By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](./supported_formats.md)).
|
By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](../supported_formats.md)).
|
||||||
You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](./examples/run_with_formats.py) example.
|
You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](../examples/run_with_formats.py) example.
|
||||||
Alternatively, you can also use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages:
|
Alternatively, you can also use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
@ -214,9 +215,9 @@ print(dl_doc.export_to_markdown())
|
|||||||
|
|
||||||
## Chunking
|
## Chunking
|
||||||
|
|
||||||
You can chunk a Docling document using a [chunker](concepts/chunking.md), such as a
|
You can chunk a Docling document using a [chunker](../concepts/chunking.md), such as a
|
||||||
`HybridChunker`, as shown below (for more details check out
|
`HybridChunker`, as shown below (for more details check out
|
||||||
[this example](examples/hybrid_chunking.ipynb)):
|
[this example](../examples/hybrid_chunking.ipynb)):
|
||||||
|
|
||||||
```python
|
```python
|
||||||
from docling.document_converter import DocumentConverter
|
from docling.document_converter import DocumentConverter
|
@ -1,6 +1,6 @@
|
|||||||
Docling can parse various documents formats into a unified representation (Docling
|
Docling can parse various documents formats into a unified representation (Docling
|
||||||
Document), which it can export to different formats too — check out
|
Document), which it can export to different formats too — check out
|
||||||
[Architecture](./concepts/architecture.md) for more details.
|
[Architecture](../concepts/architecture.md) for more details.
|
||||||
|
|
||||||
Below you can find a listing of all supported input and output formats.
|
Below you can find a listing of all supported input and output formats.
|
||||||
|
|
||||||
@ -22,7 +22,7 @@ Schema-specific support:
|
|||||||
|--------|-------------|
|
|--------|-------------|
|
||||||
| USPTO XML | XML format followed by [USPTO](https://www.uspto.gov/patents) patents |
|
| USPTO XML | XML format followed by [USPTO](https://www.uspto.gov/patents) patents |
|
||||||
| JATS XML | XML format followed by [JATS](https://jats.nlm.nih.gov/) articles |
|
| JATS XML | XML format followed by [JATS](https://jats.nlm.nih.gov/) articles |
|
||||||
| Docling JSON | JSON-serialized [Docling Document](./concepts/docling_document.md) |
|
| Docling JSON | JSON-serialized [Docling Document](../concepts/docling_document.md) |
|
||||||
|
|
||||||
## Supported output formats
|
## Supported output formats
|
||||||
|
|
23
mkdocs.yml
@ -54,12 +54,14 @@ theme:
|
|||||||
nav:
|
nav:
|
||||||
- Home:
|
- Home:
|
||||||
- "Docling": index.md
|
- "Docling": index.md
|
||||||
- Installation: installation.md
|
- Installation:
|
||||||
- Usage: usage.md
|
- Installation: installation/index.md
|
||||||
- Supported formats: supported_formats.md
|
- Usage:
|
||||||
- Enrichment features: enrichments.md
|
- Usage: usage/index.md
|
||||||
- FAQ: faq.md
|
- Supported formats: usage/supported_formats.md
|
||||||
- Docling v2: v2.md
|
- Enrichment features: usage/enrichments.md
|
||||||
|
- FAQ:
|
||||||
|
- FAQ: faq/index.md
|
||||||
- Concepts:
|
- Concepts:
|
||||||
- Concepts: concepts/index.md
|
- Concepts: concepts/index.md
|
||||||
- Architecture: concepts/architecture.md
|
- Architecture: concepts/architecture.md
|
||||||
@ -73,11 +75,8 @@ nav:
|
|||||||
- "Batch conversion": examples/batch_convert.py
|
- "Batch conversion": examples/batch_convert.py
|
||||||
- "Multi-format conversion": examples/run_with_formats.py
|
- "Multi-format conversion": examples/run_with_formats.py
|
||||||
- "Figure export": examples/export_figures.py
|
- "Figure export": examples/export_figures.py
|
||||||
- "Figure enrichment": examples/develop_picture_enrichment.py
|
|
||||||
- "Table export": examples/export_tables.py
|
- "Table export": examples/export_tables.py
|
||||||
- "Multimodal export": examples/export_multimodal.py
|
- "Multimodal export": examples/export_multimodal.py
|
||||||
- "Annotate picture with local vlm": examples/pictures_description.ipynb
|
|
||||||
- "Annotate picture with remote vlm": examples/pictures_description_api.py
|
|
||||||
- "Force full page OCR": examples/full_page_ocr.py
|
- "Force full page OCR": examples/full_page_ocr.py
|
||||||
- "Automatic OCR language detection with tesseract": examples/tesseract_lang_detection.py
|
- "Automatic OCR language detection with tesseract": examples/tesseract_lang_detection.py
|
||||||
- "RapidOCR with custom OCR models": examples/rapidocr_with_custom_models.py
|
- "RapidOCR with custom OCR models": examples/rapidocr_with_custom_models.py
|
||||||
@ -91,6 +90,12 @@ nav:
|
|||||||
- examples/rag_haystack.ipynb
|
- examples/rag_haystack.ipynb
|
||||||
- examples/rag_langchain.ipynb
|
- examples/rag_langchain.ipynb
|
||||||
- examples/rag_llamaindex.ipynb
|
- examples/rag_llamaindex.ipynb
|
||||||
|
- 🖼️ Picture annotation:
|
||||||
|
- "Annotate picture with local VLM": examples/pictures_description.ipynb
|
||||||
|
- "Annotate picture with remote VLM": examples/pictures_description_api.py
|
||||||
|
- ✨ Enrichment development:
|
||||||
|
- "Figure enrichment": examples/develop_picture_enrichment.py
|
||||||
|
- "Formula enrichment": examples/develop_formula_understanding.py
|
||||||
- 🗂️ More examples:
|
- 🗂️ More examples:
|
||||||
- examples/rag_weaviate.ipynb
|
- examples/rag_weaviate.ipynb
|
||||||
- RAG with Granite [↗]: https://github.com/ibm-granite-community/granite-snack-cookbook/blob/main/recipes/RAG/Granite_Docling_RAG.ipynb
|
- RAG with Granite [↗]: https://github.com/ibm-granite-community/granite-snack-cookbook/blob/main/recipes/RAG/Granite_Docling_RAG.ipynb
|
||||||
|