mirror of
https://github.com/DS4SD/docling.git
synced 2025-08-02 15:32:30 +00:00
docs: add documentation on supported formats and backends
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
parent
a3b8414622
commit
550dbe1854
18
docs/examples/backend_xml_rag.ipynb
Normal file
@ -0,0 +1,18 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Converting US Patents from XML files for a RAG application"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
@ -24,6 +24,20 @@ docling https://arxiv.org/pdf/2206.01062

To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](./reference/cli.md).

### Supported formats

Docling can convert several popular document formats, including:

- **PDF** (Portable Document Format): the format developed by Adobe to present documents consistently across application software, hardware, and operating systems.
- **.docx**, **.xlsx**, **.pptx** (Word, Excel, and PowerPoint): the Open XML formats supported by Microsoft Office.
- **Markdown**: a lightweight markup language to add formatting elements to plain text documents.
- **AsciiDoc**: a plain text markup language for writing technical content.
- **HTML** (Hypertext Markup Language): the standard markup language for creating web pages.
- **XHTML** (Extensible Hypertext Markup Language): the XML-based version of HTML.
- **XML** (Extensible Markup Language): a markup format for storing and transmitting data. Because XML is so flexible, Docling requires custom implementations to identify the
  semantics of the data. Currently, Docling supports parsing [USPTO](https://www.uspto.gov/patents) patents and [PubMed Central® (PMC)](https://pmc.ncbi.nlm.nih.gov/) articles.
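
To illustrate why a generic XML parser is not enough on its own, the following minimal sketch inspects the root element of an XML string to guess which schema-specific parser would be needed. It is not Docling's detection logic: the function name `guess_xml_flavor`, the flavor labels, and the root-tag mapping (`us-patent-grant` and `us-patent-application` for USPTO, `article` for JATS-style PMC files) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from root element name to a document flavor.
# These tag names are assumptions for illustration, not Docling's rules.
ROOT_TAG_TO_FLAVOR = {
    "us-patent-grant": "uspto",
    "us-patent-application": "uspto",
    "article": "pmc",
}


def guess_xml_flavor(xml_text: str) -> str:
    """Return a flavor label based on the root element, or 'unknown'."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return "unknown"
    # Drop a namespace prefix such as '{http://...}article' if present.
    tag = root.tag.rsplit("}", 1)[-1]
    return ROOT_TAG_TO_FLAVOR.get(tag, "unknown")


patent = "<us-patent-grant><invention-title>Widget</invention-title></us-patent-grant>"
paper = "<article><front><article-title>Ducks</article-title></front></article>"
print(guess_xml_flavor(patent))  # uspto
print(guess_xml_flavor(paper))   # pmc
```

Both samples parse with the same library, yet the title lives under a different element in each, which is why each schema needs its own extraction code.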

### Advanced options

#### Adjust pipeline features
@ -126,6 +140,32 @@ result = converter.convert(source)
You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. By default, Docling uses 4 CPU threads.
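
As a minimal sketch, the variable can also be set from Python itself. The value `2` is arbitrary, and setting the variable before importing Docling is an assumption based on how thread-count settings are typically read at start-up.

```python
import os

# Cap the number of CPU threads; "2" is an arbitrary example value.
# Assumption: set this before importing docling so the setting is picked up.
os.environ["OMP_NUM_THREADS"] = "2"

print(os.environ["OMP_NUM_THREADS"])
```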

#### Use specific backend converters

By default, Docling tries to identify the document format and apply the appropriate conversion backend (see the list of [supported formats](#supported-formats)).
You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](./examples/run_with_formats.py) example.
Alternatively, you can use the backend that matches your document content directly. For instance, you can use `HTMLDocumentBackend` for HTML pages:

```python
import urllib.request
from io import BytesIO
from docling.backend.html_backend import HTMLDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import InputDocument

# Download the raw HTML bytes of the page.
url = "https://en.wikipedia.org/wiki/Duck"
text = urllib.request.urlopen(url).read()

# Describe the input and bind it to the HTML backend.
in_doc = InputDocument(
    path_or_stream=BytesIO(text),
    format=InputFormat.HTML,
    backend=HTMLDocumentBackend,
    filename="duck.html",
)

# Run the backend directly and export the result as Markdown.
backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text))
result = backend.convert()
print(result.export_to_markdown())
```
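
The general idea of identifying a format and dispatching on it, optionally restricted to an allow-list, can be sketched with the standard library alone. This is not Docling's implementation: `EXTENSION_TO_FORMAT`, `detect_format`, and the extension-only detection are hypothetical simplifications for illustration.

```python
from pathlib import Path
from typing import Optional, Set

# Hypothetical extension-to-format table; a real detector would also inspect content.
EXTENSION_TO_FORMAT = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".md": "md",
    ".html": "html",
    ".xml": "xml",
}


def detect_format(filename: str, allowed: Optional[Set[str]] = None) -> Optional[str]:
    """Guess a format from the file extension, honoring an optional allow-list."""
    fmt = EXTENSION_TO_FORMAT.get(Path(filename).suffix.lower())
    if fmt is None:
        return None  # unknown extension
    if allowed is not None and fmt not in allowed:
        return None  # recognized, but excluded by the caller
    return fmt


print(detect_format("duck.html"))                   # html
print(detect_format("duck.html", allowed={"pdf"}))  # None
```

Returning `None` for both unknown and disallowed inputs keeps the sketch short; a fuller version might raise distinct errors so callers can tell the two cases apart.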

## Chunking

You can chunk a Docling document using a [chunker](concepts/chunking.md), such as a