docs: add documentation on supported formats and backends

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-08-02 07:22:14 +00:00 · 2025-01-08 16:13:47 +01:00 · 2025-01-08 16:13:47 +01:00 · 550dbe1854
commit 550dbe1854
parent a3b8414622
2 changed files with 58 additions and 0 deletions
--- a/docs/examples/backend_xml_rag.ipynb
+++ b/docs/examples/backend_xml_rag.ipynb
@@ -0,0 +1,18 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Converting US Patents from XML files for a RAG application"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -24,6 +24,20 @@ docling https://arxiv.org/pdf/2206.01062
 To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](./reference/cli.md).
 ### Supported formats
 The document conversion in Docling supports several popular formats, including:
 - **PDF** (Portable Document Format): the format developed by Adobe to present documents compatible across application software, hardware, and operating systems.
 - **.docx**, **.xlsx**, **.pptx** (Word, Excel, and PowerPoint): the Open XML formats supported by Microsoft Office.
 - **Markdown**: a lightweight markup language to add formatting elements to plain text documents.
 - **AsciiDoc**: a plain text markup language for writing technical content.
 - **HTML** (Hypertext Markup Language): the standard markup language for creating web pages.
 - **XHTML** (Extensible Hypertext Markup Language): the XML-based version of HTML.
 - **XML** (Extensible Markup Language): a markup format for storing and transmitting data. Because XML is so flexible, Docling requires a custom implementation for each schema to identify the
 semantics of the data. Currently, Docling supports parsing [USPTO](https://www.uspto.gov/patents) patents and [PubMed Central® (PMC)](https://pmc.ncbi.nlm.nih.gov/) articles.
 ### Advanced options
 #### Adjust pipeline features
@@ -126,6 +140,32 @@ result = converter.convert(source)
 You can limit the number of CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. By default, Docling uses 4 CPU threads.
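As a minimal sketch, the variable can also be set from Python rather than from the shell. The value `"2"` is only an illustration, and the sketch assumes that the variable has to be set before Docling and its numerical dependencies are imported in order to take effect:

```python
import os

# Illustrative value: cap the thread pool at 2 threads.
# Assumption: this must run before importing docling, because the
# threading runtime typically reads the variable once at start-up.
os.environ["OMP_NUM_THREADS"] = "2"

# Import docling and create the converter only after this point.
print(os.environ["OMP_NUM_THREADS"])
```

Setting the variable in the shell (for example, exporting it before running `docling`) achieves the same thing without touching the code.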
 #### Use specific backend converters
 By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](#supported-formats)).
 You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](./examples/run_with_formats.py) example.
 Alternatively, you can instantiate the specific backend that matches your document content yourself. For instance, you can use `HTMLDocumentBackend` for HTML pages:
 ```python
 import urllib.request
 from io import BytesIO
 from docling.backend.html_backend import HTMLDocumentBackend
 from docling.datamodel.base_models import InputFormat
 from docling.datamodel.document import InputDocument
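 # Download the raw HTML of the page as bytes
 url = "https://en.wikipedia.org/wiki/Duck"
 text = urllib.request.urlopen(url).read()
 # Describe the input document and bind it to the HTML backend
 in_doc = InputDocument(
     path_or_stream=BytesIO(text),
     format=InputFormat.HTML,
     backend=HTMLDocumentBackend,
     filename="duck.html",
 )
 # Instantiate the backend directly and convert the page
 backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text))
 result = backend.convert()
 # Export the converted document to Markdown
 print(result.export_to_markdown())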
 ```
 ## Chunking
 You can chunk a Docling document using a [chunker](concepts/chunking.md), such as a