diff --git a/docs/examples/backend_xml_rag.ipynb b/docs/examples/backend_xml_rag.ipynb
new file mode 100644
index 00000000..db08f24c
--- /dev/null
+++ b/docs/examples/backend_xml_rag.ipynb
@@ -0,0 +1,18 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Converting US Patents from XML files for a RAG application"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/docs/usage.md b/docs/usage.md
index 9a5b555a..824f0f22 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -24,6 +24,20 @@ docling https://arxiv.org/pdf/2206.01062
 
 To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](./reference/cli.md).
 
+### Supported formats
+
+The document conversion in Docling supports several popular formats, including:
+
+- **PDF** (Portable Document Format): a format developed by Adobe to present documents consistently across application software, hardware, and operating systems.
+- **.docx**, **.xlsx**, **.pptx** (Word, Excel, and PowerPoint): the Open XML formats supported by Microsoft Office.
+- **Markdown**: a lightweight markup language to add formatting elements to plain text documents.
+- **AsciiDoc**: a plain text markup language for writing technical content.
+- **HTML** (Hypertext Markup Language): the standard markup language for creating web pages.
+- **XHTML** (Extensible Hypertext Markup Language): the XML-based version of HTML.
+- **XML** (Extensible Markup Language): a markup format for storing and transmitting data. Because XML is so flexible, Docling requires custom implementations to identify the
+semantics of the data. Currently, Docling supports the parsing of [USPTO](https://www.uspto.gov/patents) patents and [PubMed Central® (PMC)](https://pmc.ncbi.nlm.nih.gov/) articles.
+
+
 ### Advanced options
 
 #### Adjust pipeline features
@@ -126,6 +140,32 @@ result = converter.convert(source)
 
 You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
 
+#### Use specific backend converters
+
+By default, Docling tries to identify the document format and applies the appropriate conversion backend (see the list of [supported formats](#supported-formats)).
+You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](./examples/run_with_formats.py) example.
+Alternatively, you can use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages:
+
+```python
+import urllib.request
+from io import BytesIO
+from docling.backend.html_backend import HTMLDocumentBackend
+from docling.datamodel.base_models import InputFormat
+from docling.datamodel.document import InputDocument
+
+url = "https://en.wikipedia.org/wiki/Duck"
+text = urllib.request.urlopen(url).read()
+in_doc = InputDocument(
+    path_or_stream=BytesIO(text),
+    format=InputFormat.HTML,
+    backend=HTMLDocumentBackend,
+    filename="duck.html",
+)
+backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text))
+result = backend.convert()
+print(result.export_to_markdown())
+```
+
 ## Chunking
 
 You can chunk a Docling document using a [chunker](concepts/chunking.md), such as a