Mirror of https://github.com/DS4SD/docling.git (synced 2025-12-08 20:58:11 +00:00)
@@ -7,7 +7,7 @@ pydantic datatype, which can express several features common to documents, such
 * Layout information (i.e. bounding boxes) for all items, if available
 * Provenance information

-The definition of the Pydantic types is implemented in the module `docling_core.types.doc`, more details in [source code definitions](https://github.com/DS4SD/docling-core/tree/main/docling_core/types/doc).
+The definition of the Pydantic types is implemented in the module `docling_core.types.doc`, more details in [source code definitions](https://github.com/docling-project/docling-core/tree/main/docling_core/types/doc).

 It also brings a set of document construction APIs to build up a `DoclingDocument` from scratch.

@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/backend_xml_rag.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/backend_xml_rag.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 ]
 },
 {
@@ -36,7 +36,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"This is an example of using [Docling](https://ds4sd.github.io/docling/) for converting structured data (XML) into a unified document\n",
+"This is an example of using [Docling](https://docling-project.github.io/docling/) for converting structured data (XML) into a unified document\n",
 "representation format, `DoclingDocument`, and leverage its riched structured content for RAG applications.\n",
 "\n",
 "Data used in this example consist of patents from the [United States Patent and Trademark Office (USPTO)](https://www.uspto.gov/) and medical\n",
@@ -103,7 +103,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"> 👉 **NOTE**: As you see above, using the `HybridChunker` can sometimes lead to a warning from the transformers library, however this is a \"false alarm\" — for details check [here](https://ds4sd.github.io/docling/faq/#hybridchunker-triggers-warning-token-indices-sequence-length-is-longer-than-the-specified-maximum-sequence-length-for-this-model)."
+"> 👉 **NOTE**: As you see above, using the `HybridChunker` can sometimes lead to a warning from the transformers library, however this is a \"false alarm\" — for details check [here](https://docling-project.github.io/docling/faq/#hybridchunker-triggers-warning-token-indices-sequence-length-is-longer-than-the-specified-maximum-sequence-length-for-this-model)."
 ]
 },
 {
@@ -321,7 +321,7 @@
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "docling-aMWN2FRM-py3.12",
+"display_name": "docling-hgXEfXco-py3.12",
 "language": "python",
 "name": "python3"
 },
@@ -36,7 +36,7 @@
 "## A recipe 🧑🍳 🐥 💚\n",
 "\n",
 "This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system using:\n",
-"- [Docling](https://ds4sd.github.io/docling/) for document parsing and chunking\n",
+"- [Docling](https://docling-project.github.io/docling/) for document parsing and chunking\n",
 "- [Azure AI Search](https://azure.microsoft.com/products/ai-services/ai-search/?msockid=0109678bea39665431e37323ebff6723) for vector indexing and retrieval\n",
 "- [Azure OpenAI](https://azure.microsoft.com/products/ai-services/openai-service?msockid=0109678bea39665431e37323ebff6723) for embeddings and chat completion\n",
 "\n",
@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_haystack.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_haystack.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 ]
 },
 {
@@ -247,7 +247,7 @@
 "name": "stderr",
 "output_type": "stream",
 "text": [
-"/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.12/site-packages/huggingface_hub/inference/_client.py:2232: FutureWarning: `stop_sequences` is a deprecated argument for `text_generation` task and will be removed in version '0.28.0'. Use `stop` instead.\n",
+"/Users/pva/work/github.com/docling-project/docling/.venv/lib/python3.12/site-packages/huggingface_hub/inference/_client.py:2232: FutureWarning: `stop_sequences` is a deprecated argument for `text_generation` task and will be removed in version '0.28.0'. Use `stop` instead.\n",
 " warnings.warn(\n"
 ]
 }
@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_langchain.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_langchain.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 ]
 },
 {
@@ -168,7 +168,7 @@
 "source": [
 "> Note: a message saying `\"Token indices sequence length is longer than the specified\n",
 "maximum sequence length...\"` can be ignored in this case — details\n",
-"[here](https://github.com/DS4SD/docling-core/issues/119#issuecomment-2577418826)."
+"[here](https://github.com/docling-project/docling-core/issues/119#issuecomment-2577418826)."
 ]
 },
 {
@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_llamaindex.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_llamaindex.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 ]
 },
 {
@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"[](https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_weaviate.ipynb)"
+"[](https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_weaviate.ipynb)"
 ]
 },
 {
@@ -29,7 +29,7 @@
 "\n",
 "## A recipe 🧑🍳 🐥 💚\n",
 "\n",
-"This is a code recipe that uses [Weaviate](https://weaviate.io/) to perform RAG over PDF documents parsed by [Docling](https://ds4sd.github.io/docling/).\n",
+"This is a code recipe that uses [Weaviate](https://weaviate.io/) to perform RAG over PDF documents parsed by [Docling](https://docling-project.github.io/docling/).\n",
 "\n",
 "In this notebook, we accomplish the following:\n",
 "* Parse the top machine learning papers on [arXiv](https://arxiv.org/) using Docling\n",
@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/hybrid_rag_qdrant\n",
+"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/hybrid_rag_qdrant\n",
 ".ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 ]
 },
@@ -109,7 +109,7 @@
 "name": "stderr",
 "output_type": "stream",
 "text": [
-"/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.12/site-packages/huggingface_hub/utils/tqdm.py:155: UserWarning: Cannot enable progress bars: environment variable `HF_HUB_DISABLE_PROGRESS_BARS=1` is set and has priority.\n",
+"/Users/pva/work/github.com/docling-project/docling/.venv/lib/python3.12/site-packages/huggingface_hub/utils/tqdm.py:155: UserWarning: Cannot enable progress bars: environment variable `HF_HUB_DISABLE_PROGRESS_BARS=1` is set and has priority.\n",
 " warnings.warn(\n"
 ]
 }
@@ -1,6 +1,6 @@
 # FAQ

-This is a collection of FAQ collected from the user questions on <https://github.com/DS4SD/docling/discussions>.
+This is a collection of FAQ collected from the user questions on <https://github.com/docling-project/docling/discussions>.


 ??? question "Is Python 3.13 supported?"
@@ -41,7 +41,7 @@ This is a collection of FAQ collected from the user questions on <https://github
 ]
 ```

-Source: Issue [#283](https://github.com/DS4SD/docling/issues/283#issuecomment-2465035868)
+Source: Issue [#283](https://github.com/docling-project/docling/issues/283#issuecomment-2465035868)


 ??? question "Are text styles (bold, underline, etc) supported?"
@@ -74,7 +74,7 @@ This is a collection of FAQ collected from the user questions on <https://github
 )
 ```

-Source: Issue [#326](https://github.com/DS4SD/docling/issues/326)
+Source: Issue [#326](https://github.com/docling-project/docling/issues/326)


 ??? question " Which model weights are needed to run Docling?"
@@ -84,7 +84,7 @@ This is a collection of FAQ collected from the user questions on <https://github

 For processing PDF documents, Docling requires the model weights from <https://huggingface.co/ds4sd/docling-models>.

-When OCR is enabled, some engines also require model artifacts. For example EasyOCR, for which Docling has [special pipeline options](https://github.com/DS4SD/docling/blob/main/docling/datamodel/pipeline_options.py#L68) to control the runtime behavior.
+When OCR is enabled, some engines also require model artifacts. For example EasyOCR, for which Docling has [special pipeline options](https://github.com/docling-project/docling/blob/main/docling/datamodel/pipeline_options.py#L68) to control the runtime behavior.


 ??? question "SSL error downloading model weights"
@@ -174,6 +174,6 @@ This is a collection of FAQ collected from the user questions on <https://github
 print(f"Model max length: {tokenizer.model_max_length}")
 ```

-Also see [docling#725](https://github.com/DS4SD/docling/issues/725).
+Also see [docling#725](https://github.com/docling-project/docling/issues/725).

-Source: Issue [docling-core#119](https://github.com/DS4SD/docling-core/issues/119)
+Source: Issue [docling-core#119](https://github.com/docling-project/docling-core/issues/119)
@@ -11,7 +11,7 @@
 [](https://pycqa.github.io/isort/)
 [](https://pydantic.dev)
 [](https://github.com/pre-commit/pre-commit)
-[](https://opensource.org/licenses/MIT)
+[](https://opensource.org/licenses/MIT)
 [](https://pepy.tech/projects/docling)

 Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
@@ -5,7 +5,7 @@ Docling is available as a converter in [Haystack](https://haystack.deepset.ai/):
 - 🧑🏽🍳 [Docling Haystack integration example][example]
 - 📦 [Docling Haystack integration PyPI][pypi]

-[github]: https://github.com/DS4SD/docling-haystack
+[github]: https://github.com/docling-project/docling-haystack
 [docs]: https://haystack.deepset.ai/integrations/docling
 [pypi]: https://pypi.org/project/docling-haystack
 [example]: ../examples/rag_haystack.ipynb
@@ -8,7 +8,7 @@ To get started, check out the [step-by-step guide in LangChain][guide].
 - 📦 [LangChain Docling integration PyPI][pypi]

 [docs]: https://python.langchain.com/docs/integrations/providers/docling/
-[github]: https://github.com/DS4SD/docling-langchain
+[github]: https://github.com/docling-project/docling-langchain
 [guide]: https://python.langchain.com/docs/integrations/document_loaders/docling/
 [example]: ../examples/rag_langchain.ipynb
 [pypi]: https://pypi.org/project/langchain-docling/