{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_mongodb.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "id": "Ag9kcX2B_atc" }, "source": [ "# RAG with MongoDB + VoyageAI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "| Step | Tech | Execution | \n", "| --- | --- | --- |\n", "| Embedding | Voyage AI | 🌐 Remote |\n", "| Vector store | MongoDB | 🌐 Remote |\n", "| Gen AI | Azure OpenAI | 🌐 Remote |\n", "\n", "## How to cook\n", "\n", "This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) pipeline using MongoDB as a vector store and Voyage AI embedding models for semantic search. The workflow involves extracting and chunking text from documents, generating embeddings with Voyage AI, storing vectors in MongoDB, and leveraging Azure OpenAI for generative responses.\n", "\n", "- **MongoDB Vector Search:** MongoDB supports storing and searching high-dimensional vectors, enabling efficient similarity search for RAG applications. Learn more: [MongoDB Vector Search](https://www.mongodb.com/products/platform/atlas-vector-search)\n", "- **Voyage AI Embeddings:** Voyage AI provides state-of-the-art embedding models for text, supporting robust semantic search and retrieval. See: [Voyage AI Documentation](https://docs.voyageai.com/)\n", "- **Azure OpenAI Models:** Azure OpenAI models are used to generate answers based on the retrieved context. More info: [Azure OpenAI API](https://azure.microsoft.com/en-us/products/ai-foundry/models/openai/)\n", "\n", "By combining these technologies, you can build scalable, production-ready RAG systems for advanced document understanding and question answering." ] }, { "cell_type": "markdown", "metadata": { "id": "4YgT7tpXCUl0" }, "source": [ "## Setting Up Your Environment\n", "\n", "First, we'll install the necessary libraries and configure our environment. These packages enable document processing, database connections, embedding generation, and AI model interaction. We're using Docling for document handling, PyMongo for MongoDB integration, VoyageAI for embeddings, and the OpenAI client for generation capabilities." ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "collapsed": true, "id": "u076oUSF_YUG" }, "outputs": [], "source": [ "%%capture\n", "%pip install docling~=\"2.7.0\"\n", "%pip install pymongo[srv]\n", "%pip install voyageai\n", "%pip install openai\n", "\n", "import logging\n", "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")\n", "logging.getLogger(\"pymongo\").setLevel(logging.ERROR)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "2q2F9RUmR8Wj" }, "source": [ "## Part 1: Setting up Docling\n", "\n", "Part of what makes Docling so remarkable is that it can run on commodity hardware, which means this notebook can run on a local machine with GPU acceleration. If you're using a Mac with an Apple Silicon chip, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration on macOS, integrates with PyTorch and TensorFlow, and offers energy-efficient performance on Apple Silicon along with broad compatibility across Metal-supported GPUs.\n", "\n", "The code below checks to see if a GPU is available, either via CUDA or MPS."
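, "\n", "\n", "If you would rather have the notebook fall back to the CPU on machines without a GPU, a minimal sketch such as the one below works as well; Docling will simply run more slowly on CPU.\n", "\n", "```python\n", "import torch\n", "\n", "# Fall back to CPU when neither CUDA nor MPS is available\n", "device = (\n", "    torch.device(\"cuda\")\n", "    if torch.cuda.is_available()\n", "    else torch.device(\"mps\")\n", "    if torch.backends.mps.is_available()\n", "    else torch.device(\"cpu\")\n", ")\n", "print(f\"Using device: {device}\")\n", "```"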
] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MPS GPU is enabled.\n" ] } ], "source": [ "import torch\n", "\n", "# Check whether a CUDA GPU or an MPS device is available\n", "if torch.cuda.is_available():\n", "    device = torch.device(\"cuda\")\n", "    print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\")\n", "elif torch.backends.mps.is_available():\n", "    device = torch.device(\"mps\")\n", "    print(\"MPS GPU is enabled.\")\n", "else:\n", "    raise OSError(\n", "        \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\"\n", "    )" ] }, { "cell_type": "markdown", "metadata": { "id": "wHTsy4a8JFPl" }, "source": [ "### Single-Document RAG Baseline\n", "\n", "To begin, we will focus on a single seminal paper and treat it as the entire knowledge base. Building a Retrieval-Augmented Generation (RAG) pipeline on just one document serves as a clear, controlled baseline before scaling to multiple sources. This helps validate each stage of the workflow (parsing, chunking, embedding, retrieval, generation) without confounding factors introduced by inter-document noise." ] }, { "cell_type": "code", "execution_count": 126, "metadata": { "id": "Vy5SMPiGDMy-" }, "outputs": [], "source": [ "# Influential machine learning papers\n", "source_urls = [\n", "    \"https://arxiv.org/pdf/1706.03762\"  # Attention is All You Need\n", "]" ] }, { "cell_type": "markdown", "metadata": { "id": "5fi8wzHrCoLa" }, "source": [ "### Convert the Source Document\n", "\n", "Convert the source URL into a structured `DoclingDocument` with Docling's `DocumentConverter`. The converted document is what we will chunk and embed in the next steps.\n", "\n", "Since we are working with a single document, we call `convert()` on the first URL. For multiple documents, you can use the `convert_all()` method and iterate over the converted results." ] }, { "cell_type": "code", "execution_count": 127, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 67, "referenced_widgets": [ "6d049f786a2f4ad7857a6cf2d95b5ba2", "db2a7b9f549e4f0fb1ff3fce655d76a2", "630967a2db4c4714b4c15d1358a0fcae", "b3da9595ab7c4995a00e506e7b5202e3", "243ecaf36ee24cafbd1c33d148f2ca78", "5b7e22df1b464ca894126736e6f72207", "02f6af5993bb4a6a9dbca77952f675d2", "dea323b3de0e43118f338842c94ac065", "bd198d2c0c4c4933a6e6544908d0d846", "febd5c498e4f4f5dbde8dec3cd935502", "ab4f282c0d37451092c60e6566e8e945" ] }, "id": "Sr44xGR1PNSc", "outputId": "b5cca9ee-d7c0-4c8f-c18a-0ac4787984e9" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 73728.00it/s]\n" ] } ], "source": [ "from docling.document_converter import DocumentConverter\n", "\n", "# Instantiate the document converter\n", "doc_converter = DocumentConverter()\n", "\n", "# Convert just the first URL; use convert_all() to batch-convert multiple documents\n", "pdf_doc = source_urls[0]\n", "converted_doc = doc_converter.convert(pdf_doc).document" ] }, { "cell_type": "markdown", "metadata": { "id": "xHun_P-OCtKd" }, "source": [ "### Post-process extracted document data\n", "\n", "We use Docling's `HierarchicalChunker()` to perform hierarchy-aware chunking of the converted document. This is meant to preserve some of the structure and relationships within the document, which enables more accurate and relevant retrieval in our RAG pipeline."
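, "\n", "\n", "Before chunking, it can help to eyeball what Docling actually extracted. The optional sketch below exports the converted document to Markdown and prints the first few hundred characters; it is only a sanity check and is not required for the rest of the pipeline.\n", "\n", "```python\n", "# Optional sanity check: preview the converted document as Markdown\n", "md_preview = converted_doc.export_to_markdown()\n", "print(md_preview[:500])  # first 500 characters\n", "```"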
] }, { "cell_type": "code", "execution_count": 131, "metadata": { "id": "L17ju9xibuIo" }, "outputs": [ { "data": { "text/plain": [ "['arXiv:1706.03762v7 [cs.CL] 2 Aug 2023',\n", " 'Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.',\n", " 'Ashish Vaswani ∗ Google Brain avaswani@google.com',\n", " 'Noam Shazeer ∗ Google Brain noam@google.com',\n", " 'Niki Parmar ∗ Google Research nikip@google.com',\n", " 'Jakob Uszkoreit ∗ Google Research usz@google.com',\n", " 'Llion Jones ∗ Google Research llion@google.com',\n", " 'Aidan N. Gomez ∗ † University of Toronto aidan@cs.toronto.edu',\n", " 'Łukasz Kaiser ∗ Google Brain lukaszkaiser@google.com',\n", " 'Illia Polosukhin ∗ ‡']" ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from docling_core.transforms.chunker import HierarchicalChunker\n", "\n", "# Initialize the chunker\n", "chunker = HierarchicalChunker()\n", "\n", "# Perform hierarchical chunking on the converted document and get the text of each chunk\n", "chunks = list(chunker.chunk(converted_doc))\n", "chunk_texts = [chunk.text for chunk in chunks]\n", "chunk_texts[:10]  # Display the first 10 chunk texts" ] }, { "cell_type": "markdown", "metadata": { "id": "uhLlCpQODaT3" }, "source": [ "## Part 2: VoyageAI (by MongoDB)\n", "### Creating embeddings with VoyageAI" ] }, { "cell_type": "markdown", "metadata": { "id": "ho7xYQTZK5Wk" }, "source": [ "We will use a VoyageAI embedding model to convert the chunks above into embeddings, and then push them to MongoDB for retrieval.\n", "\n", "VoyageAI offers a range of embedding models; here we use `voyage-context-3`, a contextualized chunk embedding model in which each chunk embedding encodes not only the chunk's own content but also contextual information from the full document.\n", "\n", "You can read the [blog post](https://blog.voyageai.com/2025/07/23/voyage-context-3/) to see how it performs in comparison to other embedding models.\n", "\n", "Create an account on Voyage and get your [API key](https://dashboard.voyageai.com/organization/api-keys)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PD53jOT4roj2" }, "outputs": [], "source": [ "# Voyage AI API key (replace with your own)\n", "VOYAGE_API_KEY = \"**********************\"\n", "\n", "import voyageai\n", "\n", "# Initialize the VoyageAI client\n", "vo = voyageai.Client(VOYAGE_API_KEY)\n", "result = vo.contextualized_embed(inputs=[chunk_texts], model=\"voyage-context-3\")\n", "contextualized_chunk_embds = [emb for r in result.results for emb in r.embeddings]" ] }, { "cell_type": "code", "execution_count": 121, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Chunk Texts Length: 118\n", "Contextualized Chunk Embeddings Length: 118\n" ] } ], "source": [ "# Check lengths to ensure they match\n", "print(\"Chunk Texts Length:\", len(chunk_texts))\n", "print(\"Contextualized Chunk Embeddings Length:\", len(contextualized_chunk_embds))" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [], "source": [ "# Combine chunks with their embeddings\n", "chunk_data = [{\"text\": text, \"embedding\": emb} for text, emb in zip(chunk_texts, contextualized_chunk_embds)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Inserting into MongoDB\n", "With the generated embeddings prepared, we now insert them into MongoDB so they can be leveraged in the RAG pipeline.\n", "\n", "MongoDB is an ideal vector store for RAG applications because:\n", "- It supports efficient vector search capabilities through Atlas Vector Search\n", "- It scales well for large document collections\n", "- It offers flexible querying options for combining semantic and traditional search\n", "- It provides robust indexing for fast retrieval\n", "\n", "The chunks with their embeddings will be stored in a MongoDB collection, allowing us to perform similarity searches when responding to user queries." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserted 118 documents into MongoDB.\n" ] } ], "source": [ "# Insert to MongoDB\n", "from pymongo import MongoClient\n", "\n", "client = MongoClient(\"mongodb+srv://*******.mongodb.net/\")  # Replace with your MongoDB connection string\n", "db = client[\"rag_db\"]  # Database name\n", "collection = db[\"documents\"]  # Collection name\n", "\n", "# Insert chunk data into MongoDB\n", "response = collection.insert_many(chunk_data)\n", "print(f\"Inserted {len(response.inserted_ids)} documents into MongoDB.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating an Atlas Vector Search index\n", "Using `pymongo`, we can create a vector search index that will help us search through our vectors and respond to user queries. This index is crucial for efficient similarity searches between user questions and our document chunks. MongoDB Atlas Vector Search provides fast and accurate retrieval of semantically related content, which forms the foundation of our RAG pipeline."
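, "\n", "\n", "Note that Atlas builds search indexes asynchronously, so the index created in the next cell can take a short while before it becomes queryable. The sketch below shows one way to wait for it; it assumes `pymongo`'s `list_search_indexes()` helper and the `queryable` flag that Atlas reports in its index listings, so adjust it if your driver or cluster reports the status differently.\n", "\n", "```python\n", "import time\n", "\n", "\n", "# Poll Atlas until the named search index is reported as queryable\n", "def wait_for_search_index(coll, index_name=\"vector_index\", timeout_s=120):\n", "    deadline = time.time() + timeout_s\n", "    while time.time() < deadline:\n", "        indexes = list(coll.list_search_indexes(index_name))\n", "        if indexes and indexes[0].get(\"queryable\"):\n", "            return True\n", "        time.sleep(5)\n", "    return False\n", "\n", "\n", "# Example usage after creating the index:\n", "# wait_for_search_index(collection)\n", "```"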
] }, { "cell_type": "code", "execution_count": 117, "metadata": { "id": "kttDgwZEsIJQ" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "New search index named vector_index is building.\n" ] } ], "source": [ "from pymongo.operations import SearchIndexModel\n", "\n", "# Create your index model, then create the search index\n", "search_index_model = SearchIndexModel(\n", "    definition={\n", "        \"fields\": [\n", "            {\n", "                \"type\": \"vector\",\n", "                \"path\": \"embedding\",\n", "                \"numDimensions\": 1024,\n", "                \"similarity\": \"dotProduct\"\n", "            }\n", "        ]\n", "    },\n", "    name=\"vector_index\",\n", "    type=\"vectorSearch\"\n", ")\n", "result = collection.create_search_index(model=search_index_model)\n", "print(\"New search index named \" + result + \" is building.\")\n" ] }, { "cell_type": "markdown", "metadata": { "id": "KI01PxjuD_XR" }, "source": [ "### Query the vectorized data\n", "\n", "To query the vectorized data stored in MongoDB, we use the `$vectorSearch` aggregation stage. This powerful feature of MongoDB Atlas enables semantic search by finding documents based on vector similarity.\n", "\n", "When executing a vector search query:\n", "\n", "1. MongoDB computes the similarity between the query vector and the vectors stored in the collection\n", "2. The documents are ranked by their similarity score\n", "3. The top-N most similar results are returned\n", "\n", "This enables us to find semantically related content rather than relying on exact keyword matches. The similarity metric we're using (dot product) is equivalent to cosine similarity for unit-length embeddings such as those produced by Voyage AI, allowing us to identify content that is conceptually similar even if it uses different terminology.\n", "\n", "For RAG applications, this vector search capability is crucial: it allows us to retrieve the most relevant context from our document collection based on the semantic meaning of a user's query, providing the foundation for generating accurate and contextually appropriate responses." ] }, { "cell_type": "markdown", "metadata": { "id": "elo32iMnEC18" }, "source": [ "## Part 4: Perform RAG on the parsed document\n", "\n", "With the chunks embedded and indexed, we can now run the full RAG loop: embed the user's prompt with VoyageAI, retrieve the most similar chunks from MongoDB using the `$vectorSearch` aggregation stage, and pass the retrieved `text` fields as context to an Azure OpenAI chat model.\n", "\n", "We specify a prompt, generate a query embedding for it, limit the number of retrieved results used in the generation, and instruct the model to answer using only the retrieved context." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 233 }, "id": "7r2LMSX9bO4y", "outputId": "84639adf-7783-4d43-94d9-711fb313a168" }, "outputs": [ { "data": { "text/html": [ "
╭──────────────────────────────────────────────────── Prompt ─────────────────────────────────────────────────────╮\n", "│ Give me top 3 learning points from `Attention is All You Need`, using only the retrieved context. │\n", "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n", "\n" ], "text/plain": [ "\u001b[1;31m╭─\u001b[0m\u001b[1;31m───────────────────────────────────────────────────\u001b[0m\u001b[1;31m Prompt \u001b[0m\u001b[1;31m────────────────────────────────────────────────────\u001b[0m\u001b[1;31m─╮\u001b[0m\n", "\u001b[1;31m│\u001b[0m Give me top 3 learning points from `Attention is All You Need`, using only the retrieved context. \u001b[1;31m│\u001b[0m\n", "\u001b[1;31m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
╭─────────────────────────────────────────────── Generated Content ───────────────────────────────────────────────╮\n", "│ 1. **Introduction of the Transformer Architecture**: The Transformer model is a novel architecture that relies │\n", "│ entirely on attention mechanisms, eliminating the need for recurrence and convolutions. This allows for │\n", "│ significantly more parallelization during training and leads to superior performance in tasks such as machine │\n", "│ translation. │\n", "│ │\n", "│ 2. **Performance and Efficiency**: The Transformer achieves state-of-the-art results on machine translation │\n", "│ tasks, such as a BLEU score of 28.4 on the WMT 2014 English-to-German task and 41.8 on the English-to-French │\n", "│ task, while requiring much less training time (3.5 days on eight GPUs) compared to previous models. This │\n", "│ demonstrates the efficiency and effectiveness of the architecture. │\n", "│ │\n", "│ 3. **Self-Attention Mechanism**: The self-attention layers in both the encoder and decoder allow for each │\n", "│ position to attend to all other positions in the sequence, enabling the model to capture global dependencies. │\n", "│ This mechanism is more computationally efficient than recurrent layers, which require sequential operations, │\n", "│ thus improving the model's speed and scalability. │\n", "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n", "\n" ], "text/plain": [ "\u001b[1;32m╭─\u001b[0m\u001b[1;32m──────────────────────────────────────────────\u001b[0m\u001b[1;32m Generated Content \u001b[0m\u001b[1;32m──────────────────────────────────────────────\u001b[0m\u001b[1;32m─╮\u001b[0m\n", "\u001b[1;32m│\u001b[0m 1. **Introduction of the Transformer Architecture**: The Transformer model is a novel architecture that relies \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m entirely on attention mechanisms, eliminating the need for recurrence and convolutions. This allows for \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m significantly more parallelization during training and leads to superior performance in tasks such as machine \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m translation. \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m 2. **Performance and Efficiency**: The Transformer achieves state-of-the-art results on machine translation \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m tasks, such as a BLEU score of 28.4 on the WMT 2014 English-to-German task and 41.8 on the English-to-French \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m task, while requiring much less training time (3.5 days on eight GPUs) compared to previous models. This \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m demonstrates the efficiency and effectiveness of the architecture. \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m 3. **Self-Attention Mechanism**: The self-attention layers in both the encoder and decoder allow for each \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m position to attend to all other positions in the sequence, enabling the model to capture global dependencies. \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m This mechanism is more computationally efficient than recurrent layers, which require sequential operations, \u001b[1;32m│\u001b[0m\n", "\u001b[1;32m│\u001b[0m thus improving the model's speed and scalability. 
\u001b[1;32m│\u001b[0m\n", "\u001b[1;32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from rich.console import Console\n", "from rich.panel import Panel\n", "from openai import AzureOpenAI\n", "\n", "# Define the prompt for the MongoDB vector search query on \"Attention is All You Need\"\n", "prompt = \"Give me top 3 learning points from `Attention is All You Need`, using only the retrieved context.\"\n", "\n", "# Generate an embedding for the query using VoyageAI (vo was initialized earlier)\n", "query_embd_context = vo.contextualized_embed(\n", "    inputs=[[prompt]],\n", "    model=\"voyage-context-3\",\n", "    input_type=\"query\"\n", ").results[0].embeddings[0]\n", "\n", "# Vector search pipeline\n", "search_pipeline = [\n", "    {\n", "        \"$vectorSearch\": {\n", "            \"index\": \"vector_index\",\n", "            \"path\": \"embedding\",\n", "            \"queryVector\": query_embd_context,\n", "            \"numCandidates\": 10,\n", "            \"limit\": 10\n", "        }\n", "    },\n", "    {\n", "        \"$project\": {\n", "            \"text\": 1,\n", "            \"_id\": 0,\n", "            \"score\": {\"$meta\": \"vectorSearchScore\"}\n", "        }\n", "    }\n", "]\n", "\n", "results = list(collection.aggregate(search_pipeline))\n", "if not results:\n", "    raise ValueError(\"No vector search results returned. Verify the index is built before querying.\")\n", "\n", "context_texts = [doc[\"text\"] for doc in results]\n", "combined_context = \"\\n\\n\".join(context_texts)\n", "\n", "# Azure OpenAI credentials. The values below are masked placeholders; in practice,\n", "# load these from environment variables or a secrets manager rather than hardcoding them.\n", "AZURE_OPENAI_API_KEY = \"**********************\"\n", "AZURE_OPENAI_ENDPOINT = \"**********************\"  # e.g. https://your-resource-name.openai.azure.com/\n", "AZURE_OPENAI_API_VERSION = \"**********************\"\n", "\n", "# Initialize the Azure OpenAI client (the endpoint must not include path segments)\n", "client = AzureOpenAI(\n", "    api_key=AZURE_OPENAI_API_KEY,\n", "    azure_endpoint=AZURE_OPENAI_ENDPOINT.rstrip(\"/\"),\n", "    api_version=AZURE_OPENAI_API_VERSION\n", ")\n", "\n", "# Chat completion using the retrieved context\n", "response = client.chat.completions.create(\n", "    model=\"gpt-4o-mini\",  # Azure deployment name\n", "    messages=[\n", "        {\n", "            \"role\": \"system\",\n", "            \"content\": \"You are a helpful assistant. Use only the provided context to answer questions. If the context is insufficient, say so.\"\n", "        },\n", "        {\n", "            \"role\": \"user\",\n", "            \"content\": f\"Context:\\n{combined_context}\\n\\nQuestion: {prompt}\"\n", "        }\n", "    ],\n", "    temperature=0.2\n", ")\n", "\n", "response_text = response.choices[0].message.content\n", "\n", "console = Console()\n", "console.print(Panel(prompt, title=\"Prompt\", border_style=\"bold red\"))\n", "console.print(Panel(response_text, title=\"Generated Content\", border_style=\"bold green\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook demonstrated a powerful RAG pipeline using MongoDB, VoyageAI, and Azure OpenAI. 
By combining MongoDB's vector search capabilities with VoyageAI's embeddings and Azure OpenAI's language models, we created an intelligent document retrieval system.\n", "\n", "### Key Achievements:\n", "- Processed documents with Docling\n", "- Generated contextual embeddings with VoyageAI\n", "- Stored vectors in MongoDB Atlas\n", "- Implemented semantic search for relevant context retrieval\n", "- Generated accurate responses with Azure OpenAI\n", "\n", "### Next Steps:\n", "1. Expand your knowledge base with more documents\n", "2. Experiment with chunking and embedding parameters\n", "3. Build a user interface\n", "4. Implement evaluation metrics\n", "5. Deploy to production with proper scaling\n", "\n", "Start building your own intelligent document retrieval system today!\n" ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.11" } }, "nbformat": 4, "nbformat_minor": 0 }