docs: add integrations, revamp docs (#693)

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Panos Vagenas
2025-01-07 14:15:54 +01:00
committed by GitHub
parent d49650c54f
commit 2d24faecd9
11 changed files with 355 additions and 330 deletions


@@ -4,7 +4,30 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Hybrid Chunking"
"# Hybrid chunking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hybrid chunking applies tokenization-aware refinements on top of document-based hierarchical chunking.\n",
"\n",
"For more details, see [here](../../concepts/chunking#hybrid-chunker)."
]
},
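{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "For orientation, a minimal end-to-end sketch of how a `doc` is typically obtained and chunked (the source URL is illustrative; the notebook's own conversion step lives in the setup cells not shown in this hunk):\n",
  "\n",
  "    from docling.document_converter import DocumentConverter\n",
  "    from docling.chunking import HybridChunker\n",
  "\n",
  "    # convert a source document into a DoclingDocument (URL is just an example)\n",
  "    doc = DocumentConverter().convert(source=\"https://en.wikipedia.org/wiki/IBM\").document\n",
  "\n",
  "    # tokenization-aware refinement on top of the hierarchical chunks\n",
  "    chunker = HybridChunker()\n",
  "    chunks = list(chunker.chunk(dl_doc=doc))"
 ]
},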
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
@@ -21,7 +44,7 @@
}
],
"source": [
"%pip install -qU 'docling-core[chunking]' sentence-transformers transformers lancedb"
"%pip install -qU 'docling-core[chunking]' sentence-transformers transformers"
]
},
{
@@ -48,16 +71,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Chunking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how `tokenizer` and `embed_model` further below are single-sourced from `EMBED_MODEL_ID`.\n",
"## Chunking\n",
"\n",
"This is important for making sure the chunker and the embedding model are using the same tokenizer."
"### Basic usage\n",
"\n",
"For a basic usage scenario, we can just instantiate a `HybridChunker`, which will use\n",
"the default parameters."
]
},
{
@@ -65,20 +84,102 @@
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from docling.chunking import HybridChunker\n",
"\n",
"chunker = HybridChunker()\n",
"chunk_iter = chunker.chunk(dl_doc=doc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the text you would typically want to embed is the context-enriched one as\n",
"returned by the `serialize()` method:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"=== 0 ===\n",
"chunk.text:\n",
"'International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Aver…'\n",
"chunker.serialize(chunk):\n",
"'IBM\\nInternational Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial …'\n",
"\n",
"=== 1 ===\n",
"chunk.text:\n",
"'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889);[19] and Willa…'\n",
"chunker.serialize(chunk):\n",
"'IBM\\n1910s1950s\\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889…'\n",
"\n",
"=== 2 ===\n",
"chunk.text:\n",
"'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson,…'\n",
"chunker.serialize(chunk):\n",
"'IBM\\n1910s1950s\\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John …'\n",
"\n",
"=== 3 ===\n",
"chunk.text:\n",
"'In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.…'\n",
"chunker.serialize(chunk):\n",
"'IBM\\n1960s1980s\\nIn 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.…'\n",
"\n"
]
}
],
"source": [
"for i, chunk in enumerate(chunk_iter):\n",
" print(f\"=== {i} ===\")\n",
" print(f\"chunk.text:\\n{repr(f'{chunk.text[:300]}…')}\")\n",
"\n",
" enriched_text = chunker.serialize(chunk=chunk)\n",
" print(f\"chunker.serialize(chunk):\\n{repr(f'{enriched_text[:300]}…')}\")\n",
"\n",
" print()"
]
},
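{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "A minimal sketch of embedding the context-enriched serialization rather than the bare `chunk.text`, assuming the sentence-transformers model installed above (the model id literal mirrors the `EMBED_MODEL_ID` introduced in the advanced-usage cell below):\n",
  "\n",
  "    from sentence_transformers import SentenceTransformer\n",
  "\n",
  "    embed_model = SentenceTransformer(\"sentence-transformers/all-MiniLM-L6-v2\")\n",
  "    vectors = [\n",
  "        embed_model.encode(chunker.serialize(chunk=chunk))\n",
  "        for chunk in chunker.chunk(dl_doc=doc)  # re-chunk, as chunk_iter above is already consumed\n",
  "    ]"
 ]
},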
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Advanced usage\n",
"\n",
"For more control on the chunking, we can parametrize through the `HybridChunker`\n",
"arguments illustrated below.\n",
"\n",
"Notice how `tokenizer` and `embed_model` further below are single-sourced from\n",
"`EMBED_MODEL_ID`.\n",
"This is important for making sure the chunker and the embedding model are using the same\n",
"tokenizer."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from transformers import AutoTokenizer\n",
"\n",
"from docling.chunking import HybridChunker\n",
"\n",
"EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\n",
"MAX_TOKENS = 64\n",
"MAX_TOKENS = 64 # set to a small number for illustrative purposes\n",
"\n",
"tokenizer = AutoTokenizer.from_pretrained(EMBED_MODEL_ID)\n",
"\n",
"chunker = HybridChunker(\n",
" tokenizer=tokenizer, # can also just pass model name instead of tokenizer instance\n",
" tokenizer=tokenizer, # instance or model name, defaults to \"sentence-transformers/all-MiniLM-L6-v2\"\n",
" max_tokens=MAX_TOKENS, # optional, by default derived from `tokenizer`\n",
" # merge_peers=True, # optional, defaults to True\n",
" merge_peers=True, # optional, defaults to True\n",
")\n",
"chunk_iter = chunker.chunk(dl_doc=doc)\n",
"chunks = list(chunk_iter)"
@@ -88,7 +189,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Points to notice:\n",
"Points to notice looking at the output chunks below:\n",
"- Where possible, we fit the limit of 64 tokens for the metadata-enriched serialization form (see chunk 2)\n",
"- Where neeeded, we stop before the limit, e.g. see cases of 63 as it would otherwise run into a comma (see chunk 6)\n",
"- Where possible, we merge undersized peer chunks (see chunk 0)\n",
@@ -97,7 +198,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 6,
"metadata": {},
"outputs": [
{
@@ -245,174 +346,6 @@
"\n",
" print()"
]
},
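{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "A minimal sketch, assuming the `tokenizer`, `chunker`, `chunks`, and `MAX_TOKENS` defined in the cells above, of how the token counts behind these observations can be double-checked:\n",
  "\n",
  "    for i, chunk in enumerate(chunks):\n",
  "        ser_txt = chunker.serialize(chunk=chunk)\n",
  "        # count tokens of the metadata-enriched serialization with the same tokenizer the chunker uses\n",
  "        n_tokens = len(tokenizer.tokenize(ser_txt))\n",
  "        print(f\"chunk {i}: {n_tokens} tokens (limit: {MAX_TOKENS})\")"
 ]
},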
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Vector Retrieval"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"To disable this warning, you can either:\n",
"\t- Avoid using `tokenizers` before the fork if possible\n",
"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
]
}
],
"source": [
"from sentence_transformers import SentenceTransformer\n",
"\n",
"embed_model = SentenceTransformer(EMBED_MODEL_ID)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>vector</th>\n",
" <th>text</th>\n",
" <th>headings</th>\n",
" <th>captions</th>\n",
" <th>_distance</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>[-0.1269039, -0.01948185, -0.07718097, -0.1116...</td>\n",
" <td>language, and the UPC barcode. The company has...</td>\n",
" <td>[IBM]</td>\n",
" <td>None</td>\n",
" <td>1.164613</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>[-0.10198064, 0.0055981805, -0.05095279, -0.13...</td>\n",
" <td>IBM originated with several technological inno...</td>\n",
" <td>[IBM, 1910s1950s]</td>\n",
" <td>None</td>\n",
" <td>1.245144</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>[-0.057121325, -0.034115084, -0.018113216, -0....</td>\n",
" <td>As one of the world's oldest and largest techn...</td>\n",
" <td>[IBM]</td>\n",
" <td>None</td>\n",
" <td>1.355586</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>[-0.04429054, -0.058111433, -0.009330196, -0.0...</td>\n",
" <td>IBM is the largest industrial research organiz...</td>\n",
" <td>[IBM]</td>\n",
" <td>None</td>\n",
" <td>1.398617</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>[-0.11920792, 0.053496413, -0.042391937, -0.03...</td>\n",
" <td>Awards.[16]</td>\n",
" <td>[IBM]</td>\n",
" <td>None</td>\n",
" <td>1.446295</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" vector \\\n",
"0 [-0.1269039, -0.01948185, -0.07718097, -0.1116... \n",
"1 [-0.10198064, 0.0055981805, -0.05095279, -0.13... \n",
"2 [-0.057121325, -0.034115084, -0.018113216, -0.... \n",
"3 [-0.04429054, -0.058111433, -0.009330196, -0.0... \n",
"4 [-0.11920792, 0.053496413, -0.042391937, -0.03... \n",
"\n",
" text headings \\\n",
"0 language, and the UPC barcode. The company has... [IBM] \n",
"1 IBM originated with several technological inno... [IBM, 1910s1950s] \n",
"2 As one of the world's oldest and largest techn... [IBM] \n",
"3 IBM is the largest industrial research organiz... [IBM] \n",
"4 Awards.[16] [IBM] \n",
"\n",
" captions _distance \n",
"0 None 1.164613 \n",
"1 None 1.245144 \n",
"2 None 1.355586 \n",
"3 None 1.398617 \n",
"4 None 1.446295 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from pathlib import Path\n",
"from tempfile import mkdtemp\n",
"\n",
"import lancedb\n",
"\n",
"\n",
"def make_lancedb_index(db_uri, index_name, chunks, embedding_model):\n",
" db = lancedb.connect(db_uri)\n",
" data = []\n",
" for chunk in chunks:\n",
" embeddings = embedding_model.encode(chunker.serialize(chunk=chunk))\n",
" data_item = {\n",
" \"vector\": embeddings,\n",
" \"text\": chunk.text,\n",
" \"headings\": chunk.meta.headings,\n",
" \"captions\": chunk.meta.captions,\n",
" }\n",
" data.append(data_item)\n",
" tbl = db.create_table(index_name, data=data, exist_ok=True)\n",
" return tbl\n",
"\n",
"\n",
"db_uri = str(Path(mkdtemp()) / \"docling.db\")\n",
"index = make_lancedb_index(db_uri, doc.name, chunks, embed_model)\n",
"\n",
"sample_query = \"invent\"\n",
"sample_embedding = embed_model.encode(sample_query)\n",
"results = index.search(sample_embedding).limit(5)\n",
"\n",
"results.to_pandas()"
]
}
],
"metadata": {