feat: expose new hybrid chunker, update docs (#384)

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-12-08 20:58:11 +00:00 · 2024-12-09 08:28:29 +01:00
parent eb7ffcdd1c
commit c8ecdd987e
9 changed files with 598 additions and 7 deletions
--- a/docs/concepts/chunking.md
+++ b/docs/concepts/chunking.md
@@ -0,0 +1,65 @@
+## Introduction
+
+A *chunker* is a Docling abstraction that, given a
+[`DoclingDocument`](./docling_document.md), returns a stream of chunks, each of which
+captures some part of the document as a string accompanied by respective metadata.
+
+To enable both flexibility for downstream applications and out-of-the-box utility,
+Docling defines a chunker class hierarchy, providing a base type, `BaseChunker`, as well
+as specific subclasses.
+
+Docling's integrations with gen AI frameworks such as LlamaIndex go through the
+`BaseChunker` interface, so users can easily plug in any built-in, self-defined, or
+third-party `BaseChunker` implementation.
+
+## Base Chunker
+
+The `BaseChunker` base class requires every chunker to provide the following methods:
+
+- `def chunk(self, dl_doc: DoclingDocument, **kwargs) -> Iterator[BaseChunk]`:
+  returns the chunks for the provided document.
+- `def serialize(self, chunk: BaseChunk) -> str`:
+  returns the potentially metadata-enriched serialization of the chunk, typically
+  used to feed an embedding model (or generation model).
+
+## Hybrid Chunker
+
+!!! note "To access `HybridChunker`"
+
+    - If you are using the `docling` package, you can import as follows:
+        ```python
+        from docling.chunking import HybridChunker
+        ```
+    - If you are only using the `docling-core` package, make sure to install
+        the `chunking` extra, e.g.
+        ```shell
+        pip install 'docling-core[chunking]'
+        ```
+        and then you can import the chunker
+        as follows:
+        ```python
+        from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
+        ```
+
+The `HybridChunker` implementation uses a hybrid approach, applying tokenization-aware
+refinements on top of document-based [hierarchical](#hierarchical-chunker) chunking.
+
+More precisely:
+
+- it starts from the result of the hierarchical chunker and, based on the user-provided
+  tokenizer (typically aligned with the embedding model's tokenizer), it:
+    - does one pass where it splits chunks only when needed (i.e. when they are
+      oversized in terms of token count), and
+    - does another pass where it merges chunks only when possible (i.e. undersized
+      successive chunks with the same headings and captions); users can opt out of
+      this step via the `merge_peers` parameter (`True` by default)
+
+👉 Example: see [here](../../examples/hybrid_chunking).
+
+## Hierarchical Chunker
+
+The `HierarchicalChunker` implementation uses the document structure information from
+the [`DoclingDocument`](../docling_document) to create one chunk for each individual
+detected document element, by default merging only list items together (this can be
+disabled via the `merge_list_items` parameter). It also takes care of attaching all
+relevant document metadata, including headers and captions.
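
As a rough illustration of the hierarchical behaviour described above, here is a toy sketch that does not use Docling itself. `Element`, `Chunk`, and `toy_hierarchical_chunks` are hypothetical names invented for this example; only the `merge_list_items` parameter name comes from the text, and the real implementation may well differ (captions, for instance, are left out here).

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a detected document element."""
    kind: str  # "heading", "paragraph", or "list_item"
    text: str


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunk: text plus heading metadata."""
    text: str
    headings: List[str] = field(default_factory=list)


def toy_hierarchical_chunks(
    elements: List[Element], merge_list_items: bool = True
) -> Iterator[Chunk]:
    """Yield one chunk per element, optionally merging runs of list items."""
    heading: Optional[str] = None
    pending_items: List[str] = []  # consecutive list items awaiting a merge

    def flush() -> Iterator[Chunk]:
        # Emit the buffered list items (if any) as a single merged chunk.
        if pending_items:
            yield Chunk("\n".join(pending_items), [heading] if heading else [])
            pending_items.clear()

    for el in elements:
        if el.kind == "list_item" and merge_list_items:
            pending_items.append(el.text)
            continue
        yield from flush()
        if el.kind == "heading":
            heading = el.text  # headings become metadata for later chunks
        else:
            yield Chunk(el.text, [heading] if heading else [])
    yield from flush()


doc = [
    Element("heading", "Setup"),
    Element("paragraph", "Install the package."),
    Element("list_item", "step one"),
    Element("list_item", "step two"),
]
merged = list(toy_hierarchical_chunks(doc))
unmerged = list(toy_hierarchical_chunks(doc, merge_list_items=False))
```

With merging enabled, the two list items end up in one chunk under the "Setup" heading; with `merge_list_items=False`, each list item becomes its own chunk.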