# Advanced Chunking

In this notebook, we demonstrate an advanced chunking example, showcasing how a user can:
- serialize and include some parts of the metadata (as per application logic) into the final chunk text, and
- leverage a tokenizer to build specialized chunking logic, e.g. to impose a maximum token length and further split chunks that exceed it.

We first convert an example document:

In [1]:
from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"
converter = DocumentConverter()
doc = converter.convert(source=source).document

Below we define the metadata serialization logic and the specific usage of the tokenizer for applying the token limits.

The whole process is wrapped in a `BaseChunker` implementation that internally uses a `HierarchicalChunker` and applies this logic on top of the latter's results.

In [2]:
from copy import deepcopy
from typing import Iterable, Iterator

from docling_core.transforms.chunker import (
    BaseChunk,
    BaseChunker,
    DocMeta,
    HierarchicalChunker,
)
from docling_core.types.doc import DoclingDocument as DLDocument
from pydantic import ConfigDict, PositiveInt
from transformers import AutoTokenizer


class MaxTokenLimitingChunker(BaseChunker):
    model_config = ConfigDict(arbitrary_types_allowed=True)

    inner_chunker: BaseChunker = HierarchicalChunker()
    tokenizer: AutoTokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")
    max_tokens: PositiveInt = 512
    delim: str = "\n"

    def _serialize_meta_to_include(self, meta: DocMeta) -> str:
        """Join headings and captions (when present) into one delimited string."""
        meta_parts = []
        headings_part = self.delim.join(meta.headings or [])
        if headings_part:
            meta_parts.append(headings_part)
        captions_part = self.delim.join(meta.captions or [])
        if captions_part:
            meta_parts.append(captions_part)
        return self.delim.join(meta_parts)

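    def _split_above_max_tokens(
        self, chunk_iter: Iterable[BaseChunk]
    ) -> Iterator[BaseChunk]:
        """Prepend serialized metadata and split chunks exceeding `max_tokens`."""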
        for chunk in chunk_iter:
            meta = DocMeta.model_validate(chunk.meta)
            meta_text = self._serialize_meta_to_include(meta=meta)
            meta_list = [meta_text] if meta_text else []
            full_ser = self.delim.join(meta_list + ([chunk.text] if chunk.text else []))

            meta_tokens = self.tokenizer(
                meta_text, return_offsets_mapping=True, add_special_tokens=False
            )["offset_mapping"]
            delim_tokens = (
                self.tokenizer(
                    self.delim, return_offsets_mapping=True, add_special_tokens=False
                )["offset_mapping"]
                if meta_text
                else []
            )
            num_tokens_avail_for_text = self.max_tokens - (
                len(meta_tokens) + len(delim_tokens)
            )

            text_tokens = self.tokenizer(
                chunk.text, return_offsets_mapping=True, add_special_tokens=False
            )["offset_mapping"]
            num_text_tokens = len(text_tokens)

            if (
                num_text_tokens <= num_tokens_avail_for_text
            ):  # chunk already within token limit
                c = deepcopy(chunk)
                c.text = full_ser
                yield c
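            else:  # chunk requires further splitting to meet token limit
                if num_tokens_avail_for_text <= 0:
                    # A zero step makes range() raise; a negative step would
                    # silently drop the chunk, so fail with a clear message.
                    raise ValueError(
                        f"max_tokens={self.max_tokens} leaves no room for text "
                        "after the serialized metadata"
                    )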
                fitting_texts = [
                    chunk.text[
                        text_tokens[base][0] : text_tokens[
                            min(base + num_tokens_avail_for_text, num_text_tokens) - 1
                        ][1]
                    ]
                    for base in range(0, num_text_tokens, num_tokens_avail_for_text)
                ]
                for text in fitting_texts:
                    c = deepcopy(chunk)
                    c.text = self.delim.join(meta_list + [text])
                    yield c

    def chunk(self, dl_doc: DLDocument, **kwargs) -> Iterator[BaseChunk]:
        chunk_iter = self.inner_chunker.chunk(dl_doc=dl_doc, **kwargs)
        yield from self._split_above_max_tokens(chunk_iter=chunk_iter)
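
The splitting step in `_split_above_max_tokens` relies on the character offsets of each token: consecutive windows of at most N tokens are mapped back to slices of the original string. The following is a minimal, self-contained sketch of that idea. It assumes a hypothetical whitespace tokenizer (`toy_offsets`) in place of the Hugging Face tokenizer used above, and the helper name `split_by_token_budget` is illustrative, not part of Docling.

```python
import re
from typing import List, Tuple

Offsets = List[Tuple[int, int]]


def toy_offsets(text: str) -> Offsets:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word.

    Returns (start, end) character offsets, similar in shape to the
    "offset_mapping" of a fast Hugging Face tokenizer.
    """
    return [m.span() for m in re.finditer(r"\S+", text)]


def split_by_token_budget(text: str, offsets: Offsets, budget: int) -> List[str]:
    """Slice `text` into pieces that each cover at most `budget` tokens."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    pieces = []
    for base in range(0, len(offsets), budget):
        # Index of the last token that belongs to this window.
        last = min(base + budget, len(offsets)) - 1
        start, end = offsets[base][0], offsets[last][1]
        pieces.append(text[start:end])
    return pieces


text = "one two three four five six seven"
pieces = split_by_token_budget(text, toy_offsets(text), budget=3)
print(pieces)  # ['one two three', 'four five six', 'seven']
```

Because each piece is cut from the start of its first token to the end of its last token, any whitespace between two windows is dropped rather than carried into either piece.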

In the example invocation of `MaxTokenLimitingChunker` below, a single original chunk (`self_ref == "#/texts/8"`) is split into three chunks, each repeating the heading and staying within the 64-token limit:

In [4]:
chunker = MaxTokenLimitingChunker(max_tokens=64)
chunk_iter = chunker.chunk(dl_doc=doc)

for chunk in chunk_iter:
    meta = DocMeta.model_validate(chunk.meta)
    if meta.doc_items[0].self_ref == "#/texts/8":
        # Count tokens the same way the chunker does (no special tokens).
        num_tokens = len(
            chunker.tokenizer(
                chunk.text, return_offsets_mapping=True, add_special_tokens=False
            )["offset_mapping"]
        )
        display(f"len={num_tokens} text={chunk.text}")

'len=64 text=1 Introduction\nConverting PDF documents back into a machine-processable format has been a major challenge for decades due to their huge variability in formats, weak standardization and printing-optimized characteristic, which discards most structural features and metadata. With the advent of LLMs and popular application patterns such as retrieval-augmented generation ('

'len=64 text=1 Introduction\nRAG), leveraging the rich content embedded in PDFs has become ever more relevant. In the past decade, several powerful document understanding solutions have emerged on the market, most of which are commercial software, cloud offerings [3] and most recently, multi-modal vision-language models. As of today'

'len=26 text=1 Introduction\n, only a handful of open-source tools cover PDF conversion, leaving a significant feature and quality gap to proprietary solutions.'
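
The printed lengths follow from simple budget arithmetic: each piece may hold at most `max_tokens` minus the metadata and delimiter tokens, and the last piece holds the remainder. Below is a rough sketch of that calculation. The helper `piece_token_counts` is hypothetical, and the token counts passed to it are assumed values chosen to be consistent with the lengths printed above; they were not measured from the tokenizer. The sketch also assumes that tokenizing the joined string yields the sum of the separately tokenized parts, which need not hold for every tokenizer.

```python
from typing import List


def piece_token_counts(
    num_text_tokens: int,
    num_meta_tokens: int,
    num_delim_tokens: int,
    max_tokens: int,
) -> List[int]:
    """Return the total token count of each emitted piece (non-empty text only)."""
    overhead = num_meta_tokens + num_delim_tokens
    budget = max_tokens - overhead  # tokens left for the chunk text itself
    if budget <= 0:
        raise ValueError("max_tokens leaves no room for text after metadata")
    counts = []
    remaining = num_text_tokens
    while remaining > 0:
        take = min(budget, remaining)
        counts.append(take + overhead)  # every piece repeats the metadata
        remaining -= take
    return counts


# Assumed counts (not measured): 148 text tokens, 2 heading tokens,
# 0 delimiter tokens, and the max_tokens=64 used in the invocation above.
counts = piece_token_counts(148, 2, 0, 64)
print(counts)  # [64, 64, 26]
```

Since the metadata is repeated in every piece, a long heading or caption reduces the text budget of all pieces, so a small `max_tokens` can multiply the number of chunks.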