# Serialization

## Overview

This notebook shows how to use Docling [serializers](../../concepts/serialization).

## Setup

In [1]:
%pip install -qU pip docling docling-core~=2.29

Note: you may need to restart the kernel to use updated packages.


In [2]:
DOC_SOURCE = "https://arxiv.org/pdf/2311.18481"

# we set some start-stop cues for defining an excerpt to print
start_cue_incl = "Copyright © 2024"
stop_cue_excl = "Application of NLP to ESG"
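
The cells below locate these cues with `str.find`, which returns `-1` when a substring is absent, so a missing cue silently produces a wrong slice instead of an error. As a minimal sketch, a small helper can make that failure explicit. The name `excerpt` and the sample string are hypothetical and not part of Docling:

```python
def excerpt(text: str, start_cue_incl: str, stop_cue_excl: str) -> str:
    """Return the text from the start cue (inclusive) up to the stop cue (exclusive)."""
    start = text.find(start_cue_incl)
    if start == -1:
        raise ValueError(f"start cue not found: {start_cue_incl!r}")
    # Assumption: the stop cue is only searched for after the start cue.
    stop = text.find(stop_cue_excl, start)
    if stop == -1:
        raise ValueError(f"stop cue not found: {stop_cue_excl!r}")
    return text[start:stop]


# Hypothetical sample text, standing in for a serialized document.
sample = "Intro. Copyright © 2024, AAAI. Table 1. Application of NLP to ESG follows."
demo_excerpt = excerpt(sample, "Copyright © 2024", "Application of NLP to ESG")
print(demo_excerpt)
```

Unlike the plain slice used in the cells below, this sketch searches for the stop cue only from the start position onwards, which avoids an empty result when the stop cue also occurs earlier in the text.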

## Basic usage

We first convert the document:

In [3]:
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
doc = converter.convert(source=DOC_SOURCE).document



We can now apply any `BaseDocSerializer` to the produced document.

For example, below we apply an `HTMLDocSerializer`:

In [4]:
from docling_core.transforms.serializer.html import HTMLDocSerializer

serializer = HTMLDocSerializer(doc=doc)
ser_result = serializer.serialize()
ser_text = ser_result.text

print(ser_text[ser_text.find(start_cue_incl) : ser_text.find(stop_cue_excl)])

Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.</p>
<table><tbody><tr><th>Report</th><th>Question</th><th>Answer</th></tr><tr><td>IBM 2022</td><td>How many hours were spent on employee learning in 2021?</td><td>22.5 million hours</td></tr><tr><td>IBM 2022</td><td>What was the rate of fatalities in 2021?</td><td>The rate of fatalities in 2021 was 0.0016.</td></tr><tr><td>IBM 2022</td><td>How many full audits were con- ducted in 2022 in India?</td><td>2</td></tr><tr><td>Starbucks 2022</td><td>What is the percentage of women in the Board of Directors?</td><td>25%</td></tr><tr><td>Starbucks 2022</td><td>What was the total energy con- sumption in 2021?</td><td>According to the table, the total energy consumption in 2021 was 2,491,543 MWh.</td></tr><tr><td>Starbucks 2022</td><td>How much packaging material was made from renewable mate- rials?</td><td>According to the given data, 31% of packaging materials were made from recycl

In the following example, we use a `MarkdownDocSerializer`:

In [5]:
from docling_core.transforms.serializer.markdown import MarkdownDocSerializer

serializer = MarkdownDocSerializer(doc=doc)
ser_result = serializer.serialize()
ser_text = ser_result.text

print(ser_text[ser_text.find(start_cue_incl) : ser_text.find(stop_cue_excl)])

Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

| Report         | Question                                                         | Answer                                                                                                          |
|----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| IBM 2022       | How many hours were spent on employee learning in 2021?          | 22.5 million hours                                                                                              |
| IBM 2022       | What was the rate of fatalities in 2021?                         | The rate of fatalities in 2021 was 0.0016.                                                                      |
| IBM 2022       | How many full audits were con- ducted in 2022 in India?          |

## Configuring a serializer

Let's now assume we would like to reconfigure the Markdown serialization so that:
- it uses a different component serializer, e.g. to print tables in a triplet format (which could potentially improve the vector representation compared to Markdown tables)
- it uses specific user-defined parameters, e.g. a different image placeholder text than the default one
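
To make the triplet idea concrete, here is a rough pure-Python sketch of how a table can be flattened into "row key, column name = value" statements. It is not the `TripletTableSerializer` implementation; the function name `to_triplets` is hypothetical, and treating the first column as the row key is an assumption made for illustration. The sample rows are taken from the table shown in the outputs above:

```python
def to_triplets(header: list[str], rows: list[list[str]]) -> str:
    """Flatten a table into 'row key, column name = value' statements."""
    parts = []
    for row in rows:
        row_key = row[0]  # assumption: the first column identifies the row
        for col_name, value in zip(header[1:], row[1:]):
            parts.append(f"{row_key}, {col_name} = {value}")
    return ". ".join(parts)


header = ["Report", "Question", "Answer"]
rows = [
    ["IBM 2022", "How many hours were spent on employee learning in 2021?", "22.5 million hours"],
    ["Starbucks 2022", "What is the percentage of women in the Board of Directors?", "25%"],
]
triplet_text = to_triplets(header, rows)
print(triplet_text)
```

Each statement carries its own row and column context, which is why this layout may suit embedding-based retrieval better than a grid whose meaning depends on column alignment.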

Check out the following configuration and notice the serialization differences in the output further below:

In [6]:
from docling_core.transforms.chunker.hierarchical_chunker import TripletTableSerializer
from docling_core.transforms.serializer.markdown import MarkdownParams

serializer = MarkdownDocSerializer(
    doc=doc,
    table_serializer=TripletTableSerializer(),
    params=MarkdownParams(
        image_placeholder="<!-- demo picture placeholder -->",
        # ...
    ),
)
ser_result = serializer.serialize()
ser_text = ser_result.text

print(ser_text[ser_text.find(start_cue_incl) : ser_text.find(stop_cue_excl)])

Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

IBM 2022, Question = How many hours were spent on employee learning in 2021?. IBM 2022, Answer = 22.5 million hours. IBM 2022, Question = What was the rate of fatalities in 2021?. IBM 2022, Answer = The rate of fatalities in 2021 was 0.0016.. IBM 2022, Question = How many full audits were con- ducted in 2022 in India?. IBM 2022, Answer = 2. Starbucks 2022, Question = What is the percentage of women in the Board of Directors?. Starbucks 2022, Answer = 25%. Starbucks 2022, Question = What was the total energy con- sumption in 2021?. Starbucks 2022, Answer = According to the table, the total energy consumption in 2021 was 2,491,543 MWh.. Starbucks 2022, Question = How much packaging material was made from renewable mate- rials?. Starbucks 2022, Answer = According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22.

Table 1: Ex

## Creating a custom serializer

In the examples above, we were able to reuse existing implementations for our desired
serialization strategy. Let's now assume we want to define custom serialization
logic, e.g. we would like picture serialization to include any available picture
description (captioning) annotations.

To that end, we first need to revisit our conversion and include all pipeline options
needed for
[picture description enrichment](https://docling-project.github.io/docling/usage/enrichments/#picture-description).

In [7]:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    PictureDescriptionVlmOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_picture_description=True,
    picture_description_options=PictureDescriptionVlmOptions(
        repo_id="HuggingFaceTB/SmolVLM-256M-Instruct",
        prompt="Describe this picture in three to five sentences. Be precise and concise.",
    ),
    generate_picture_images=True,
    images_scale=2,
)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
doc = converter.convert(source=DOC_SOURCE).document



We can then define our custom picture serializer:

In [8]:
from typing import Any, Optional

from docling_core.transforms.serializer.base import (
    BaseDocSerializer,
    SerializationResult,
)
from docling_core.transforms.serializer.common import create_ser_result
from docling_core.transforms.serializer.markdown import (
    MarkdownParams,
    MarkdownPictureSerializer,
)
from docling_core.types.doc.document import (
    DoclingDocument,
    ImageRefMode,
    PictureDescriptionData,
    PictureItem,
)
from typing_extensions import override


class AnnotationPictureSerializer(MarkdownPictureSerializer):
    @override
    def serialize(
        self,
        *,
        item: PictureItem,
        doc_serializer: BaseDocSerializer,
        doc: DoclingDocument,
        separator: Optional[str] = None,
        **kwargs: Any,
    ) -> SerializationResult:
        text_parts: list[str] = []

        # reusing the existing result:
        parent_res = super().serialize(
            item=item,
            doc_serializer=doc_serializer,
            doc=doc,
            **kwargs,
        )
        text_parts.append(parent_res.text)

        # appending annotations:
        for annotation in item.annotations:
            if isinstance(annotation, PictureDescriptionData):
                text_parts.append(f"<!-- Picture description: {annotation.text} -->")

        text_res = (separator or "\n").join(text_parts)
        return create_ser_result(text=text_res, span_source=item)
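
The text-assembly step of `serialize` can be tried out in isolation, without running a conversion. The sketch below mirrors only that step. `Description` is a hypothetical stand-in for `PictureDescriptionData`, and `assemble` is an illustrative name, not a Docling function:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Description:
    """Hypothetical stand-in for a picture description annotation."""

    text: str


def assemble(base_text: str, annotations: list[Any], separator: Optional[str] = None) -> str:
    """Join the parent serializer's text with one comment per description annotation."""
    text_parts = [base_text]
    for annotation in annotations:
        # Annotations of other types are skipped, as in the class above.
        if isinstance(annotation, Description):
            text_parts.append(f"<!-- Picture description: {annotation.text} -->")
    return (separator or "\n").join(text_parts)


assembled = assemble("<!-- image -->", [Description("A bar chart."), "other annotation"])
print(assembled)
```

Note that `separator or "\n"` treats an empty string the same as `None`, so passing `separator=""` still joins the parts with a newline.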

Finally, we define a new doc serializer that uses our custom picture
serializer.

Notice the picture description annotations in the output below:

In [9]:
serializer = MarkdownDocSerializer(
    doc=doc,
    picture_serializer=AnnotationPictureSerializer(),
    params=MarkdownParams(
        image_mode=ImageRefMode.PLACEHOLDER,
        image_placeholder="",
    ),
)
ser_result = serializer.serialize()
ser_text = ser_result.text

print(ser_text[ser_text.find(start_cue_incl) : ser_text.find(stop_cue_excl)])

Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

| Report         | Question                                                         | Answer                                                                                                          |
|----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| IBM 2022       | How many hours were spent on employee learning in 2021?          | 22.5 million hours                                                                                              |
| IBM 2022       | What was the rate of fatalities in 2021?                         | The rate of fatalities in 2021 was 0.0016.                                                                      |
| IBM 2022       | How many full audits were con- ducted in 2022 in India?          |