mirror of
https://github.com/DS4SD/docling.git
synced 2025-12-08 20:58:11 +00:00
docs: add serialization docs, update chunking docs (#1556)
* docs: add serializers docs, update chunking docs

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update notebook to improve MD table rendering

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
@@ -10,7 +10,8 @@ For each document format, the *document converter* knows which format-specific *

 The *conversion result* contains the [*Docling document*](./docling_document.md), Docling's fundamental document representation.

-Some typical scenarios for using a Docling document include directly calling its *export methods*, such as for markdown, dictionary etc., or having it chunked by a [*chunker*](./chunking.md).
+Some typical scenarios for using a Docling document include directly calling its *export methods*, such as for markdown, dictionary etc., or having it serialized by a
+[*serializer*](./serialization.md) or chunked by a [*chunker*](./chunking.md).

 For more details on Docling's architecture, check out the [Docling Technical Report](https://arxiv.org/abs/2408.09869).

@@ -31,7 +31,7 @@ The `BaseChunker` base class API defines that any chunker should provide the fol

 - `def chunk(self, dl_doc: DoclingDocument, **kwargs) -> Iterator[BaseChunk]`:
 Returning the chunks for the provided document.
-- `def serialize(self, chunk: BaseChunk) -> str`:
+- `def contextualize(self, chunk: BaseChunk) -> str`:
 Returning the potentially metadata-enriched serialization of the chunk, typically
 used to feed an embedding model (or generation model).

@@ -44,10 +44,14 @@ The `BaseChunker` base class API defines that any chunker should provide the fol
 from docling.chunking import HybridChunker
 ```
 - If you are only using the `docling-core` package, you must ensure to install
-the `chunking` extra, e.g.
+the `chunking` extra if you want to use HuggingFace tokenizers, e.g.
 ```shell
 pip install 'docling-core[chunking]'
 ```
+or the `chunking-openai` extra if you prefer Open AI tokenizers (tiktoken), e.g.
+```shell
+pip install 'docling-core[chunking-openai]'
+```
 and then you
 can import as follows:
 ```python

40
docs/concepts/serialization.md
Normal file
@@ -0,0 +1,40 @@
## Introduction

A *document serializer* (AKA simply *serializer*) is a Docling abstraction that is
initialized with a given [`DoclingDocument`](./docling_document.md) and returns a
textual representation for that document.

Besides the document serializer, Docling defines similar abstractions for several
document subcomponents, for example: *text serializer*, *table serializer*,
*picture serializer*, *list serializer*, *inline serializer*, and more.

Last but not least, a *serializer provider* is a wrapper that abstracts the
document serialization strategy from the document instance.

## Base classes

To enable both flexibility for downstream applications and out-of-the-box utility,
Docling defines a serialization class hierarchy, providing:

- base types for the above abstractions: `BaseDocSerializer`, as well as
`BaseTextSerializer`, `BaseTableSerializer` etc, and `BaseSerializerProvider`, and
- specific subclasses for the above-mentioned base types, e.g. `MarkdownDocSerializer`.

You can review all methods required to define the above base classes [here](https://github.com/docling-project/docling-core/blob/main/docling_core/transforms/serializer/base.py).

From a client perspective, the most relevant is `BaseDocSerializer.serialize()`, which
returns the textual representation, as well as relevant metadata on which document
components contributed to that serialization.
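
The relationship between a base serializer, a format-specific subclass, a provider, and the result of `serialize()` can be sketched in plain Python. This is only an illustration of the pattern described above, not the docling-core implementation: every `Toy*` class, its fields, and the `get_serializer` method name are hypothetical stand-ins chosen for this sketch, and the real base classes define more methods than shown here.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ToyItem:
    """Hypothetical document component (not a docling-core type)."""
    ref: str   # identifier of the component inside the document
    kind: str  # "heading" or "paragraph"
    text: str


@dataclass
class ToyResult:
    """Hypothetical result: the text plus the components that produced it."""
    text: str
    sources: list[str] = field(default_factory=list)


class ToyBaseDocSerializer(ABC):
    """Stand-in for a base document serializer, bound to one document."""

    def __init__(self, items: list[ToyItem]) -> None:
        self.items = items

    @abstractmethod
    def serialize_item(self, item: ToyItem) -> str:
        """Format-specific rendering of a single component."""

    def serialize(self) -> ToyResult:
        # Render every component and record which ones contributed.
        parts = [self.serialize_item(item) for item in self.items]
        refs = [item.ref for item in self.items]
        return ToyResult(text="\n\n".join(parts), sources=refs)


class ToyMarkdownDocSerializer(ToyBaseDocSerializer):
    """Stand-in for a concrete, Markdown-flavored subclass."""

    def serialize_item(self, item: ToyItem) -> str:
        return f"## {item.text}" if item.kind == "heading" else item.text


class ToySerializerProvider:
    """Stand-in for a provider: fixes the strategy, receives the document later."""

    def get_serializer(self, items: list[ToyItem]) -> ToyBaseDocSerializer:
        return ToyMarkdownDocSerializer(items)


items = [
    ToyItem(ref="#/texts/0", kind="heading", text="Introduction"),
    ToyItem(ref="#/texts/1", kind="paragraph", text="Serializers return text."),
]
result = ToySerializerProvider().get_serializer(items).serialize()
print(result.text)
print(result.sources)
```

The provider is what lets a caller choose a serialization strategy once and apply it to whichever document arrives later, while the result object keeps the text and its provenance together.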

## Use in `DoclingDocument` export methods

Docling provides predefined serializers for Markdown, HTML, and DocTags.

The respective `DoclingDocument` export methods (e.g. `export_to_markdown()`) are
provided as user shorthands — internally directly instantiating and delegating to
respective serializers.
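
The delegation described here can be illustrated with a small, self-contained sketch. It assumes nothing about docling-core internals: `ToyDocument`, `ToyMarkdownSerializer`, and `ToyQuotedSerializer` are hypothetical names invented for this example, and only the method name `export_to_markdown()` is taken from the text above.

```python
from dataclasses import dataclass, field


class ToyMarkdownSerializer:
    """Hypothetical predefined serializer, bound to one document."""

    def __init__(self, doc: "ToyDocument") -> None:
        self.doc = doc

    def render_paragraph(self, text: str) -> str:
        return text

    def serialize(self) -> str:
        return "\n\n".join(self.render_paragraph(p) for p in self.doc.paragraphs)


@dataclass
class ToyDocument:
    """Hypothetical document type (not DoclingDocument)."""
    paragraphs: list[str] = field(default_factory=list)

    def export_to_markdown(self) -> str:
        # Shorthand: instantiate the predefined serializer and delegate to it.
        return ToyMarkdownSerializer(self).serialize()


class ToyQuotedSerializer(ToyMarkdownSerializer):
    """Customization that the shorthand alone would not offer."""

    def render_paragraph(self, text: str) -> str:
        return f"> {text}"


doc = ToyDocument(paragraphs=["First paragraph.", "Second paragraph."])
shorthand = doc.export_to_markdown()
explicit = ToyMarkdownSerializer(doc).serialize()
quoted = ToyQuotedSerializer(doc).serialize()
print(shorthand == explicit)
print(quoted)
```

In this sketch the export method and the explicit serializer call produce the same text, and working with the serializer directly is what makes it possible to swap in a customized subclass.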

## Examples

For an example showcasing how to use serializers, see
[here](../examples/serialization.ipynb).