docs: add use docling (#150)

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
This commit is contained in:
Michele Dolfi
2024-10-17 18:14:48 +02:00
committed by GitHub
parent 24f949ada2
commit 61c092f445
10 changed files with 271 additions and 63 deletions

View File

@@ -1,4 +1,4 @@
With Docling v2, we introduce a unified document representation format called `DoclingDocument`. It is defined as a
With Docling v2, we introduce a unified document representation format called `DoclingDocument`. It is defined as a
pydantic datatype, which can express several features common to documents, such as:
* Text, Tables, Pictures, and more
@@ -9,15 +9,16 @@ pydantic datatype, which can express several features common to documents, such
It also brings a set of document construction APIs to build up a `DoclingDocument` from scratch.
# Example document structures
## Example document structures
To illustrate the features of the `DoclingDocument` format, consider the following side-by-side comparison of a
`DoclingDocument` converted from `test/data/word_sample.docx`. Left side shows snippets from the converted document
serialized as YAML, right side shows the corresponding visual parts in MS Word.
To illustrate the features of the `DoclingDocument` format, in the subsections below we consider the
`DoclingDocument` converted from `tests/data/word_sample.docx` and we present some side-by-side comparisons,
where the left side shows snippets from the converted document
serialized as YAML and the right one shows the corresponding parts of the original MS Word.
## Basic structure
### Basic structure
A `DoclingDocument` exposes top-level fields for the document content, organized in two categories.
A `DoclingDocument` exposes top-level fields for the document content, organized in two categories.
The first category is the _content items_, which are stored in these fields:
- `texts`: All items that have a text representation (paragraph, section heading, equation, ...). Base class is `TextItem`.
@@ -34,32 +35,34 @@ The second category is _content structure_, which is encapsualted in:
- `furniture`: The root node of a tree-structure for all items that don't belong into the body (headers, footers, ...)
- `groups`: A set of items that don't represent content, but act as containers for other content items (e.g. a list, a chapter)
All of the above fields are only storing `NodeItem` instances, which reference children and parents
through JSON pointers.
All of the above fields are only storing `NodeItem` instances, which reference children and parents
through JSON pointers.
The reading order of the document is encapsulated through the `body` tree and the order of _children_ in each item
in the tree.
Below example shows how all items in the first page are nested below the `title` item (`#/texts/1`).
Below example shows how all items in the first page are nested below the `title` item (`#/texts/1`).
![doc_hierarchy_1](../assets/docling_doc_hierarchy_1.png)
## Grouping
### Grouping
Below example shows how all items under the heading "Let's swim" (`#/texts/5`) are nested as chilrden. The children of
"Let's swim" are both text items and groups, which contain the list elements. The group items are stored in the
"Let's swim" are both text items and groups, which contain the list elements. The group items are stored in the
top-level `groups` field.
![doc_hierarchy_2](../assets/docling_doc_hierarchy_2.png)
## Tables
<!--
### Tables
TBD
## Pictures
### Pictures
TBD
## Provenance
### Provenance
TBD
TBD
-->