mirror of
https://github.com/DS4SD/docling.git
synced 2025-07-26 20:14:47 +00:00
Merge remote-tracking branch 'origin/main' into md_table_improvements
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
commit 1f47908bc7
6
.github/workflows/checks.yml
vendored
@@ -22,8 +22,8 @@ jobs:
         python-version: ['3.9', '3.10', '3.11', '3.12', '3.13']
     steps:
       - uses: actions/checkout@v4
-      - name: Install tesseract
-        run: sudo apt-get update && sudo apt-get install -y tesseract-ocr tesseract-ocr-eng tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-spa tesseract-ocr-script-latn libleptonica-dev libtesseract-dev pkg-config
+      - name: Install tesseract and ffmpeg
+        run: sudo apt-get update && sudo apt-get install -y ffmpeg tesseract-ocr tesseract-ocr-eng tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-spa tesseract-ocr-script-latn libleptonica-dev libtesseract-dev pkg-config
       - name: Set TESSDATA_PREFIX
         run: |
           echo "TESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$)" >> "$GITHUB_ENV"
@@ -60,7 +60,7 @@ jobs:
         run: |
           for file in docs/examples/*.py; do
             # Skip batch_convert.py
-            if [[ "$(basename "$file")" =~ ^(batch_convert|compare_vlm_models|minimal|minimal_vlm_pipeline|export_multimodal|custom_convert|develop_picture_enrichment|rapidocr_with_custom_models|offline_convert|pictures_description|pictures_description_api|vlm_pipeline_api_model).py ]]; then
+            if [[ "$(basename "$file")" =~ ^(batch_convert|compare_vlm_models|minimal|minimal_vlm_pipeline|minimal_asr_pipeline|export_multimodal|custom_convert|develop_picture_enrichment|rapidocr_with_custom_models|offline_convert|pictures_description|pictures_description_api|vlm_pipeline_api_model).py ]]; then
               echo "Skipping $file"
               continue
             fi
20
CHANGELOG.md
@@ -1,3 +1,23 @@
+## [v2.38.0](https://github.com/docling-project/docling/releases/tag/v2.38.0) - 2025-06-23
+
+### Feature
+
+* Support audio input ([#1763](https://github.com/docling-project/docling/issues/1763)) ([`1557e7c`](https://github.com/docling-project/docling/commit/1557e7ce3e036fb51eb118296f5cbff3b6dfbfa7))
+* **markdown:** Add formatting & improve inline support ([#1804](https://github.com/docling-project/docling/issues/1804)) ([`861abcd`](https://github.com/docling-project/docling/commit/861abcdcb0d406342b9566f81203b87cf32b7ad0))
+* Maximum image size for Vlm models ([#1802](https://github.com/docling-project/docling/issues/1802)) ([`215b540`](https://github.com/docling-project/docling/commit/215b540f6c078a72464310ef22975ebb6cde4f0a))
+
+### Fix
+
+* **docx:** Ensure list items have a list parent ([#1827](https://github.com/docling-project/docling/issues/1827)) ([`d26dac6`](https://github.com/docling-project/docling/commit/d26dac61a86b0af5b16686f78956ba047bcbddba))
+* **msword_backend:** Identify text in the same line after an image #1425 ([#1610](https://github.com/docling-project/docling/issues/1610)) ([`1350a8d`](https://github.com/docling-project/docling/commit/1350a8d3e5ea3c4b4d506757758880c8f78efd8c))
+* Ensure uninitialized pages are removed before assembling document ([#1812](https://github.com/docling-project/docling/issues/1812)) ([`dd7f64f`](https://github.com/docling-project/docling/commit/dd7f64ff28226cd9964fc4d8ba807b2c8a6358ef))
+* Formula conversion with page_range param set ([#1791](https://github.com/docling-project/docling/issues/1791)) ([`dbab30e`](https://github.com/docling-project/docling/commit/dbab30e92cc1d130ce7f9335ab9c46aa7a30930d))
+
+### Documentation
+
+* Update readme and add ASR example ([#1836](https://github.com/docling-project/docling/issues/1836)) ([`f3ae302`](https://github.com/docling-project/docling/commit/f3ae3029b8a6d6f0109383fbc82ebf9da3942afd))
+* Support running examples from root or subfolder ([#1816](https://github.com/docling-project/docling/issues/1816)) ([`64ac043`](https://github.com/docling-project/docling/commit/64ac043786efdece0c61827051a5b41dddf6c5d7))
+
 ## [v2.37.0](https://github.com/docling-project/docling/releases/tag/v2.37.0) - 2025-06-16

 ### Feature
@@ -28,14 +28,15 @@ Docling simplifies document processing, parsing diverse formats — including ad

 ## Features

-* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
+* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, images (PNG, TIFF, JPEG, ...), and more
 * 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
 * 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
-* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
+* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, [DocTags](https://arxiv.org/abs/2503.11576) and lossless JSON
 * 🔒 Local execution capabilities for sensitive data and air-gapped environments
 * 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
 * 🔍 Extensive OCR support for scanned PDFs and images
-* 🥚 Support of several Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview))
+* 👓 Support of several Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview))
+* 🎙️ Support for Audio with Automatic Speech Recognition (ASR) models
 * 💻 Simple and convenient CLI

 ### Coming soon
@@ -2,9 +2,10 @@ import logging
 import re
 import warnings
 from copy import deepcopy
+from enum import Enum
 from io import BytesIO
 from pathlib import Path
-from typing import List, Optional, Set, Union
+from typing import List, Literal, Optional, Set, Union

 import marko
 import marko.element
@@ -21,7 +22,8 @@ from docling_core.types.doc import (
 )
 from docling_core.types.doc.document import Formatting, OrderedList, UnorderedList
 from marko import Markdown
-from pydantic import AnyUrl, TypeAdapter
+from pydantic import AnyUrl, BaseModel, Field, TypeAdapter
+from typing_extensions import Annotated

 from docling.backend.abstract_backend import DeclarativeDocumentBackend
 from docling.backend.html_backend import HTMLDocumentBackend
@@ -35,6 +37,31 @@ _START_MARKER = f"#_#_{_MARKER_BODY}_START_#_#"
 _STOP_MARKER = f"#_#_{_MARKER_BODY}_STOP_#_#"


+class _PendingCreationType(str, Enum):
+    """CoordOrigin."""
+
+    HEADING = "heading"
+    LIST_ITEM = "list_item"
+
+
+class _HeadingCreationPayload(BaseModel):
+    kind: Literal["heading"] = "heading"
+    level: int
+
+
+class _ListItemCreationPayload(BaseModel):
+    kind: Literal["list_item"] = "list_item"
+
+
+_CreationPayload = Annotated[
+    Union[
+        _HeadingCreationPayload,
+        _ListItemCreationPayload,
+    ],
+    Field(discriminator="kind"),
+]
+
+
 class MarkdownDocumentBackend(DeclarativeDocumentBackend):
     def _shorten_underscore_sequences(self, markdown_text: str, max_length: int = 10):
         # This regex will match any sequence of underscores
@@ -155,6 +182,52 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
                 doc.add_table(data=table_data)
         return

+    def _create_list_item(
+        self,
+        doc: DoclingDocument,
+        parent_item: Optional[NodeItem],
+        text: str,
+        formatting: Optional[Formatting] = None,
+        hyperlink: Optional[Union[AnyUrl, Path]] = None,
+    ):
+        if not isinstance(parent_item, (OrderedList, UnorderedList)):
+            _log.warning("ListItem would have not had a list parent, adding one.")
+            parent_item = doc.add_unordered_list(parent=parent_item)
+        item = doc.add_list_item(
+            text=text,
+            enumerated=(isinstance(parent_item, OrderedList)),
+            parent=parent_item,
+            formatting=formatting,
+            hyperlink=hyperlink,
+        )
+        return item
+
+    def _create_heading_item(
+        self,
+        doc: DoclingDocument,
+        parent_item: Optional[NodeItem],
+        text: str,
+        level: int,
+        formatting: Optional[Formatting] = None,
+        hyperlink: Optional[Union[AnyUrl, Path]] = None,
+    ):
+        if level == 1:
+            item = doc.add_title(
+                text=text,
+                parent=parent_item,
+                formatting=formatting,
+                hyperlink=hyperlink,
+            )
+        else:
+            item = doc.add_heading(
+                text=text,
+                level=level - 1,
+                parent=parent_item,
+                formatting=formatting,
+                hyperlink=hyperlink,
+            )
+        return item
+
     def _iterate_elements(  # noqa: C901
         self,
         *,
@@ -162,6 +235,9 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
         depth: int,
         doc: DoclingDocument,
         visited: Set[marko.element.Element],
+        creation_stack: list[
+            _CreationPayload
+        ],  # stack for lazy item creation triggered deep in marko's AST (on RawText)
         parent_item: Optional[NodeItem] = None,
         formatting: Optional[Formatting] = None,
         hyperlink: Optional[Union[AnyUrl, Path]] = None,
@@ -177,28 +253,17 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
                 f" - Heading level {element.level}, content: {element.children[0].children}"  # type: ignore
             )

-            if len(element.children) == 1:
-                child = element.children[0]
-                snippet_text = str(child.children)  # type: ignore
-                visited.add(child)
-            else:
-                snippet_text = ""  # inline group will be created
-
-            if element.level == 1:
-                parent_item = doc.add_title(
-                    text=snippet_text,
-                    parent=parent_item,
+            if len(element.children) > 1:  # inline group will be created further down
+                parent_item = self._create_heading_item(
+                    doc=doc,
+                    parent_item=parent_item,
+                    text="",
+                    level=element.level,
                     formatting=formatting,
                     hyperlink=hyperlink,
                 )
             else:
-                parent_item = doc.add_heading(
-                    text=snippet_text,
-                    level=element.level - 1,
-                    parent=parent_item,
-                    formatting=formatting,
-                    hyperlink=hyperlink,
-                )
+                creation_stack.append(_HeadingCreationPayload(level=element.level))

         elif isinstance(element, marko.block.List):
             has_non_empty_list_items = False
@@ -224,22 +289,16 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
             self._close_table(doc)
             _log.debug(" - List item")

-            if len(child.children) == 1:
-                snippet_text = str(child.children[0].children)  # type: ignore
-                visited.add(child)
-            else:
-                snippet_text = ""  # inline group will be created
-            is_numbered = isinstance(parent_item, OrderedList)
-            if not isinstance(parent_item, (OrderedList, UnorderedList)):
-                _log.warning("ListItem would have not had a list parent, adding one.")
-                parent_item = doc.add_unordered_list(parent=parent_item)
-            parent_item = doc.add_list_item(
-                enumerated=is_numbered,
-                parent=parent_item,
-                text=snippet_text,
+            if len(child.children) > 1:  # inline group will be created further down
+                parent_item = self._create_list_item(
+                    doc=doc,
+                    parent_item=parent_item,
+                    text="",
                     formatting=formatting,
                     hyperlink=hyperlink,
                 )
+            else:
+                creation_stack.append(_ListItemCreationPayload())

         elif isinstance(element, marko.inline.Image):
             self._close_table(doc)
@@ -285,6 +344,31 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
                     self.md_table_buffer.append(snippet_text)
             elif snippet_text:
                 self._close_table(doc)
+
+                if creation_stack:
+                    while len(creation_stack) > 0:
+                        to_create = creation_stack.pop()
+                        if isinstance(to_create, _ListItemCreationPayload):
+                            parent_item = self._create_list_item(
+                                doc=doc,
+                                parent_item=parent_item,
+                                text=snippet_text,
+                                formatting=formatting,
+                                hyperlink=hyperlink,
+                            )
+                        elif isinstance(to_create, _HeadingCreationPayload):
+                            # not keeping as parent_item as logic for correctly tracking
+                            # that not implemented yet (section components not captured
+                            # as heading children in marko)
+                            self._create_heading_item(
+                                doc=doc,
+                                parent_item=parent_item,
+                                text=snippet_text,
+                                level=to_create.level,
+                                formatting=formatting,
+                                hyperlink=hyperlink,
+                            )
+                else:
                     doc.add_text(
                         label=DocItemLabel.TEXT,
                         parent=parent_item,
@@ -353,7 +437,6 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
             parent_item = doc.add_inline_group(parent=parent_item)

         processed_block_types = (
-            # marko.block.Heading,
             marko.block.CodeBlock,
             marko.block.FencedCode,
             marko.inline.RawText,
@@ -369,6 +452,7 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
                     depth=depth + 1,
                     doc=doc,
                     visited=visited,
+                    creation_stack=creation_stack,
                     parent_item=parent_item,
                     formatting=formatting,
                     hyperlink=hyperlink,
@@ -412,6 +496,7 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
                 doc=doc,
                 parent_item=None,
                 visited=set(),
+                creation_stack=[],
             )
             self._close_table(doc=doc)  # handle any last hanging table
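Review note: the Markdown backend change above defers heading and list-item creation until the parser reaches the `RawText` node that carries the text. The following is a minimal, stdlib-only sketch of that idea, not the backend's code: it uses dataclasses in place of the pydantic payload models, and the names `HeadingPayload`, `ListItemPayload`, and `flush_on_text` are hypothetical. Instead of building a `DoclingDocument`, it returns plain tuples.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class HeadingPayload:
    """Pending heading (hypothetical stand-in for _HeadingCreationPayload)."""

    level: int


@dataclass
class ListItemPayload:
    """Pending list item (hypothetical stand-in for _ListItemCreationPayload)."""


Payload = Union[HeadingPayload, ListItemPayload]
Created = Tuple[str, Optional[int], str]  # (kind, level, text)


def flush_on_text(stack: List[Payload], text: str) -> List[Created]:
    """Create every pending item once its text has finally arrived.

    With an empty stack the text becomes a plain text item, which
    corresponds to the `else: doc.add_text(...)` branch in the diff.
    """
    if not stack:
        return [("text", None, text)]
    created: List[Created] = []
    while stack:
        pending = stack.pop()
        if isinstance(pending, ListItemPayload):
            created.append(("list_item", None, text))
        elif isinstance(pending, HeadingPayload):
            # Same mapping as _create_heading_item: level 1 is the title,
            # deeper headings are shifted down by one.
            if pending.level == 1:
                created.append(("title", None, text))
            else:
                created.append(("heading", pending.level - 1, text))
    return created


stack: List[Payload] = [HeadingPayload(level=2)]  # parser saw "## ..." with one child
print(flush_on_text(stack, "Usage"))  # [('heading', 1, 'Usage')]
print(flush_on_text(stack, "plain"))  # [('text', None, 'plain')]
```

The stack is drained by the first text that arrives, so a later call with the same (now empty) stack falls through to the plain-text case.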
@@ -14,7 +14,7 @@ from docling_core.types.doc import (
     TableCell,
     TableData,
 )
-from docling_core.types.doc.document import Formatting
+from docling_core.types.doc.document import Formatting, OrderedList, UnorderedList
 from docx import Document
 from docx.document import Document as DocxDocument
 from docx.oxml.table import CT_Tc
@@ -84,7 +84,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
             self.valid = True
         except Exception as e:
             raise RuntimeError(
-                f"MsPowerpointDocumentBackend could not load document with hash {self.document_hash}"
+                f"MsWordDocumentBackend could not load document with hash {self.document_hash}"
             ) from e

     @override
@@ -251,9 +251,15 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
                     self._handle_tables(element, docx_obj, doc)
                 except Exception:
                     _log.debug("could not parse a table, broken docx table")
-
+            # Check for Image
             elif drawing_blip:
                 self._handle_pictures(docx_obj, drawing_blip, doc)
+                # Check for Text after the Image
+                if (
+                    tag_name in ["p"]
+                    and element.find(".//w:t", namespaces=namespaces) is not None
+                ):
+                    self._handle_text_elements(element, docx_obj, doc)
             # Check for the sdt containers, like table of contents
             elif tag_name in ["sdt"]:
                 sdt_content = element.find(".//w:sdtContent", namespaces=namespaces)
@@ -268,6 +274,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
                 self._handle_text_elements(element, docx_obj, doc)
             else:
                 _log.debug(f"Ignoring element in DOCX with tag: {tag_name}")
+
         return doc

     def _str_to_int(
@@ -390,7 +397,11 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
             if isinstance(c, Hyperlink):
                 text = c.text
                 hyperlink = Path(c.address)
-                format = self._get_format_from_run(c.runs[0])
+                format = (
+                    self._get_format_from_run(c.runs[0])
+                    if c.runs and len(c.runs) > 0
+                    else None
+                )
             elif isinstance(c, Run):
                 text = c.text
                 hyperlink = None
@@ -578,7 +589,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
         all_paragraphs = []

         # Sort paragraphs within each container, then process containers
-        for container_id, paragraphs in container_paragraphs.items():
+        for paragraphs in container_paragraphs.values():
             # Sort by vertical position within each container
             sorted_container_paragraphs = sorted(
                 paragraphs,
@@ -689,14 +700,13 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
         doc: DoclingDocument,
     ) -> None:
         paragraph = Paragraph(element, docx_obj)
-
-        paragraph_elements = self._get_paragraph_elements(paragraph)
         text, equations = self._handle_equations_in_text(
             element=element, text=paragraph.text
         )

         if text is None:
             return
+        paragraph_elements = self._get_paragraph_elements(paragraph)
         text = text.strip()

         # Common styles for bullet and numbered lists.
@@ -912,6 +922,44 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
         )
         return

+    def _add_formatted_list_item(
+        self,
+        doc: DoclingDocument,
+        elements: list,
+        marker: str,
+        enumerated: bool,
+        level: int,
+    ) -> None:
+        # This should not happen by construction
+        if not isinstance(self.parents[level], (OrderedList, UnorderedList)):
+            return
+        if len(elements) == 1:
+            text, format, hyperlink = elements[0]
+            doc.add_list_item(
+                marker=marker,
+                enumerated=enumerated,
+                parent=self.parents[level],
+                text=text,
+                formatting=format,
+                hyperlink=hyperlink,
+            )
+        else:
+            new_item = doc.add_list_item(
+                marker=marker,
+                enumerated=enumerated,
+                parent=self.parents[level],
+                text="",
+            )
+            new_parent = doc.add_group(label=GroupLabel.INLINE, parent=new_item)
+            for text, format, hyperlink in elements:
+                doc.add_text(
+                    label=DocItemLabel.TEXT,
+                    parent=new_parent,
+                    text=text,
+                    formatting=format,
+                    hyperlink=hyperlink,
+                )
+
     def _add_list_item(
         self,
         *,
@@ -921,6 +969,9 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
         elements: list,
         is_numbered: bool = False,
     ) -> None:
+        # TODO: this method is always called with is_numbered. Numbered lists should be properly addressed.
+        if not elements:
+            return None
         enum_marker = ""

         level = self._get_level()
@@ -937,21 +988,9 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
             if is_numbered:
                 enum_marker = str(self.listIter) + "."
                 is_numbered = True
-            new_parent = self._create_or_reuse_parent(
-                doc=doc,
-                prev_parent=self.parents[level],
-                paragraph_elements=elements,
+            self._add_formatted_list_item(
+                doc, elements, enum_marker, is_numbered, level
             )
-            for text, format, hyperlink in elements:
-                doc.add_list_item(
-                    marker=enum_marker,
-                    enumerated=is_numbered,
-                    parent=new_parent,
-                    text=text,
-                    formatting=format,
-                    hyperlink=hyperlink,
-                )
-
         elif (
             self._prev_numid() == numid
             and self.level_at_new_list is not None
@@ -981,20 +1020,12 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
             if is_numbered:
                 enum_marker = str(self.listIter) + "."
                 is_numbered = True
-
-            new_parent = self._create_or_reuse_parent(
-                doc=doc,
-                prev_parent=self.parents[self.level_at_new_list + ilevel],
-                paragraph_elements=elements,
-            )
-            for text, format, hyperlink in elements:
-                doc.add_list_item(
-                    marker=enum_marker,
-                    enumerated=is_numbered,
-                    parent=new_parent,
-                    text=text,
-                    formatting=format,
-                    hyperlink=hyperlink,
+            self._add_formatted_list_item(
+                doc,
+                elements,
+                enum_marker,
+                is_numbered,
+                self.level_at_new_list + ilevel,
             )
         elif (
             self._prev_numid() == numid
@@ -1002,7 +1033,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
             and prev_indent is not None
             and ilevel < prev_indent
         ):  # Close list
-            for k, v in self.parents.items():
+            for k in self.parents:
                 if k > self.level_at_new_list + ilevel:
                     self.parents[k] = None

@@ -1011,19 +1042,12 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
             if is_numbered:
                 enum_marker = str(self.listIter) + "."
                 is_numbered = True
-            new_parent = self._create_or_reuse_parent(
-                doc=doc,
-                prev_parent=self.parents[self.level_at_new_list + ilevel],
-                paragraph_elements=elements,
-            )
-            for text, format, hyperlink in elements:
-                doc.add_list_item(
-                    marker=enum_marker,
-                    enumerated=is_numbered,
-                    parent=new_parent,
-                    text=text,
-                    formatting=format,
-                    hyperlink=hyperlink,
+            self._add_formatted_list_item(
+                doc,
+                elements,
+                enum_marker,
+                is_numbered,
+                self.level_at_new_list + ilevel,
             )
             self.listIter = 0

@@ -1033,21 +1057,10 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
             if is_numbered:
                 enum_marker = str(self.listIter) + "."
                 is_numbered = True
-            new_parent = self._create_or_reuse_parent(
-                doc=doc,
-                prev_parent=self.parents[level - 1],
-                paragraph_elements=elements,
-            )
-            for text, format, hyperlink in elements:
-                # Add the list item to the parent group
-                doc.add_list_item(
-                    marker=enum_marker,
-                    enumerated=is_numbered,
-                    parent=new_parent,
-                    text=text,
-                    formatting=format,
-                    hyperlink=hyperlink,
+            self._add_formatted_list_item(
+                doc, elements, enum_marker, is_numbered, level - 1
             )
+
         return

     def _handle_tables(
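Review note: `_add_formatted_list_item` in the DOCX backend change above produces one list item per paragraph. A single formatted run is stored directly on the item, while several runs go into an inline group under an item with empty text. Below is a rough stdlib-only sketch of that branching, assuming plain dicts in place of `DoclingDocument` nodes. The name `build_list_item` and the dict layout are hypothetical, and the list-parent check from the diff is left out.

```python
from typing import Any, Dict, List, Optional, Tuple

# (text, formatting, hyperlink), mirroring the tuples unpacked in the diff.
Element = Tuple[str, Optional[str], Optional[str]]


def build_list_item(
    elements: List[Element], marker: str, enumerated: bool
) -> Optional[Dict[str, Any]]:
    """Return a dict describing the list item, or None for an empty paragraph."""
    if not elements:
        return None  # same early exit as the new guard in _add_list_item
    item: Dict[str, Any] = {"marker": marker, "enumerated": enumerated}
    if len(elements) == 1:
        # One run: text and formatting live on the list item itself.
        text, formatting, hyperlink = elements[0]
        item.update(text=text, formatting=formatting, hyperlink=hyperlink, inline=[])
    else:
        # Several runs: empty item text, runs collected in an inline group.
        item.update(text="", formatting=None, hyperlink=None)
        item["inline"] = [
            {"text": t, "formatting": f, "hyperlink": h} for t, f, h in elements
        ]
    return item


single = build_list_item([("Install", "bold", None)], marker="1.", enumerated=True)
mixed = build_list_item(
    [("See ", None, None), ("docs", None, "https://example.com")],
    marker="",
    enumerated=False,
)
print(single["text"], len(single["inline"]))  # Install 0
print(repr(mixed["text"]), len(mixed["inline"]))  # '' 2
```

Compared with the removed loop, which emitted one list item per run, this shape keeps a paragraph with mixed formatting as a single list entry.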
51
docling/backend/noop_backend.py
Normal file
@@ -0,0 +1,51 @@
+import logging
+from io import BytesIO
+from pathlib import Path
+from typing import Set, Union
+
+from docling.backend.abstract_backend import AbstractDocumentBackend
+from docling.datamodel.base_models import InputFormat
+from docling.datamodel.document import InputDocument
+
+_log = logging.getLogger(__name__)
+
+
+class NoOpBackend(AbstractDocumentBackend):
+    """
+    A no-op backend that only validates input existence.
+    Used e.g. for audio files where actual processing is handled by the ASR pipeline.
+    """
+
+    def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
+        super().__init__(in_doc, path_or_stream)
+
+        _log.debug(f"NoOpBackend initialized for: {path_or_stream}")
+
+        # Validate input
+        try:
+            if isinstance(self.path_or_stream, BytesIO):
+                # Check if stream has content
+                self.valid = len(self.path_or_stream.getvalue()) > 0
+                _log.debug(
+                    f"BytesIO stream length: {len(self.path_or_stream.getvalue())}"
+                )
+            elif isinstance(self.path_or_stream, Path):
+                # Check if file exists
+                self.valid = self.path_or_stream.exists()
+                _log.debug(f"File exists: {self.valid}")
+            else:
+                self.valid = False
+        except Exception as e:
+            _log.error(f"NoOpBackend validation failed: {e}")
+            self.valid = False
+
+    def is_valid(self) -> bool:
+        return self.valid
+
+    @classmethod
+    def supports_pagination(cls) -> bool:
+        return False
+
+    @classmethod
+    def supported_formats(cls) -> Set[InputFormat]:
+        return set(InputFormat)
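Review note: the only real work in `NoOpBackend` is the presence check in `__init__`. As a quick way to reason about it in isolation, here is a stdlib-only sketch of the same rule. The function name `input_is_present` is hypothetical and the logging is omitted.

```python
from io import BytesIO
from pathlib import Path
from typing import Union


def input_is_present(path_or_stream: Union[BytesIO, Path]) -> bool:
    """Mirror the NoOpBackend rule: non-empty stream, or a path that exists."""
    if isinstance(path_or_stream, BytesIO):
        # getvalue() returns the whole buffer regardless of the read position.
        return len(path_or_stream.getvalue()) > 0
    if isinstance(path_or_stream, Path):
        return path_or_stream.exists()
    return False  # any other type is treated as invalid


print(input_is_present(BytesIO(b"RIFF")))  # True
print(input_is_present(BytesIO(b"")))  # False
```

Note that the check says nothing about whether the bytes are decodable audio. In the diff that is left to the ASR pipeline.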
@@ -29,6 +29,15 @@ from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBacke
 from docling.backend.pdf_backend import PdfDocumentBackend
 from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
 from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
+from docling.datamodel.asr_model_specs import (
+    WHISPER_BASE,
+    WHISPER_LARGE,
+    WHISPER_MEDIUM,
+    WHISPER_SMALL,
+    WHISPER_TINY,
+    WHISPER_TURBO,
+    AsrModelType,
+)
 from docling.datamodel.base_models import (
     ConversionStatus,
     FormatToExtensions,
@@ -37,12 +46,14 @@ from docling.datamodel.base_models import (
 )
 from docling.datamodel.document import ConversionResult
 from docling.datamodel.pipeline_options import (
+    AsrPipelineOptions,
     EasyOcrOptions,
     OcrOptions,
     PaginatedPipelineOptions,
     PdfBackend,
-    PdfPipeline,
     PdfPipelineOptions,
+    PipelineOptions,
+    ProcessingPipeline,
     TableFormerMode,
     VlmPipelineOptions,
 )
@@ -54,8 +65,14 @@ from docling.datamodel.vlm_model_specs import (
     SMOLDOCLING_TRANSFORMERS,
     VlmModelType,
 )
-from docling.document_converter import DocumentConverter, FormatOption, PdfFormatOption
+from docling.document_converter import (
+    AudioFormatOption,
+    DocumentConverter,
+    FormatOption,
+    PdfFormatOption,
+)
 from docling.models.factories import get_ocr_factory
+from docling.pipeline.asr_pipeline import AsrPipeline
 from docling.pipeline.vlm_pipeline import VlmPipeline

 warnings.filterwarnings(action="ignore", category=UserWarning, module="pydantic|torch")
@@ -296,13 +313,17 @@ def convert( # noqa: C901
         ),
     ] = ImageRefMode.EMBEDDED,
     pipeline: Annotated[
-        PdfPipeline,
+        ProcessingPipeline,
         typer.Option(..., help="Choose the pipeline to process PDF or image files."),
-    ] = PdfPipeline.STANDARD,
+    ] = ProcessingPipeline.STANDARD,
     vlm_model: Annotated[
         VlmModelType,
         typer.Option(..., help="Choose the VLM model to use with PDF or image files."),
     ] = VlmModelType.SMOLDOCLING,
+    asr_model: Annotated[
+        AsrModelType,
+        typer.Option(..., help="Choose the ASR model to use with audio/video files."),
+    ] = AsrModelType.WHISPER_TINY,
     ocr: Annotated[
         bool,
         typer.Option(
@@ -450,12 +471,14 @@ def convert( # noqa: C901
         ),
     ] = None,
 ):
+    log_format = "%(asctime)s\t%(levelname)s\t%(name)s: %(message)s"
+
     if verbose == 0:
-        logging.basicConfig(level=logging.WARNING)
+        logging.basicConfig(level=logging.WARNING, format=log_format)
     elif verbose == 1:
-        logging.basicConfig(level=logging.INFO)
+        logging.basicConfig(level=logging.INFO, format=log_format)
     else:
-        logging.basicConfig(level=logging.DEBUG)
+        logging.basicConfig(level=logging.DEBUG, format=log_format)

     settings.debug.visualize_cells = debug_visualize_cells
     settings.debug.visualize_layout = debug_visualize_layout
@@ -530,9 +553,12 @@ def convert( # noqa: C901
         ocr_options.lang = ocr_lang_list

     accelerator_options = AcceleratorOptions(num_threads=num_threads, device=device)
-    pipeline_options: PaginatedPipelineOptions
+    # pipeline_options: PaginatedPipelineOptions
+    pipeline_options: PipelineOptions

-    if pipeline == PdfPipeline.STANDARD:
+    format_options: Dict[InputFormat, FormatOption] = {}
+
+    if pipeline == ProcessingPipeline.STANDARD:
         pipeline_options = PdfPipelineOptions(
             allow_external_plugins=allow_external_plugins,
             enable_remote_services=enable_remote_services,
@@ -574,7 +600,13 @@ def convert( # noqa: C901
             pipeline_options=pipeline_options,
             backend=backend,  # pdf_backend
         )
-    elif pipeline == PdfPipeline.VLM:
+
+        format_options = {
+            InputFormat.PDF: pdf_format_option,
+            InputFormat.IMAGE: pdf_format_option,
+        }
+
+    elif pipeline == ProcessingPipeline.VLM:
         pipeline_options = VlmPipelineOptions(
             enable_remote_services=enable_remote_services,
         )
@@ -600,13 +632,48 @@ def convert( # noqa: C901
             pipeline_cls=VlmPipeline, pipeline_options=pipeline_options
         )

-    if artifacts_path is not None:
-        pipeline_options.artifacts_path = artifacts_path
-
-    format_options: Dict[InputFormat, FormatOption] = {
+        format_options = {
             InputFormat.PDF: pdf_format_option,
             InputFormat.IMAGE: pdf_format_option,
         }
+
+    elif pipeline == ProcessingPipeline.ASR:
+        pipeline_options = AsrPipelineOptions(
+            # enable_remote_services=enable_remote_services,
+            # artifacts_path = artifacts_path
+        )
+
+        if asr_model == AsrModelType.WHISPER_TINY:
+            pipeline_options.asr_options = WHISPER_TINY
+        elif asr_model == AsrModelType.WHISPER_SMALL:
+            pipeline_options.asr_options = WHISPER_SMALL
+        elif asr_model == AsrModelType.WHISPER_MEDIUM:
+            pipeline_options.asr_options = WHISPER_MEDIUM
+        elif asr_model == AsrModelType.WHISPER_BASE:
+            pipeline_options.asr_options = WHISPER_BASE
+        elif asr_model == AsrModelType.WHISPER_LARGE:
+            pipeline_options.asr_options = WHISPER_LARGE
+        elif asr_model == AsrModelType.WHISPER_TURBO:
+            pipeline_options.asr_options = WHISPER_TURBO
+        else:
+            _log.error(f"{asr_model} is not known")
+            raise ValueError(f"{asr_model} is not known")
+
+        _log.info(f"pipeline_options: {pipeline_options}")
+
+        audio_format_option = AudioFormatOption(
+            pipeline_cls=AsrPipeline,
+            pipeline_options=pipeline_options,
+        )
+
+        format_options = {
+            InputFormat.AUDIO: audio_format_option,
+        }
+
+    if artifacts_path is not None:
+        pipeline_options.artifacts_path = artifacts_path
+        # audio_pipeline_options.artifacts_path = artifacts_path
+
     doc_converter = DocumentConverter(
         allowed_formats=from_formats,
         format_options=format_options,
@@ -614,6 +681,7 @@ def convert( # noqa: C901

     start_time = time.time()

+    _log.info(f"paths: {input_doc_paths}")
     conv_results = doc_converter.convert_all(
         input_doc_paths, headers=parsed_headers, raises_on_error=abort_on_error
     )
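Review note: the `if`/`elif` chain that maps `asr_model` to a `WHISPER_*` preset in the CLI change above could also be written as a lookup table. The sketch below is one possible shape, not what the commit does. To stay self-contained it redefines `AsrModelType` locally and uses the preset's `repo_id` string as a stand-in for the full options object. `ASR_PRESETS` and `resolve_asr_preset` are hypothetical names.

```python
from enum import Enum
from typing import Dict


class AsrModelType(str, Enum):
    # Local copy of the enum from asr_model_specs.py, for illustration only.
    WHISPER_TINY = "whisper_tiny"
    WHISPER_SMALL = "whisper_small"
    WHISPER_MEDIUM = "whisper_medium"
    WHISPER_BASE = "whisper_base"
    WHISPER_LARGE = "whisper_large"
    WHISPER_TURBO = "whisper_turbo"


# Stand-in values: the repo_id of each WHISPER_* preset in the diff.
ASR_PRESETS: Dict[AsrModelType, str] = {
    AsrModelType.WHISPER_TINY: "tiny",
    AsrModelType.WHISPER_SMALL: "small",
    AsrModelType.WHISPER_MEDIUM: "medium",
    AsrModelType.WHISPER_BASE: "base",
    AsrModelType.WHISPER_LARGE: "large",
    AsrModelType.WHISPER_TURBO: "turbo",
}


def resolve_asr_preset(model: AsrModelType) -> str:
    """Look up the preset for a model, failing the same way as the CLI chain."""
    preset = ASR_PRESETS.get(model)
    if preset is None:
        raise ValueError(f"{model} is not known")
    return preset


print(resolve_asr_preset(AsrModelType.WHISPER_TURBO))  # turbo
print(resolve_asr_preset(AsrModelType("whisper_base")))  # base
```

A table like this makes it easy to assert that every enum member has a preset, which the chained comparisons cannot express directly.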
92
docling/datamodel/asr_model_specs.py
Normal file
@ -0,0 +1,92 @@
|
||||
import logging
|
||||
from enum import Enum
|
||||
|
||||
from pydantic import (
|
||||
AnyUrl,
|
||||
)
|
||||
|
||||
from docling.datamodel.accelerator_options import AcceleratorDevice
|
||||
from docling.datamodel.pipeline_options_asr_model import (
|
||||
# AsrResponseFormat,
|
||||
# ApiAsrOptions,
|
||||
InferenceAsrFramework,
|
||||
InlineAsrNativeWhisperOptions,
|
||||
TransformersModelType,
|
||||
)
|
||||
|
||||
_log = logging.getLogger(__name__)
|
||||
|
||||
WHISPER_TINY = InlineAsrNativeWhisperOptions(
|
||||
repo_id="tiny",
|
||||
inference_framework=InferenceAsrFramework.WHISPER,
|
||||
verbose=True,
|
||||
timestamps=True,
|
||||
word_timestamps=True,
|
||||
temperatue=0.0,
|
||||
max_new_tokens=256,
|
||||
max_time_chunk=30.0,
|
||||
)
|
||||
|
||||
WHISPER_SMALL = InlineAsrNativeWhisperOptions(
|
||||
repo_id="small",
|
||||
inference_framework=InferenceAsrFramework.WHISPER,
|
||||
verbose=True,
|
||||
timestamps=True,
|
||||
word_timestamps=True,
|
||||
temperatue=0.0,
|
||||
max_new_tokens=256,
|
||||
max_time_chunk=30.0,
|
||||
)
|
||||
|
||||
WHISPER_MEDIUM = InlineAsrNativeWhisperOptions(
|
||||
repo_id="medium",
|
||||
inference_framework=InferenceAsrFramework.WHISPER,
|
||||
verbose=True,
|
||||
timestamps=True,
|
||||
word_timestamps=True,
|
||||
temperature=0.0,
|
||||
max_new_tokens=256,
|
||||
max_time_chunk=30.0,
|
||||
)
|
||||
|
||||
WHISPER_BASE = InlineAsrNativeWhisperOptions(
|
||||
repo_id="base",
|
||||
inference_framework=InferenceAsrFramework.WHISPER,
|
||||
verbose=True,
|
||||
timestamps=True,
|
||||
word_timestamps=True,
|
||||
temperature=0.0,
|
||||
max_new_tokens=256,
|
||||
max_time_chunk=30.0,
|
||||
)
|
||||
|
||||
WHISPER_LARGE = InlineAsrNativeWhisperOptions(
|
||||
repo_id="large",
|
||||
inference_framework=InferenceAsrFramework.WHISPER,
|
||||
verbose=True,
|
||||
timestamps=True,
|
||||
word_timestamps=True,
|
||||
temperature=0.0,
|
||||
max_new_tokens=256,
|
||||
max_time_chunk=30.0,
|
||||
)
|
||||
|
||||
WHISPER_TURBO = InlineAsrNativeWhisperOptions(
|
||||
repo_id="turbo",
|
||||
inference_framework=InferenceAsrFramework.WHISPER,
|
||||
verbose=True,
|
||||
timestamps=True,
|
||||
word_timestamps=True,
|
||||
temperature=0.0,
|
||||
max_new_tokens=256,
|
||||
max_time_chunk=30.0,
|
||||
)
|
||||
|
||||
|
||||
class AsrModelType(str, Enum):
|
||||
WHISPER_TINY = "whisper_tiny"
|
||||
WHISPER_SMALL = "whisper_small"
|
||||
WHISPER_MEDIUM = "whisper_medium"
|
||||
WHISPER_BASE = "whisper_base"
|
||||
WHISPER_LARGE = "whisper_large"
|
||||
WHISPER_TURBO = "whisper_turbo"
|
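Reviewer note: the six presets above differ only in `repo_id`, and each `AsrModelType` value is that name with a `whisper_` prefix. A minimal sketch of resolving an enum value to a preset follows. `WhisperPreset`, `PRESETS`, and `preset_for` are hypothetical names used for illustration only and are not part of docling; the real presets are `InlineAsrNativeWhisperOptions` instances. The sketch assumes Python 3.9+ for `str.removeprefix`.

```python
from dataclasses import dataclass
from enum import Enum


class AsrModelType(str, Enum):
    # Mirrors the enum added in asr_model_specs.py.
    WHISPER_TINY = "whisper_tiny"
    WHISPER_SMALL = "whisper_small"
    WHISPER_MEDIUM = "whisper_medium"
    WHISPER_BASE = "whisper_base"
    WHISPER_LARGE = "whisper_large"
    WHISPER_TURBO = "whisper_turbo"


@dataclass(frozen=True)
class WhisperPreset:
    # Hypothetical stand-in for InlineAsrNativeWhisperOptions (not docling API).
    repo_id: str
    temperature: float = 0.0
    max_new_tokens: int = 256
    max_time_chunk: float = 30.0


# One preset per enum member; the Whisper size name is the enum value
# without its "whisper_" prefix.
PRESETS = {
    member: WhisperPreset(repo_id=member.value.removeprefix("whisper_"))
    for member in AsrModelType
}


def preset_for(name: str) -> WhisperPreset:
    """Resolve a string such as "whisper_turbo" to its preset."""
    return PRESETS[AsrModelType(name)]
```

An unknown name raises `ValueError` from the enum constructor, which keeps the failure close to the bad input.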
@ -49,6 +49,7 @@ class InputFormat(str, Enum):
|
||||
XML_USPTO = "xml_uspto"
|
||||
XML_JATS = "xml_jats"
|
||||
JSON_DOCLING = "json_docling"
|
||||
AUDIO = "audio"
|
||||
|
||||
|
||||
class OutputFormat(str, Enum):
|
||||
@ -73,6 +74,7 @@ FormatToExtensions: Dict[InputFormat, List[str]] = {
|
||||
InputFormat.XLSX: ["xlsx", "xlsm"],
|
||||
InputFormat.XML_USPTO: ["xml", "txt"],
|
||||
InputFormat.JSON_DOCLING: ["json"],
|
||||
InputFormat.AUDIO: ["wav", "mp3"],
|
||||
}
|
||||
|
||||
FormatToMimeType: Dict[InputFormat, List[str]] = {
|
||||
@ -104,6 +106,7 @@ FormatToMimeType: Dict[InputFormat, List[str]] = {
|
||||
],
|
||||
InputFormat.XML_USPTO: ["application/xml", "text/plain"],
|
||||
InputFormat.JSON_DOCLING: ["application/json"],
|
||||
InputFormat.AUDIO: ["audio/x-wav", "audio/mpeg", "audio/wav", "audio/mp3"],
|
||||
}
|
||||
|
||||
MimeTypeToFormat: dict[str, list[InputFormat]] = {
|
||||
@ -298,7 +301,7 @@ class OpenAiChatMessage(BaseModel):
|
||||
class OpenAiResponseChoice(BaseModel):
|
||||
index: int
|
||||
message: OpenAiChatMessage
|
||||
finish_reason: str
|
||||
finish_reason: Optional[str]
|
||||
|
||||
|
||||
class OpenAiResponseUsage(BaseModel):
|
||||
|
@ -249,7 +249,7 @@ class _DocumentConversionInput(BaseModel):
|
||||
backend: Type[AbstractDocumentBackend]
|
||||
if format not in format_options.keys():
|
||||
_log.error(
|
||||
f"Input document {obj.name} does not match any allowed format."
|
||||
f"Input document {obj.name} with format {format} does not match any allowed format: ({format_options.keys()})"
|
||||
)
|
||||
backend = _DummyBackend
|
||||
else:
|
||||
@ -318,6 +318,8 @@ class _DocumentConversionInput(BaseModel):
|
||||
mime = mime or _DocumentConversionInput._detect_csv(content)
|
||||
mime = mime or "text/plain"
|
||||
formats = MimeTypeToFormat.get(mime, [])
|
||||
_log.info(f"detected formats: {formats}")
|
||||
|
||||
if formats:
|
||||
if len(formats) == 1 and mime not in ("text/plain",):
|
||||
return formats[0]
|
||||
|
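Reviewer note on membership guards such as the `text/plain` check in this hunk: in Python, parentheses around a single string do not create a tuple, so `in` performs a substring test unless a trailing comma is present. The snippet below is a plain-Python illustration and does not use any docling API.

```python
mime = "text"

as_string = ("text/plain")   # parentheses only: this is still a str
as_tuple = ("text/plain",)   # the trailing comma makes a one-element tuple

# Substring test: "text" occurs inside "text/plain".
substring_hit = mime in as_string

# Element test: "text" is not equal to any element of the tuple.
tuple_hit = mime in as_tuple
```

For exact MIME matching, the tuple form (or a direct `!=` comparison) is the one that behaves as intended.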
@ -11,8 +11,13 @@ from pydantic import (
|
||||
)
|
||||
from typing_extensions import deprecated
|
||||
|
||||
from docling.datamodel import asr_model_specs
|
||||
|
||||
# Import the following for backwards compatibility
|
||||
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
|
||||
from docling.datamodel.pipeline_options_asr_model import (
|
||||
InlineAsrOptions,
|
||||
)
|
||||
from docling.datamodel.pipeline_options_vlm_model import (
|
||||
ApiVlmOptions,
|
||||
InferenceFramework,
|
||||
@ -202,7 +207,7 @@ smolvlm_picture_description = PictureDescriptionVlmOptions(
|
||||
|
||||
# GraniteVision
|
||||
granite_picture_description = PictureDescriptionVlmOptions(
|
||||
repo_id="ibm-granite/granite-vision-3.1-2b-preview",
|
||||
repo_id="ibm-granite/granite-vision-3.2-2b-preview",
|
||||
prompt="What is shown in this image?",
|
||||
)
|
||||
|
||||
@ -260,6 +265,11 @@ class VlmPipelineOptions(PaginatedPipelineOptions):
|
||||
)
|
||||
|
||||
|
||||
class AsrPipelineOptions(PipelineOptions):
|
||||
asr_options: Union[InlineAsrOptions] = asr_model_specs.WHISPER_TINY
|
||||
artifacts_path: Optional[Union[Path, str]] = None
|
||||
|
||||
|
||||
class PdfPipelineOptions(PaginatedPipelineOptions):
|
||||
"""Options for the PDF pipeline."""
|
||||
|
||||
@ -297,6 +307,7 @@ class PdfPipelineOptions(PaginatedPipelineOptions):
|
||||
)
|
||||
|
||||
|
||||
class PdfPipeline(str, Enum):
|
||||
class ProcessingPipeline(str, Enum):
|
||||
STANDARD = "standard"
|
||||
VLM = "vlm"
|
||||
ASR = "asr"
|
||||
|
57
docling/datamodel/pipeline_options_asr_model.py
Normal file
@ -0,0 +1,57 @@
|
||||
from enum import Enum
|
||||
from typing import Any, Dict, List, Literal, Optional, Union
|
||||
|
||||
from pydantic import AnyUrl, BaseModel
|
||||
from typing_extensions import deprecated
|
||||
|
||||
from docling.datamodel.accelerator_options import AcceleratorDevice
|
||||
from docling.datamodel.pipeline_options_vlm_model import (
|
||||
# InferenceFramework,
|
||||
TransformersModelType,
|
||||
)
|
||||
|
||||
|
||||
class BaseAsrOptions(BaseModel):
|
||||
kind: str
|
||||
# prompt: str
|
||||
|
||||
|
||||
class InferenceAsrFramework(str, Enum):
|
||||
# MLX = "mlx" # disabled for now
|
||||
# TRANSFORMERS = "transformers" # disabled for now
|
||||
WHISPER = "whisper"
|
||||
|
||||
|
||||
class InlineAsrOptions(BaseAsrOptions):
|
||||
kind: Literal["inline_model_options"] = "inline_model_options"
|
||||
|
||||
repo_id: str
|
||||
|
||||
verbose: bool = False
|
||||
timestamps: bool = True
|
||||
|
||||
temperature: float = 0.0
|
||||
max_new_tokens: int = 256
|
||||
max_time_chunk: float = 30.0
|
||||
|
||||
torch_dtype: Optional[str] = None
|
||||
supported_devices: List[AcceleratorDevice] = [
|
||||
AcceleratorDevice.CPU,
|
||||
AcceleratorDevice.CUDA,
|
||||
AcceleratorDevice.MPS,
|
||||
]
|
||||
|
||||
@property
|
||||
def repo_cache_folder(self) -> str:
|
||||
return self.repo_id.replace("/", "--")
|
||||
|
||||
|
||||
class InlineAsrNativeWhisperOptions(InlineAsrOptions):
|
||||
inference_framework: InferenceAsrFramework = InferenceAsrFramework.WHISPER
|
||||
|
||||
language: str = "en"
|
||||
supported_devices: List[AcceleratorDevice] = [
|
||||
AcceleratorDevice.CPU,
|
||||
AcceleratorDevice.CUDA,
|
||||
]
|
||||
word_timestamps: bool = True
|
@ -19,6 +19,7 @@ from docling.backend.md_backend import MarkdownDocumentBackend
|
||||
from docling.backend.msexcel_backend import MsExcelDocumentBackend
|
||||
from docling.backend.mspowerpoint_backend import MsPowerpointDocumentBackend
|
||||
from docling.backend.msword_backend import MsWordDocumentBackend
|
||||
from docling.backend.noop_backend import NoOpBackend
|
||||
from docling.backend.xml.jats_backend import JatsDocumentBackend
|
||||
from docling.backend.xml.uspto_backend import PatentUsptoDocumentBackend
|
||||
from docling.datamodel.base_models import (
|
||||
@ -41,6 +42,7 @@ from docling.datamodel.settings import (
|
||||
settings,
|
||||
)
|
||||
from docling.exceptions import ConversionError
|
||||
from docling.pipeline.asr_pipeline import AsrPipeline
|
||||
from docling.pipeline.base_pipeline import BasePipeline
|
||||
from docling.pipeline.simple_pipeline import SimplePipeline
|
||||
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
|
||||
@ -118,6 +120,11 @@ class PdfFormatOption(FormatOption):
|
||||
backend: Type[AbstractDocumentBackend] = DoclingParseV4DocumentBackend
|
||||
|
||||
|
||||
class AudioFormatOption(FormatOption):
|
||||
pipeline_cls: Type = AsrPipeline
|
||||
backend: Type[AbstractDocumentBackend] = NoOpBackend
|
||||
|
||||
|
||||
def _get_default_option(format: InputFormat) -> FormatOption:
|
||||
format_to_default_options = {
|
||||
InputFormat.CSV: FormatOption(
|
||||
@ -156,6 +163,7 @@ def _get_default_option(format: InputFormat) -> FormatOption:
|
||||
InputFormat.JSON_DOCLING: FormatOption(
|
||||
pipeline_cls=SimplePipeline, backend=DoclingJSONBackend
|
||||
),
|
||||
InputFormat.AUDIO: FormatOption(pipeline_cls=AsrPipeline, backend=NoOpBackend),
|
||||
}
|
||||
if (options := format_to_default_options.get(format)) is not None:
|
||||
return options
|
||||
|
@ -124,7 +124,7 @@ class ReadingOrderModel:
|
||||
page_no = page.page_no + 1
|
||||
size = page.size
|
||||
|
||||
assert size is not None
|
||||
assert size is not None, "Page size is not initialized."
|
||||
|
||||
out_doc.add_page(page_no=page_no, size=size)
|
||||
|
||||
|
253
docling/pipeline/asr_pipeline.py
Normal file
@ -0,0 +1,253 @@
|
||||
import logging
|
||||
import os
|
||||
import re
|
||||
from io import BytesIO
|
||||
from pathlib import Path
|
||||
from typing import List, Optional, Union, cast
|
||||
|
||||
from docling_core.types.doc import DoclingDocument, DocumentOrigin
|
||||
|
||||
# import whisper # type: ignore
|
||||
# import librosa
|
||||
# import numpy as np
|
||||
# import soundfile as sf # type: ignore
|
||||
from docling_core.types.doc.labels import DocItemLabel
|
||||
from pydantic import BaseModel, Field, validator
|
||||
|
||||
from docling.backend.abstract_backend import AbstractDocumentBackend
|
||||
from docling.backend.noop_backend import NoOpBackend
|
||||
|
||||
# from pydub import AudioSegment # type: ignore
|
||||
# from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline
|
||||
from docling.datamodel.accelerator_options import (
|
||||
AcceleratorOptions,
|
||||
)
|
||||
from docling.datamodel.base_models import (
|
||||
ConversionStatus,
|
||||
FormatToMimeType,
|
||||
)
|
||||
from docling.datamodel.document import ConversionResult, InputDocument
|
||||
from docling.datamodel.pipeline_options import (
|
||||
AsrPipelineOptions,
|
||||
)
|
||||
from docling.datamodel.pipeline_options_asr_model import (
|
||||
InlineAsrNativeWhisperOptions,
|
||||
# AsrResponseFormat,
|
||||
InlineAsrOptions,
|
||||
)
|
||||
from docling.datamodel.pipeline_options_vlm_model import (
|
||||
InferenceFramework,
|
||||
)
|
||||
from docling.datamodel.settings import settings
|
||||
from docling.pipeline.base_pipeline import BasePipeline
|
||||
from docling.utils.accelerator_utils import decide_device
|
||||
from docling.utils.profiling import ProfilingScope, TimeRecorder
|
||||
|
||||
_log = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class _ConversationWord(BaseModel):
|
||||
text: str
|
||||
start_time: Optional[float] = Field(
|
||||
None, description="Start time in seconds from audio start"
|
||||
)
|
||||
end_time: Optional[float] = Field(
|
||||
None, ge=0, description="End time in seconds from audio start"
|
||||
)
|
||||
|
||||
|
||||
class _ConversationItem(BaseModel):
|
||||
text: str
|
||||
start_time: Optional[float] = Field(
|
||||
None, description="Start time in seconds from audio start"
|
||||
)
|
||||
end_time: Optional[float] = Field(
|
||||
None, ge=0, description="End time in seconds from audio start"
|
||||
)
|
||||
speaker_id: Optional[int] = Field(None, description="Numeric speaker identifier")
|
||||
speaker: Optional[str] = Field(
|
||||
None, description="Speaker name, defaults to speaker-{speaker_id}"
|
||||
)
|
||||
words: Optional[list[_ConversationWord]] = Field(
|
||||
None, description="Individual words with time-stamps"
|
||||
)
|
||||
|
||||
def __lt__(self, other):
|
||||
if not isinstance(other, _ConversationItem):
|
||||
return NotImplemented
|
||||
return self.start_time < other.start_time
|
||||
|
||||
def __eq__(self, other):
|
||||
if not isinstance(other, _ConversationItem):
|
||||
return NotImplemented
|
||||
return self.start_time == other.start_time
|
||||
|
||||
def to_string(self) -> str:
|
||||
"""Format the conversation entry as a string"""
|
||||
result = ""
|
||||
if (self.start_time is not None) and (self.end_time is not None):
|
||||
result += f"[time: {self.start_time}-{self.end_time}] "
|
||||
|
||||
if self.speaker is not None:
|
||||
result += f"[speaker:{self.speaker}] "
|
||||
|
||||
result += self.text
|
||||
return result
|
||||
|
||||
|
||||
class _NativeWhisperModel:
|
||||
def __init__(
|
||||
self,
|
||||
enabled: bool,
|
||||
artifacts_path: Optional[Path],
|
||||
accelerator_options: AcceleratorOptions,
|
||||
asr_options: InlineAsrNativeWhisperOptions,
|
||||
):
|
||||
"""
|
||||
Transcriber using native Whisper.
|
||||
"""
|
||||
self.enabled = enabled
|
||||
|
||||
_log.info(f"artifacts-path: {artifacts_path}")
|
||||
_log.info(f"accelerator_options: {accelerator_options}")
|
||||
|
||||
if self.enabled:
|
||||
try:
|
||||
import whisper # type: ignore
|
||||
except ImportError:
|
||||
raise ImportError(
|
||||
"whisper is not installed. Please install it via `pip install openai-whisper` or do `uv sync --extra asr`."
|
||||
)
|
||||
self.asr_options = asr_options
|
||||
self.max_tokens = asr_options.max_new_tokens
|
||||
self.temperature = asr_options.temperature
|
||||
|
||||
self.device = decide_device(
|
||||
accelerator_options.device,
|
||||
supported_devices=asr_options.supported_devices,
|
||||
)
|
||||
_log.info(f"Available device for Whisper: {self.device}")
|
||||
|
||||
self.model_name = asr_options.repo_id
|
||||
_log.info(f"loading _NativeWhisperModel({self.model_name})")
|
||||
if artifacts_path is not None:
|
||||
_log.info(f"loading {self.model_name} from {artifacts_path}")
|
||||
self.model = whisper.load_model(
|
||||
name=self.model_name,
|
||||
device=self.device,
|
||||
download_root=str(artifacts_path),
|
||||
)
|
||||
else:
|
||||
self.model = whisper.load_model(
|
||||
name=self.model_name, device=self.device
|
||||
)
|
||||
|
||||
self.verbose = asr_options.verbose
|
||||
self.timestamps = asr_options.timestamps
|
||||
self.word_timestamps = asr_options.word_timestamps
|
||||
|
||||
def run(self, conv_res: ConversionResult) -> ConversionResult:
|
||||
audio_path: Path = Path(conv_res.input.file).resolve()
|
||||
|
||||
try:
|
||||
conversation = self.transcribe(audio_path)
|
||||
|
||||
# Ensure we have a proper DoclingDocument
|
||||
origin = DocumentOrigin(
|
||||
filename=conv_res.input.file.name or "audio.wav",
|
||||
mimetype="audio/x-wav",
|
||||
binary_hash=conv_res.input.document_hash,
|
||||
)
|
||||
conv_res.document = DoclingDocument(
|
||||
name=conv_res.input.file.stem or "audio.wav", origin=origin
|
||||
)
|
||||
|
||||
for citem in conversation:
|
||||
conv_res.document.add_text(
|
||||
label=DocItemLabel.TEXT, text=citem.to_string()
|
||||
)
|
||||
|
||||
conv_res.status = ConversionStatus.SUCCESS
|
||||
return conv_res
|
||||
|
||||
except Exception as exc:
|
||||
_log.error(f"Audio transcription failed: {exc}")
|
||||
|
||||
conv_res.status = ConversionStatus.FAILURE
|
||||
return conv_res
|
||||
|
||||
def transcribe(self, fpath: Path) -> list[_ConversationItem]:
|
||||
result = self.model.transcribe(
|
||||
str(fpath), verbose=self.verbose, word_timestamps=self.word_timestamps
|
||||
)
|
||||
|
||||
convo: list[_ConversationItem] = []
|
||||
for segment in result["segments"]:
|
||||
item = _ConversationItem(
|
||||
start_time=segment["start"], end_time=segment["end"], text=segment["text"], words=[]
|
||||
)
|
||||
if "words" in segment and self.word_timestamps:
|
||||
item.words = []
|
||||
for word in segment["words"]:
|
||||
item.words.append(
|
||||
_ConversationWord(
|
||||
start_time=word["start"],
|
||||
|
||||
end_time=word["end"],
|
||||
|
||||
text=word["word"],
|
||||
)
|
||||
)
|
||||
convo.append(item)
|
||||
|
||||
return convo
|
||||
|
||||
|
||||
class AsrPipeline(BasePipeline):
|
||||
def __init__(self, pipeline_options: AsrPipelineOptions):
|
||||
super().__init__(pipeline_options)
|
||||
self.keep_backend = True
|
||||
|
||||
self.pipeline_options: AsrPipelineOptions = pipeline_options
|
||||
|
||||
artifacts_path: Optional[Path] = None
|
||||
if pipeline_options.artifacts_path is not None:
|
||||
artifacts_path = Path(pipeline_options.artifacts_path).expanduser()
|
||||
elif settings.artifacts_path is not None:
|
||||
artifacts_path = Path(settings.artifacts_path).expanduser()
|
||||
|
||||
if artifacts_path is not None and not artifacts_path.is_dir():
|
||||
raise RuntimeError(
|
||||
f"The value of {artifacts_path=} is not valid. "
|
||||
"When defined, it must point to a folder containing all models required by the pipeline."
|
||||
)
|
||||
|
||||
if isinstance(self.pipeline_options.asr_options, InlineAsrNativeWhisperOptions):
|
||||
asr_options: InlineAsrNativeWhisperOptions = (
|
||||
self.pipeline_options.asr_options
|
||||
)
|
||||
self._model = _NativeWhisperModel(
|
||||
enabled=True, # must be always enabled for this pipeline to make sense.
|
||||
artifacts_path=artifacts_path,
|
||||
accelerator_options=pipeline_options.accelerator_options,
|
||||
asr_options=asr_options,
|
||||
)
|
||||
else:
|
||||
_log.error(f"No model support for {self.pipeline_options.asr_options}")
|
||||
|
||||
def _determine_status(self, conv_res: ConversionResult) -> ConversionStatus:
|
||||
status = ConversionStatus.SUCCESS
|
||||
return status
|
||||
|
||||
@classmethod
|
||||
def get_default_options(cls) -> AsrPipelineOptions:
|
||||
return AsrPipelineOptions()
|
||||
|
||||
def _build_document(self, conv_res: ConversionResult) -> ConversionResult:
|
||||
_log.info(f"start _build_document in AsrPipeline: {conv_res.input.file}")
|
||||
with TimeRecorder(conv_res, "doc_build", scope=ProfilingScope.DOCUMENT):
|
||||
self._model.run(conv_res=conv_res)
|
||||
|
||||
return conv_res
|
||||
|
||||
@classmethod
|
||||
def is_backend_supported(cls, backend: AbstractDocumentBackend):
|
||||
return isinstance(backend, NoOpBackend)
|
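Reviewer note: the pipeline above turns each transcription segment into one text item via `to_string()`. The sketch below reproduces that formatting with the standard library only. `Segment` is a hypothetical stand-in for the private `_ConversationItem` model, and `fake_result` assumes the `segments` list shape (`start`, `end`, `text`) that the `transcribe` method above reads; no model is loaded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    # Hypothetical stand-in for _ConversationItem (not docling API).
    text: str
    start_time: Optional[float] = None
    end_time: Optional[float] = None
    speaker: Optional[str] = None

    def to_string(self) -> str:
        """Prefix the text with optional time and speaker tags."""
        parts = []
        if self.start_time is not None and self.end_time is not None:
            parts.append(f"[time: {self.start_time}-{self.end_time}]")
        if self.speaker is not None:
            parts.append(f"[speaker:{self.speaker}]")
        parts.append(self.text)
        return " ".join(parts)


# Assumed shape of a transcription result; the values are illustrative.
fake_result = {
    "segments": [
        {"start": 0.0, "end": 4.0, "text": "Shakespeare on Scenery by Oscar Wilde"},
        {"start": 5.28, "end": 9.96, "text": "This is a LibriVox recording."},
    ]
}

lines = [
    Segment(text=s["text"], start_time=s["start"], end_time=s["end"]).to_string()
    for s in fake_result["segments"]
]
```

Each entry in `lines` corresponds to one `add_text` call in the pipeline's `run` method.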
@ -193,6 +193,17 @@ class PaginatedPipeline(BasePipeline): # TODO this is a bad name.
|
||||
)
|
||||
raise e
|
||||
|
||||
# Filter out uninitialized pages (those with size=None) that may remain
|
||||
# after timeout or processing failures to prevent assertion errors downstream
|
||||
initial_page_count = len(conv_res.pages)
|
||||
conv_res.pages = [page for page in conv_res.pages if page.size is not None]
|
||||
|
||||
if len(conv_res.pages) < initial_page_count:
|
||||
_log.info(
|
||||
f"Filtered out {initial_page_count - len(conv_res.pages)} uninitialized pages "
|
||||
f"due to timeout or processing failures"
|
||||
)
|
||||
|
||||
return conv_res
|
||||
|
||||
def _unload(self, conv_res: ConversionResult) -> ConversionResult:
|
||||
|
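Reviewer note: the hunk above drops pages whose `size` was never set and logs how many were removed. A minimal sketch of that filter-and-count step follows. `Page` and `drop_uninitialized` are hypothetical names for illustration and are not docling's own types.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Page:
    # Hypothetical stand-in for a pipeline page (not docling API).
    page_no: int
    size: Optional[Tuple[float, float]] = None


def drop_uninitialized(pages: List[Page]) -> Tuple[List[Page], int]:
    """Keep pages with a known size and report how many were dropped."""
    kept = [page for page in pages if page.size is not None]
    return kept, len(pages) - len(kept)


pages = [Page(0, (612.0, 792.0)), Page(1), Page(2, (612.0, 792.0))]
kept, dropped = drop_uninitialized(pages)
```

Returning the dropped count alongside the list makes the log message a one-liner at the call site.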
11
docs/examples/batch_convert.py
vendored
@ -121,14 +121,15 @@ def export_documents(
|
||||
def main():
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
|
||||
data_folder = Path(__file__).parent / "../../tests/data"
|
||||
input_doc_paths = [
|
||||
Path("./tests/data/pdf/2206.01062.pdf"),
|
||||
Path("./tests/data/pdf/2203.01017v2.pdf"),
|
||||
Path("./tests/data/pdf/2305.03393v1.pdf"),
|
||||
Path("./tests/data/pdf/redp5110_sampled.pdf"),
|
||||
data_folder / "pdf/2206.01062.pdf",
|
||||
data_folder / "pdf/2203.01017v2.pdf",
|
||||
data_folder / "pdf/2305.03393v1.pdf",
|
||||
data_folder / "pdf/redp5110_sampled.pdf",
|
||||
]
|
||||
|
||||
# buf = BytesIO(Path("./test/data/2206.01062.pdf").open("rb").read())
|
||||
# buf = BytesIO((data_folder / "pdf/2206.01062.pdf").open("rb").read())
|
||||
# docs = [DocumentStream(name="my_doc.pdf", stream=buf)]
|
||||
# input = DocumentConversionInput.from_streams(docs)
|
||||
|
||||
|
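Reviewer note: the examples in this commit switch from paths relative to the working directory to paths relative to the script file, so they run from any directory. The sketch below shows the same expression plus a lexical normalization step. `data_folder_for` is a hypothetical helper for illustration; the examples themselves use the un-normalized `Path(__file__).parent / "../../tests/data"` directly, which the operating system resolves when the file is opened.

```python
import os.path
from pathlib import Path


def data_folder_for(script_path: Path) -> Path:
    """Return the tests/data folder two levels above the script's directory."""
    raw = script_path.parent / "../../tests/data"
    # normpath collapses the ".." components without touching the filesystem.
    return Path(os.path.normpath(raw))


folder = data_folder_for(Path("/repo/docs/examples/batch_convert.py"))
pdf_path = folder / "pdf/2206.01062.pdf"
```

In a real script, `Path(__file__)` would be passed in place of the literal path used here.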
3
docs/examples/custom_convert.py
vendored
@ -16,7 +16,8 @@ _log = logging.getLogger(__name__)
|
||||
def main():
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
|
||||
input_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
|
||||
data_folder = Path(__file__).parent / "../../tests/data"
|
||||
input_doc_path = data_folder / "pdf/2206.01062.pdf"
|
||||
|
||||
###########################################################################
|
||||
|
||||
|
@ -71,7 +71,8 @@ class ExampleFormulaUnderstandingPipeline(StandardPdfPipeline):
|
||||
def main():
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
|
||||
input_doc_path = Path("./tests/data/pdf/2203.01017v2.pdf")
|
||||
data_folder = Path(__file__).parent / "../../tests/data"
|
||||
input_doc_path = data_folder / "pdf/2203.01017v2.pdf"
|
||||
|
||||
pipeline_options = ExampleFormulaUnderstandingPipelineOptions()
|
||||
pipeline_options.do_formula_understanding = True
|
||||
|
3
docs/examples/develop_picture_enrichment.py
vendored
@ -76,7 +76,8 @@ class ExamplePictureClassifierPipeline(StandardPdfPipeline):
|
||||
def main():
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
|
||||
input_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
|
||||
data_folder = Path(__file__).parent / "../../tests/data"
|
||||
input_doc_path = data_folder / "pdf/2206.01062.pdf"
|
||||
|
||||
pipeline_options = ExamplePictureClassifierPipelineOptions()
|
||||
pipeline_options.images_scale = 2.0
|
||||
|
3
docs/examples/export_figures.py
vendored
@ -16,7 +16,8 @@ IMAGE_RESOLUTION_SCALE = 2.0
|
||||
def main():
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
|
||||
input_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
|
||||
data_folder = Path(__file__).parent / "../../tests/data"
|
||||
input_doc_path = data_folder / "pdf/2206.01062.pdf"
|
||||
output_dir = Path("scratch")
|
||||
|
||||
# Important: For operating with page images, we must keep them, otherwise the DocumentConverter
|
||||
|
3
docs/examples/export_multimodal.py
vendored
@ -19,7 +19,8 @@ IMAGE_RESOLUTION_SCALE = 2.0
|
||||
def main():
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
|
||||
input_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
|
||||
data_folder = Path(__file__).parent / "../../tests/data"
|
||||
input_doc_path = data_folder / "pdf/2206.01062.pdf"
|
||||
output_dir = Path("scratch")
|
||||
|
||||
# Important: For operating with page images, we must keep them, otherwise the DocumentConverter
|
||||
|
3
docs/examples/export_tables.py
vendored
@ -12,7 +12,8 @@ _log = logging.getLogger(__name__)
|
||||
def main():
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
|
||||
input_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
|
||||
data_folder = Path(__file__).parent / "../../tests/data"
|
||||
input_doc_path = data_folder / "pdf/2206.01062.pdf"
|
||||
output_dir = Path("scratch")
|
||||
|
||||
doc_converter = DocumentConverter()
|
||||
|
5
docs/examples/full_page_ocr.py
vendored
@ -9,7 +9,8 @@ from docling.document_converter import DocumentConverter, PdfFormatOption
|
||||
|
||||
|
||||
def main():
|
||||
input_doc = Path("./tests/data/pdf/2206.01062.pdf")
|
||||
data_folder = Path(__file__).parent / "../../tests/data"
|
||||
input_doc_path = data_folder / "pdf/2206.01062.pdf"
|
||||
|
||||
pipeline_options = PdfPipelineOptions()
|
||||
pipeline_options.do_ocr = True
|
||||
@ -32,7 +33,7 @@ def main():
|
||||
}
|
||||
)
|
||||
|
||||
doc = converter.convert(input_doc).document
|
||||
doc = converter.convert(input_doc_path).document
|
||||
md = doc.export_to_markdown()
|
||||
print(md)
|
||||
|
||||
|
56
docs/examples/minimal_asr_pipeline.py
vendored
Normal file
@ -0,0 +1,56 @@
|
||||
from pathlib import Path
|
||||
|
||||
from docling_core.types.doc import DoclingDocument
|
||||
|
||||
from docling.datamodel import asr_model_specs
|
||||
from docling.datamodel.base_models import ConversionStatus, InputFormat
|
||||
from docling.datamodel.document import ConversionResult
|
||||
from docling.datamodel.pipeline_options import AsrPipelineOptions
|
||||
from docling.document_converter import AudioFormatOption, DocumentConverter
|
||||
from docling.pipeline.asr_pipeline import AsrPipeline
|
||||
|
||||
|
||||
def get_asr_converter():
|
||||
"""Create a DocumentConverter configured for ASR with whisper_turbo model."""
|
||||
pipeline_options = AsrPipelineOptions()
|
||||
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO
|
||||
|
||||
converter = DocumentConverter(
|
||||
format_options={
|
||||
InputFormat.AUDIO: AudioFormatOption(
|
||||
pipeline_cls=AsrPipeline,
|
||||
pipeline_options=pipeline_options,
|
||||
)
|
||||
}
|
||||
)
|
||||
return converter
|
||||
|
||||
|
||||
def asr_pipeline_conversion(audio_path: Path) -> DoclingDocument:
|
||||
"""ASR pipeline conversion using whisper_turbo"""
|
||||
# Check if the test audio file exists
|
||||
assert audio_path.exists(), f"Test audio file not found: {audio_path}"
|
||||
|
||||
converter = get_asr_converter()
|
||||
|
||||
# Convert the audio file
|
||||
result: ConversionResult = converter.convert(audio_path)
|
||||
|
||||
# Verify conversion was successful
|
||||
assert result.status == ConversionStatus.SUCCESS, (
|
||||
f"Conversion failed with status: {result.status}"
|
||||
)
|
||||
return result.document
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
audio_path = Path("tests/data/audio/sample_10s.mp3")
|
||||
|
||||
doc = asr_pipeline_conversion(audio_path=audio_path)
|
||||
print(doc.export_to_markdown())
|
||||
|
||||
# Expected output:
|
||||
#
|
||||
# [time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde
|
||||
#
|
||||
# [time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain.
|
3
docs/examples/pictures_description_api.py
vendored
@ -96,7 +96,8 @@ def watsonx_vlm_options():
|
||||
def main():
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
|
||||
input_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
|
||||
data_folder = Path(__file__).parent / "../../tests/data"
|
||||
input_doc_path = data_folder / "pdf/2206.01062.pdf"
|
||||
|
||||
pipeline_options = PdfPipelineOptions(
|
||||
enable_remote_services=True # <-- this is required!
|
||||
|
5
docs/examples/run_with_accelerator.py
vendored
@ -10,7 +10,8 @@ from docling.document_converter import DocumentConverter, PdfFormatOption
|
||||
|
||||
|
||||
def main():
|
||||
input_doc = Path("./tests/data/pdf/2206.01062.pdf")
|
||||
data_folder = Path(__file__).parent / "../../tests/data"
|
||||
input_doc_path = data_folder / "pdf/2206.01062.pdf"
|
||||
|
||||
# Explicitly set the accelerator
|
||||
# accelerator_options = AcceleratorOptions(
|
||||
@ -47,7 +48,7 @@ def main():
|
||||
settings.debug.profile_pipeline_timings = True
|
||||
|
||||
# Convert the document
|
||||
conversion_result = converter.convert(input_doc)
|
||||
conversion_result = converter.convert(input_doc_path)
|
||||
doc = conversion_result.document
|
||||
|
||||
# List with total time per document
|
||||
|
5
docs/examples/tesseract_lang_detection.py
vendored
@ -9,7 +9,8 @@ from docling.document_converter import DocumentConverter, PdfFormatOption
|
||||
|
||||
|
||||
def main():
|
||||
input_doc = Path("./tests/data/pdf/2206.01062.pdf")
|
||||
data_folder = Path(__file__).parent / "../../tests/data"
|
||||
input_doc_path = data_folder / "pdf/2206.01062.pdf"
|
||||
|
||||
# Set lang=["auto"] with a tesseract OCR engine: TesseractOcrOptions, TesseractCliOcrOptions
|
||||
# ocr_options = TesseractOcrOptions(lang=["auto"])
|
||||
@ -27,7 +28,7 @@ def main():
|
||||
}
|
||||
)
|
||||
|
||||
doc = converter.convert(input_doc).document
|
||||
doc = converter.convert(input_doc_path).document
|
||||
md = doc.export_to_markdown()
|
||||
print(md)
|
||||
|
||||
|
3
docs/examples/translate.py
vendored
@ -30,7 +30,8 @@ def translate(text: str, src: str = "en", dest: str = "de"):
|
||||
def main():
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
|
||||
input_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
|
||||
data_folder = Path(__file__).parent / "../../tests/data"
|
||||
input_doc_path = data_folder / "pdf/2206.01062.pdf"
|
||||
output_dir = Path("scratch")
|
||||
|
||||
# Important: For operating with page images, we must keep them, otherwise the DocumentConverter
|
||||
|
4
docs/examples/vlm_pipeline_api_model.py
vendored
@ -95,8 +95,8 @@ def watsonx_vlm_options(model: str, prompt: str):
|
||||
def main():
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
|
||||
# input_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
|
||||
input_doc_path = Path("./tests/data/pdf/2305.03393v1-pg9.pdf")
|
||||
data_folder = Path(__file__).parent / "../../tests/data"
|
||||
input_doc_path = data_folder / "pdf/2305.03393v1-pg9.pdf"
|
||||
|
||||
pipeline_options = VlmPipelineOptions(
|
||||
enable_remote_services=True # <-- this is required!
|
||||
|
7
docs/index.md
vendored
@ -20,14 +20,15 @@ Docling simplifies document processing, parsing diverse formats — including ad
|
||||
|
||||
## Features
|
||||
|
||||
* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
|
||||
* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, images (PNG, TIFF, JPEG, ...), and more
|
||||
* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
|
||||
* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
|
||||
* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
|
||||
* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, [DocTags](https://arxiv.org/abs/2503.11576) and lossless JSON
|
||||
* 🔒 Local execution capabilities for sensitive data and air-gapped environments
|
||||
* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
|
||||
* 🔍 Extensive OCR support for scanned PDFs and images
|
||||
* 🥚 Support of several Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview)) 🔥
|
||||
* 👓 Support of several Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview))
|
||||
* 🎙️ Support for Audio with Automatic Speech Recognition (ASR) models
|
||||
* 💻 Simple and convenient CLI
|
||||
|
||||
### Coming soon
|
||||
|
@ -80,6 +80,7 @@ nav:
|
||||
- "VLM pipeline with SmolDocling": examples/minimal_vlm_pipeline.py
|
||||
- "VLM pipeline with remote model": examples/vlm_pipeline_api_model.py
|
||||
- "VLM comparison": examples/compare_vlm_models.py
|
||||
- "ASR pipeline with Whisper": examples/minimal_asr_pipeline.py
|
||||
- "Figure export": examples/export_figures.py
|
||||
- "Table export": examples/export_tables.py
|
||||
- "Multimodal export": examples/export_multimodal.py
|
||||
|
@ -1,6 +1,6 @@
|
||||
[project]
|
||||
name = "docling"
|
||||
version = "2.37.0" # DO NOT EDIT, updated automatically
|
||||
version = "2.38.0" # DO NOT EDIT, updated automatically
|
||||
description = "SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications."
|
||||
license = "MIT"
|
||||
keywords = [
|
||||
@ -99,6 +99,9 @@ rapidocr = [
|
||||
# 'onnxruntime (>=1.7.0,<2.0.0) ; python_version >= "3.10"',
|
||||
# 'onnxruntime (>=1.7.0,<1.20.0) ; python_version < "3.10"',
|
||||
]
|
||||
asr = [
|
||||
"openai-whisper>=20240930",
|
||||
]
|
||||
|
||||
[dependency-groups]
|
||||
dev = [
|
||||
@ -145,6 +148,9 @@ constraints = [
|
||||
package = true
|
||||
default-groups = "all"
|
||||
|
||||
[tool.uv.sources]
|
||||
openai-whisper = { git = "https://github.com/openai/whisper.git", rev = "dd985ac4b90cafeef8712f2998d62c59c3e62d22" }
|
||||
|
||||
[tool.setuptools.packages.find]
|
||||
include = ["docling*"]
|
||||
|
||||
|
BIN
tests/data/audio/sample_10s.mp3
vendored
Normal file
Binary file not shown.
BIN
tests/data/docx/word_image_anchors.docx
vendored
Normal file
Binary file not shown.
@ -2705,7 +2705,7 @@
|
||||
"b": 102.78223000000003,
|
||||
"coord_origin": "TOPLEFT"
|
||||
},
|
||||
"confidence": 0.9373534917831421,
|
||||
"confidence": 0.9373533725738525,
|
||||
"cells": [
|
||||
{
|
||||
"index": 0,
|
||||
@ -2745,7 +2745,7 @@
|
||||
"b": 102.78223000000003,
|
||||
"coord_origin": "TOPLEFT"
|
||||
},
|
||||
"confidence": 0.8858680725097656,
|
||||
"confidence": 0.8858679533004761,
|
||||
"cells": [
|
||||
{
|
||||
"index": 1,
|
||||
@ -13641,7 +13641,7 @@
|
||||
"b": 102.78223000000003,
|
||||
"coord_origin": "TOPLEFT"
|
||||
},
|
||||
"confidence": 0.9373534917831421,
|
||||
"confidence": 0.9373533725738525,
|
||||
"cells": [
|
||||
{
|
||||
"index": 0,
|
||||
@ -13687,7 +13687,7 @@
|
||||
"b": 102.78223000000003,
|
||||
"coord_origin": "TOPLEFT"
|
||||
},
|
||||
"confidence": 0.8858680725097656,
|
||||
"confidence": 0.8858679533004761,
|
||||
"cells": [
|
||||
{
|
||||
"index": 1,
|
||||
@ -26499,7 +26499,7 @@
|
||||
"b": 102.78223000000003,
|
||||
"coord_origin": "TOPLEFT"
|
||||
},
|
||||
"confidence": 0.9373534917831421,
|
||||
"confidence": 0.9373533725738525,
|
||||
"cells": [
|
||||
{
|
||||
"index": 0,
|
||||
@ -26545,7 +26545,7 @@
|
||||
"b": 102.78223000000003,
|
||||
"coord_origin": "TOPLEFT"
|
||||
},
|
||||
"confidence": 0.8858680725097656,
|
||||
"confidence": 0.8858679533004761,
|
||||
"cells": [
|
||||
{
|
||||
"index": 1,
|
||||
|
116
tests/data/groundtruth/docling_v2/2203.01017v2.json
vendored
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "2203.01017v2",
|
||||
"origin": {
|
||||
"mimetype": "application/pdf",
|
||||
@ -17863,7 +17863,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/1",
|
||||
@ -18753,7 +18754,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/2",
|
||||
@ -20117,7 +20119,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/3",
|
||||
@ -22266,7 +22269,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/4",
|
||||
@ -22927,7 +22931,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/5",
|
||||
@ -24050,7 +24055,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/6",
|
||||
@ -26307,7 +26313,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/7",
|
||||
@ -27600,7 +27607,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/8",
|
||||
@ -27635,7 +27643,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/9",
|
||||
@ -27670,7 +27679,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/10",
|
||||
@ -27705,7 +27715,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/11",
|
||||
@ -27740,7 +27751,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/12",
|
||||
@ -27783,7 +27795,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/13",
|
||||
@ -27818,7 +27831,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/14",
|
||||
@ -27853,7 +27867,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/15",
|
||||
@ -27888,7 +27903,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/16",
|
||||
@ -27931,7 +27947,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/17",
|
||||
@ -27966,7 +27983,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/18",
|
||||
@ -28001,7 +28019,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/19",
|
||||
@ -28036,7 +28055,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/20",
|
||||
@ -28071,7 +28091,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/21",
|
||||
@ -28106,7 +28127,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/22",
|
||||
@ -28141,7 +28163,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/23",
|
||||
@ -28176,7 +28199,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/24",
|
||||
@ -28211,7 +28235,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/25",
|
||||
@ -28246,7 +28271,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/26",
|
||||
@ -28281,7 +28307,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/27",
|
||||
@ -28324,7 +28351,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/28",
|
||||
@ -28359,7 +28387,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/29",
|
||||
@ -28394,7 +28423,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/30",
|
||||
@ -28429,7 +28459,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/31",
|
||||
@ -28464,7 +28495,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/32",
|
||||
@ -28499,7 +28531,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/33",
|
||||
@ -28542,7 +28575,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/34",
|
||||
@ -28577,7 +28611,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/35",
|
||||
@ -28612,7 +28647,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/36",
|
||||
@ -28647,7 +28683,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/37",
|
||||
@ -28682,7 +28719,8 @@
|
||||
"num_rows": 0,
|
||||
"num_cols": 0,
|
||||
"grid": []
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "2206.01062",
|
||||
"origin": {
|
||||
"mimetype": "application/pdf",
|
||||
@ -23491,7 +23491,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/1",
|
||||
@ -26654,7 +26655,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/2",
|
||||
@ -29187,7 +29189,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/3",
|
||||
@ -31574,7 +31577,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/4",
|
||||
@ -34177,7 +34181,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "2305.03393v1-pg9",
|
||||
"origin": {
|
||||
"mimetype": "application/pdf",
|
||||
@ -2104,7 +2104,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -2705,7 +2705,7 @@
|
||||
"b": 102.78223000000003,
|
||||
"coord_origin": "TOPLEFT"
|
||||
},
|
||||
"confidence": 0.9373534917831421,
|
||||
"confidence": 0.9373533725738525,
|
||||
"cells": [
|
||||
{
|
||||
"index": 0,
|
||||
@ -2745,7 +2745,7 @@
|
||||
"b": 102.78223000000003,
|
||||
"coord_origin": "TOPLEFT"
|
||||
},
|
||||
"confidence": 0.8858680725097656,
|
||||
"confidence": 0.8858679533004761,
|
||||
"cells": [
|
||||
{
|
||||
"index": 1,
|
||||
@ -13641,7 +13641,7 @@
|
||||
"b": 102.78223000000003,
|
||||
"coord_origin": "TOPLEFT"
|
||||
},
|
||||
"confidence": 0.9373534917831421,
|
||||
"confidence": 0.9373533725738525,
|
||||
"cells": [
|
||||
{
|
||||
"index": 0,
|
||||
@ -13687,7 +13687,7 @@
|
||||
"b": 102.78223000000003,
|
||||
"coord_origin": "TOPLEFT"
|
||||
},
|
||||
"confidence": 0.8858680725097656,
|
||||
"confidence": 0.8858679533004761,
|
||||
"cells": [
|
||||
{
|
||||
"index": 1,
|
||||
@ -26499,7 +26499,7 @@
|
||||
"b": 102.78223000000003,
|
||||
"coord_origin": "TOPLEFT"
|
||||
},
|
||||
"confidence": 0.9373534917831421,
|
||||
"confidence": 0.9373533725738525,
|
||||
"cells": [
|
||||
{
|
||||
"index": 0,
|
||||
@ -26545,7 +26545,7 @@
|
||||
"b": 102.78223000000003,
|
||||
"coord_origin": "TOPLEFT"
|
||||
},
|
||||
"confidence": 0.8858680725097656,
|
||||
"confidence": 0.8858679533004761,
|
||||
"cells": [
|
||||
{
|
||||
"index": 1,
|
||||
|
@ -60,6 +60,8 @@
|
||||
<page_header><loc_159><loc_59><loc_366><loc_64>Optimized Table Tokenization for Table Structure Recognition</page_header>
|
||||
<page_header><loc_389><loc_59><loc_393><loc_64>7</page_header>
|
||||
<picture><loc_135><loc_103><loc_367><loc_177><caption><loc_110><loc_79><loc_393><loc_98>Fig. 3. OTSL description of table structure: A - table example; B - graphical representation of table structure; C - mapping structure on a grid; D - OTSL structure encoding; E - explanation on cell encoding</caption></picture>
|
||||
<unordered_list><list_item><loc_273><loc_172><loc_349><loc_176>4 - 2d merges: "C", "L", "U", "X"</list_item>
|
||||
</unordered_list>
|
||||
<section_header_level_1><loc_110><loc_193><loc_202><loc_198>4.2 Language Syntax</section_header_level_1>
|
||||
<text><loc_110><loc_205><loc_297><loc_211>The OTSL representation follows these syntax rules:</text>
|
||||
<unordered_list><list_item><loc_114><loc_219><loc_393><loc_232>1. Left-looking cell rule : The left neighbour of an "L" cell must be either another "L" cell or a "C" cell.</list_item>
|
||||
|
892
tests/data/groundtruth/docling_v2/2305.03393v1.json
vendored
File diff suppressed because it is too large
@ -84,6 +84,8 @@ Fig. 3. OTSL description of table structure: A - table example; B - graphical re
|
||||
|
||||
<!-- image -->
|
||||
|
||||
- 4 - 2d merges: "C", "L", "U", "X"
|
||||
|
||||
## 4.2 Language Syntax
|
||||
|
||||
The OTSL representation follows these syntax rules:
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "amt_handbook_sample",
|
||||
"origin": {
|
||||
"mimetype": "application/pdf",
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "code_and_formula",
|
||||
"origin": {
|
||||
"mimetype": "application/pdf",
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "csv-comma-in-cell",
|
||||
"origin": {
|
||||
"mimetype": "text/csv",
|
||||
@ -538,7 +538,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "csv-comma",
|
||||
"origin": {
|
||||
"mimetype": "text/csv",
|
||||
@ -1788,7 +1788,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "csv-inconsistent-header",
|
||||
"origin": {
|
||||
"mimetype": "text/csv",
|
||||
@ -526,7 +526,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "csv-pipe",
|
||||
"origin": {
|
||||
"mimetype": "text/csv",
|
||||
@ -1788,7 +1788,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "csv-semicolon",
|
||||
"origin": {
|
||||
"mimetype": "text/csv",
|
||||
@ -1788,7 +1788,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "csv-tab",
|
||||
"origin": {
|
||||
"mimetype": "text/csv",
|
||||
@ -1788,7 +1788,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "csv-too-few-columns",
|
||||
"origin": {
|
||||
"mimetype": "text/csv",
|
||||
@ -526,7 +526,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "csv-too-many-columns",
|
||||
"origin": {
|
||||
"mimetype": "text/csv",
|
||||
@ -610,7 +610,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "equations",
|
||||
"origin": {
|
||||
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
@ -250,7 +250,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -280,7 +281,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -322,7 +324,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -436,7 +439,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -466,7 +470,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -520,7 +525,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -634,7 +640,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "example_01",
|
||||
"origin": {
|
||||
"mimetype": "text/html",
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "example_02",
|
||||
"origin": {
|
||||
"mimetype": "text/html",
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "example_03",
|
||||
"origin": {
|
||||
"mimetype": "text/html",
|
||||
@ -637,7 +637,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "example_04",
|
||||
"origin": {
|
||||
"mimetype": "text/html",
|
||||
@ -325,7 +325,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "example_05",
|
||||
"origin": {
|
||||
"mimetype": "text/html",
|
||||
@ -325,7 +325,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "example_06",
|
||||
"origin": {
|
||||
"mimetype": "text/html",
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "example_07",
|
||||
"origin": {
|
||||
"mimetype": "text/html",
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "example_08",
|
||||
"origin": {
|
||||
"mimetype": "text/html",
|
||||
@ -661,7 +661,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/1",
|
||||
@ -1330,7 +1331,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/2",
|
||||
@ -1999,7 +2001,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -11,10 +11,12 @@ Create your feature branch: `git checkout -b feature/AmazingFeature` .
|
||||
3. Commit your changes ( `git commit -m 'Add some AmazingFeature'` )
|
||||
4. Push to the branch ( `git push origin feature/AmazingFeature` )
|
||||
5. Open a Pull Request
|
||||
6. **Whole list item has same formatting**
|
||||
7. List item has *mixed or partial* formatting
|
||||
|
||||
##
|
||||
*# Whole heading is italic*
|
||||
|
||||
*Second* section
|
||||
<<<<<<< HEAD
|
||||
|
||||
- **First** : Lorem ipsum.
|
||||
- **Second** : Dolor `sit` amet.
|
||||
@ -22,3 +24,13 @@ Create your feature branch: `git checkout -b feature/AmazingFeature` .
|
||||
| Bold Heading | Italic Heading |
|
||||
|----------------|------------------|
|
||||
| data a | data b |
|
||||
|
||||
Some *`formatted_code`*
|
||||
|
||||
##
|
||||
|
||||
*Partially formatted* heading to\_escape `not_to_escape`
|
||||
|
||||
[$$E=mc^2$$](https://en.wikipedia.org/wiki/Albert_Einstein)
|
||||
|
||||
origin/main
|
||||
|
@ -5,10 +5,14 @@ body:
|
||||
- $ref: '#/groups/0'
|
||||
- $ref: '#/groups/1'
|
||||
- $ref: '#/groups/2'
|
||||
- $ref: '#/texts/27'
|
||||
- $ref: '#/texts/32'
|
||||
- $ref: '#/texts/33'
|
||||
- $ref: '#/groups/8'
|
||||
- $ref: '#/groups/11'
|
||||
- $ref: '#/tables/0'
|
||||
- $ref: '#/groups/11'
|
||||
- $ref: '#/texts/44'
|
||||
- $ref: '#/texts/48'
|
||||
- $ref: '#/texts/49'
|
||||
content_layer: body
|
||||
label: unspecified
|
||||
name: _root_
|
||||
@ -49,6 +53,8 @@ groups:
|
||||
- $ref: '#/texts/18'
|
||||
- $ref: '#/texts/22'
|
||||
- $ref: '#/texts/26'
|
||||
- $ref: '#/texts/27'
|
||||
- $ref: '#/texts/28'
|
||||
content_layer: body
|
||||
label: ordered_list
|
||||
name: list
|
||||
@ -96,17 +102,18 @@ groups:
|
||||
$ref: '#/texts/22'
|
||||
self_ref: '#/groups/6'
|
||||
- children:
|
||||
- $ref: '#/texts/28'
|
||||
- $ref: '#/texts/29'
|
||||
- $ref: '#/texts/30'
|
||||
- $ref: '#/texts/31'
|
||||
content_layer: body
|
||||
label: inline
|
||||
name: group
|
||||
parent:
|
||||
$ref: '#/texts/27'
|
||||
$ref: '#/texts/28'
|
||||
self_ref: '#/groups/7'
|
||||
- children:
|
||||
- $ref: '#/texts/30'
|
||||
- $ref: '#/texts/33'
|
||||
- $ref: '#/texts/34'
|
||||
- $ref: '#/texts/37'
|
||||
content_layer: body
|
||||
label: list
|
||||
name: list
|
||||
@ -114,36 +121,48 @@ groups:
|
||||
$ref: '#/body'
|
||||
self_ref: '#/groups/8'
|
||||
- children:
|
||||
- $ref: '#/texts/31'
|
||||
- $ref: '#/texts/32'
|
||||
content_layer: body
|
||||
label: inline
|
||||
name: group
|
||||
parent:
|
||||
$ref: '#/texts/30'
|
||||
self_ref: '#/groups/9'
|
||||
- children:
|
||||
- $ref: '#/texts/34'
|
||||
- $ref: '#/texts/35'
|
||||
- $ref: '#/texts/36'
|
||||
- $ref: '#/texts/37'
|
||||
content_layer: body
|
||||
label: inline
|
||||
name: group
|
||||
parent:
|
||||
$ref: '#/texts/33'
|
||||
$ref: '#/texts/34'
|
||||
self_ref: '#/groups/9'
|
||||
- children:
|
||||
- $ref: '#/texts/38'
|
||||
- $ref: '#/texts/39'
|
||||
- $ref: '#/texts/40'
|
||||
- $ref: '#/texts/41'
|
||||
content_layer: body
|
||||
label: inline
|
||||
name: group
|
||||
parent:
|
||||
$ref: '#/texts/37'
|
||||
self_ref: '#/groups/10'
|
||||
- children: []
|
||||
- children:
|
||||
- $ref: '#/texts/42'
|
||||
- $ref: '#/texts/43'
|
||||
content_layer: body
|
||||
label: inline
|
||||
name: group
|
||||
parent:
|
||||
$ref: '#/body'
|
||||
self_ref: '#/groups/11'
|
||||
- children:
|
||||
- $ref: '#/texts/45'
|
||||
- $ref: '#/texts/46'
|
||||
- $ref: '#/texts/47'
|
||||
content_layer: body
|
||||
label: inline
|
||||
name: group
|
||||
parent:
|
||||
$ref: '#/texts/44'
|
||||
self_ref: '#/groups/12'
|
||||
key_value_items: []
|
||||
name: inline_and_formatting
|
||||
origin:
|
||||
binary_hash: 15980020574215496313
|
||||
binary_hash: 1036526097556828366
|
||||
filename: inline_and_formatting.md
|
||||
mimetype: text/markdown
|
||||
pages: {}
|
||||
@ -613,18 +632,47 @@ texts:
|
||||
self_ref: '#/texts/26'
|
||||
text: Open a Pull Request
|
||||
word_items_ids: []
|
||||
- children: []
|
||||
content_layer: body
|
||||
enumerated: true
|
||||
formatting:
|
||||
bold: true
|
||||
italic: false
|
||||
script: baseline
|
||||
strikethrough: false
|
||||
underline: false
|
||||
label: list_item
|
||||
marker: '-'
|
||||
orig: Whole list item has same formatting
|
||||
parent:
|
||||
$ref: '#/groups/2'
|
||||
prov: []
|
||||
self_ref: '#/texts/27'
|
||||
text: Whole list item has same formatting
|
||||
word_items_ids: []
|
||||
- children:
|
||||
- $ref: '#/groups/7'
|
||||
content_layer: body
|
||||
label: section_header
|
||||
level: 1
|
||||
enumerated: true
|
||||
label: list_item
|
||||
marker: '-'
|
||||
orig: ''
|
||||
parent:
|
||||
$ref: '#/body'
|
||||
$ref: '#/groups/2'
|
||||
prov: []
|
||||
self_ref: '#/texts/27'
|
||||
self_ref: '#/texts/28'
|
||||
text: ''
|
||||
word_items_ids: []
|
||||
- children: []
|
||||
content_layer: body
|
||||
label: text
|
||||
orig: List item has
|
||||
parent:
|
||||
$ref: '#/groups/7'
|
||||
prov: []
|
||||
self_ref: '#/texts/29'
|
||||
text: List item has
|
||||
word_items_ids: []
|
||||
- children: []
|
||||
content_layer: body
|
||||
formatting:
|
||||
@ -634,22 +682,48 @@ texts:
|
||||
strikethrough: false
|
||||
underline: false
|
||||
label: text
|
||||
orig: Second
|
||||
orig: mixed or partial
|
||||
parent:
|
||||
$ref: '#/groups/7'
|
||||
prov: []
|
||||
self_ref: '#/texts/28'
|
||||
text: Second
|
||||
self_ref: '#/texts/30'
|
||||
text: mixed or partial
|
||||
word_items_ids: []
|
||||
- children: []
|
||||
content_layer: body
|
||||
label: text
|
||||
orig: section
|
||||
orig: formatting
|
||||
parent:
|
||||
$ref: '#/groups/7'
|
||||
prov: []
|
||||
self_ref: '#/texts/29'
|
||||
text: section
|
||||
self_ref: '#/texts/31'
|
||||
text: formatting
|
||||
word_items_ids: []
|
||||
- children: []
|
||||
content_layer: body
|
||||
formatting:
|
||||
bold: false
|
||||
italic: true
|
||||
script: baseline
|
||||
strikethrough: false
|
||||
underline: false
|
||||
label: title
|
||||
orig: Whole heading is italic
|
||||
parent:
|
||||
$ref: '#/body'
|
||||
prov: []
|
||||
self_ref: '#/texts/32'
|
||||
text: Whole heading is italic
|
||||
word_items_ids: []
|
||||
- children: []
|
||||
content_layer: body
|
||||
label: text
|
||||
orig: <<<<<<< HEAD
|
||||
parent:
|
||||
$ref: '#/body'
|
||||
prov: []
|
||||
self_ref: '#/texts/33'
|
||||
text: <<<<<<< HEAD
|
||||
word_items_ids: []
|
||||
- children:
|
||||
- $ref: '#/groups/9'
|
||||
@ -661,7 +735,7 @@ texts:
|
||||
parent:
|
||||
$ref: '#/groups/8'
|
||||
prov: []
|
||||
self_ref: '#/texts/30'
|
||||
self_ref: '#/texts/34'
|
||||
text: ''
|
||||
word_items_ids: []
|
||||
- children: []
|
||||
@ -677,7 +751,7 @@ texts:
|
||||
parent:
|
||||
$ref: '#/groups/9'
|
||||
prov: []
|
||||
self_ref: '#/texts/31'
|
||||
self_ref: '#/texts/35'
|
||||
text: First
|
||||
word_items_ids: []
|
||||
- children: []
|
||||
@ -687,7 +761,7 @@ texts:
|
||||
parent:
|
||||
$ref: '#/groups/9'
|
||||
prov: []
|
||||
self_ref: '#/texts/32'
|
||||
self_ref: '#/texts/36'
|
||||
text: ': Lorem ipsum.'
|
||||
word_items_ids: []
|
||||
- children:
|
||||
@ -700,7 +774,7 @@ texts:
|
||||
parent:
|
||||
$ref: '#/groups/8'
|
||||
prov: []
|
||||
self_ref: '#/texts/33'
|
||||
self_ref: '#/texts/37'
|
||||
text: ''
|
||||
word_items_ids: []
|
||||
- children: []
|
||||
@ -716,7 +790,7 @@ texts:
|
||||
parent:
|
||||
$ref: '#/groups/10'
|
||||
prov: []
|
||||
self_ref: '#/texts/34'
|
||||
self_ref: '#/texts/38'
|
||||
text: Second
|
||||
word_items_ids: []
|
||||
- children: []
|
||||
@ -726,7 +800,7 @@ texts:
|
||||
parent:
|
||||
$ref: '#/groups/10'
|
||||
prov: []
|
||||
self_ref: '#/texts/35'
|
||||
self_ref: '#/texts/39'
|
||||
text: ': Dolor'
|
||||
word_items_ids: []
|
||||
- captions: []
|
||||
@ -740,7 +814,7 @@ texts:
|
||||
$ref: '#/groups/10'
|
||||
prov: []
|
||||
references: []
|
||||
self_ref: '#/texts/36'
|
||||
self_ref: '#/texts/40'
|
||||
text: sit
|
||||
word_items_ids: []
|
||||
- children: []
|
||||
@ -750,7 +824,110 @@ texts:
|
||||
parent:
|
||||
$ref: '#/groups/10'
|
||||
prov: []
|
||||
self_ref: '#/texts/37'
|
||||
self_ref: '#/texts/41'
|
||||
text: amet.
|
||||
word_items_ids: []
|
||||
- children: []
|
||||
content_layer: body
|
||||
label: text
|
||||
orig: Some
|
||||
parent:
|
||||
$ref: '#/groups/11'
|
||||
prov: []
|
||||
self_ref: '#/texts/42'
|
||||
text: Some
|
||||
word_items_ids: []
|
||||
- captions: []
|
||||
children: []
|
||||
code_language: unknown
|
||||
content_layer: body
|
||||
footnotes: []
|
||||
formatting:
|
||||
bold: false
|
||||
italic: true
|
||||
script: baseline
|
||||
strikethrough: false
|
||||
underline: false
|
||||
label: code
|
||||
orig: formatted_code
|
||||
parent:
|
||||
$ref: '#/groups/11'
|
||||
prov: []
|
||||
references: []
|
||||
self_ref: '#/texts/43'
|
||||
text: formatted_code
|
||||
word_items_ids: []
|
||||
- children:
|
||||
- $ref: '#/groups/12'
|
||||
content_layer: body
|
||||
label: section_header
|
||||
level: 1
|
||||
orig: ''
|
||||
parent:
|
||||
$ref: '#/body'
|
||||
prov: []
|
||||
self_ref: '#/texts/44'
|
||||
text: ''
|
||||
word_items_ids: []
|
||||
- children: []
|
||||
content_layer: body
|
||||
formatting:
|
||||
bold: false
|
||||
italic: true
|
||||
script: baseline
|
||||
strikethrough: false
|
||||
underline: false
|
||||
label: text
|
||||
orig: Partially formatted
|
||||
parent:
|
||||
$ref: '#/groups/12'
|
||||
prov: []
|
||||
self_ref: '#/texts/45'
|
||||
text: Partially formatted
|
||||
word_items_ids: []
|
||||
- children: []
|
||||
content_layer: body
|
||||
label: text
|
||||
orig: heading to_escape
|
||||
parent:
|
||||
$ref: '#/groups/12'
|
||||
prov: []
|
||||
self_ref: '#/texts/46'
|
||||
text: heading to_escape
|
||||
word_items_ids: []
|
||||
- captions: []
|
||||
children: []
|
||||
code_language: unknown
|
||||
content_layer: body
|
||||
footnotes: []
|
||||
label: code
|
||||
orig: not_to_escape
|
||||
parent:
|
||||
$ref: '#/groups/12'
|
||||
prov: []
|
||||
references: []
|
||||
self_ref: '#/texts/47'
|
||||
text: not_to_escape
|
||||
word_items_ids: []
|
||||
- children: []
|
||||
content_layer: body
|
||||
hyperlink: https://en.wikipedia.org/wiki/Albert_Einstein
|
||||
label: text
|
||||
orig: $$E=mc^2$$
|
||||
parent:
|
||||
$ref: '#/body'
|
||||
prov: []
|
||||
self_ref: '#/texts/48'
|
||||
text: $$E=mc^2$$
|
||||
word_items_ids: []
|
||||
- children: []
|
||||
content_layer: body
|
||||
label: text
|
||||
orig: origin/main
|
||||
parent:
|
||||
$ref: '#/body'
|
||||
prov: []
|
||||
self_ref: '#/texts/49'
|
||||
text: origin/main
|
||||
word_items_ids: []
|
||||
version: 1.4.0
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "ipa20180000016.xml",
|
||||
"origin": {
|
||||
"mimetype": "application/xml",
|
||||
@ -6005,7 +6005,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "ipa20200022300.xml",
|
||||
"origin": {
|
||||
"mimetype": "application/xml",
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "lorem_ipsum",
|
||||
"origin": {
|
||||
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
@ -66,7 +66,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -96,7 +97,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -126,7 +128,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -156,7 +159,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -186,7 +190,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
}
|
||||
],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "multi_page",
|
||||
"origin": {
|
||||
"mimetype": "application/pdf",
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "pa20010031492.xml",
|
||||
"origin": {
|
||||
"mimetype": "application/xml",
|
||||
@ -2127,7 +2127,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "pftaps057006474.txt",
|
||||
"origin": {
|
||||
"mimetype": "text/plain",
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "pg06442728.xml",
|
||||
"origin": {
|
||||
"mimetype": "application/xml",
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "picture_classification",
|
||||
"origin": {
|
||||
"mimetype": "application/pdf",
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "powerpoint_bad_text",
|
||||
"origin": {
|
||||
"mimetype": "application/vnd.ms-powerpoint",
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "powerpoint_sample",
|
||||
"origin": {
|
||||
"mimetype": "application/vnd.ms-powerpoint",
|
||||
@ -2199,7 +2199,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "powerpoint_with_image",
|
||||
"origin": {
|
||||
"mimetype": "application/vnd.ms-powerpoint",
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "redp5110_sampled",
|
||||
"origin": {
|
||||
"mimetype": "application/pdf",
|
||||
@ -12471,7 +12471,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/1",
|
||||
@ -13096,7 +13097,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/2",
|
||||
@ -15356,7 +15358,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/3",
|
||||
@ -15713,7 +15716,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/4",
|
||||
@ -16918,7 +16922,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "right_to_left_01",
|
||||
"origin": {
|
||||
"mimetype": "application/pdf",
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "right_to_left_02",
|
||||
"origin": {
|
||||
"mimetype": "application/pdf",
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "right_to_left_03",
|
||||
"origin": {
|
||||
"mimetype": "application/pdf",
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "sample_sales_data",
|
||||
"origin": {
|
||||
"mimetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
|
||||
@ -2136,7 +2136,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "tablecell",
|
||||
"origin": {
|
||||
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
@ -78,7 +78,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -98,7 +99,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -130,7 +132,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -172,7 +175,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
}
|
||||
],
|
||||
@ -419,7 +423,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "test-01",
|
||||
"origin": {
|
||||
"mimetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
|
||||
@ -681,7 +681,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/1",
|
||||
@ -1599,7 +1600,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/2",
|
||||
@ -2005,7 +2007,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/3",
|
||||
@ -2411,7 +2414,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/4",
|
||||
@ -2893,7 +2897,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/5",
|
||||
@ -3375,7 +3380,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "test_emf_docx",
|
||||
"origin": {
|
||||
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
@ -60,7 +60,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -78,7 +79,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -96,7 +98,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -114,7 +117,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
}
|
||||
],
|
||||
|
115
tests/data/groundtruth/docling_v2/textbox.docx.itxt
vendored
@ -11,83 +11,82 @@ item-0 at level 0: unspecified: group _root_
|
||||
* Blisters
|
||||
* Headache
|
||||
* Sore throat
|
||||
item-9 at level 1: list_item:
|
||||
item-9 at level 1: paragraph:
|
||||
item-10 at level 1: paragraph:
|
||||
item-11 at level 1: paragraph:
|
||||
item-12 at level 1: section: group textbox
|
||||
item-13 at level 2: paragraph: If a caregiver suspects that wit ... the same suggested reportable symptoms
|
||||
item-11 at level 1: section: group textbox
|
||||
item-12 at level 2: paragraph: If a caregiver suspects that wit ... the same suggested reportable symptoms
|
||||
item-13 at level 1: paragraph:
|
||||
item-14 at level 1: paragraph:
|
||||
item-15 at level 1: paragraph:
|
||||
item-16 at level 1: paragraph:
|
||||
item-17 at level 1: paragraph:
|
||||
item-18 at level 1: section: group textbox
|
||||
item-19 at level 2: paragraph: Yes
|
||||
item-17 at level 1: section: group textbox
|
||||
item-18 at level 2: paragraph: Yes
|
||||
item-19 at level 1: paragraph:
|
||||
item-20 at level 1: paragraph:
|
||||
item-21 at level 1: paragraph:
|
||||
item-22 at level 1: section: group textbox
|
||||
item-23 at level 2: list: group list
|
||||
item-24 at level 3: list_item: A report must be submitted withi ... saster Prevention Information Network.
|
||||
item-25 at level 3: list_item: A report must also be submitted ... d Infectious Disease Reporting System.
|
||||
item-26 at level 2: paragraph:
|
||||
item-27 at level 1: list: group list
|
||||
item-28 at level 2: list_item:
|
||||
item-21 at level 1: section: group textbox
|
||||
item-22 at level 2: list: group list
|
||||
item-23 at level 3: list_item: A report must be submitted withi ... saster Prevention Information Network.
|
||||
item-24 at level 3: list_item: A report must also be submitted ... d Infectious Disease Reporting System.
|
||||
item-25 at level 2: paragraph:
|
||||
item-26 at level 1: list: group list
|
||||
item-27 at level 2: list_item:
|
||||
item-28 at level 1: paragraph:
|
||||
item-29 at level 1: paragraph:
|
||||
item-30 at level 1: paragraph:
|
||||
item-31 at level 1: paragraph:
|
||||
item-32 at level 1: paragraph:
|
||||
item-33 at level 1: paragraph:
|
||||
item-34 at level 1: section: group textbox
|
||||
item-35 at level 2: paragraph: Health Bureau:
|
||||
item-36 at level 2: paragraph: Upon receiving a report from the ... rt to the Centers for Disease Control.
|
||||
item-37 at level 2: list: group list
|
||||
item-38 at level 3: list_item: If necessary, provide health edu ... vidual to undergo specimen collection.
|
||||
item-39 at level 3: list_item: Implement appropriate epidemic p ... the Communicable Disease Control Act.
|
||||
item-40 at level 2: paragraph:
|
||||
item-41 at level 1: list: group list
|
||||
item-42 at level 2: list_item:
|
||||
item-43 at level 1: paragraph:
|
||||
item-44 at level 1: section: group textbox
|
||||
item-45 at level 2: paragraph: Department of Education:
|
||||
item-33 at level 1: section: group textbox
|
||||
item-34 at level 2: paragraph: Health Bureau:
|
||||
item-35 at level 2: paragraph: Upon receiving a report from the ... rt to the Centers for Disease Control.
|
||||
item-36 at level 2: list: group list
|
||||
item-37 at level 3: list_item: If necessary, provide health edu ... vidual to undergo specimen collection.
|
||||
item-38 at level 3: list_item: Implement appropriate epidemic p ... the Communicable Disease Control Act.
|
||||
item-39 at level 2: paragraph:
|
||||
item-40 at level 1: list: group list
|
||||
item-41 at level 2: list_item:
|
||||
item-42 at level 1: paragraph:
|
||||
item-43 at level 1: section: group textbox
|
||||
item-44 at level 2: paragraph: Department of Education:
|
||||
Collabo ... vention measures at all school levels.
|
||||
item-45 at level 1: paragraph:
|
||||
item-46 at level 1: paragraph:
|
||||
item-47 at level 1: paragraph:
|
||||
item-48 at level 1: paragraph:
|
||||
item-49 at level 1: paragraph:
|
||||
item-50 at level 1: paragraph:
|
||||
item-51 at level 1: paragraph:
|
||||
item-52 at level 1: paragraph:
|
||||
item-53 at level 1: section: group textbox
|
||||
item-54 at level 2: inline: group group
|
||||
item-55 at level 3: paragraph: The Health Bureau will handle
|
||||
item-56 at level 3: paragraph: reporting and specimen collection
|
||||
item-57 at level 3: paragraph: .
|
||||
item-58 at level 2: paragraph:
|
||||
item-52 at level 1: section: group textbox
|
||||
item-53 at level 2: inline: group group
|
||||
item-54 at level 3: paragraph: The Health Bureau will handle
|
||||
item-55 at level 3: paragraph: reporting and specimen collection
|
||||
item-56 at level 3: paragraph: .
|
||||
item-57 at level 2: paragraph:
|
||||
item-58 at level 1: paragraph:
|
||||
item-59 at level 1: paragraph:
|
||||
item-60 at level 1: paragraph:
|
||||
item-61 at level 1: paragraph:
|
||||
item-62 at level 1: section: group textbox
|
||||
item-63 at level 2: paragraph: Whether the epidemic has eased.
|
||||
item-64 at level 2: paragraph:
|
||||
item-65 at level 1: paragraph:
|
||||
item-66 at level 1: section: group textbox
|
||||
item-67 at level 2: paragraph: Whether the test results are pos ... legally designated infectious disease.
|
||||
item-68 at level 2: paragraph: No
|
||||
item-61 at level 1: section: group textbox
|
||||
item-62 at level 2: paragraph: Whether the epidemic has eased.
|
||||
item-63 at level 2: paragraph:
|
||||
item-64 at level 1: paragraph:
|
||||
item-65 at level 1: section: group textbox
|
||||
item-66 at level 2: paragraph: Whether the test results are pos ... legally designated infectious disease.
|
||||
item-67 at level 2: paragraph: No
|
||||
item-68 at level 1: paragraph:
|
||||
item-69 at level 1: paragraph:
|
||||
item-70 at level 1: paragraph:
|
||||
item-71 at level 1: section: group textbox
|
||||
item-72 at level 2: paragraph: Yes
|
||||
item-73 at level 1: paragraph:
|
||||
item-74 at level 1: section: group textbox
|
||||
item-75 at level 2: paragraph: Yes
|
||||
item-70 at level 1: section: group textbox
|
||||
item-71 at level 2: paragraph: Yes
|
||||
item-72 at level 1: paragraph:
|
||||
item-73 at level 1: section: group textbox
|
||||
item-74 at level 2: paragraph: Yes
|
||||
item-75 at level 1: paragraph:
|
||||
item-76 at level 1: paragraph:
|
||||
item-77 at level 1: paragraph:
|
||||
item-78 at level 1: section: group textbox
|
||||
item-79 at level 2: paragraph: Case closed.
|
||||
item-80 at level 2: paragraph:
|
||||
item-81 at level 2: paragraph: The Health Bureau will carry out ... ters for Disease Control if necessary.
|
||||
item-82 at level 1: paragraph:
|
||||
item-83 at level 1: section: group textbox
|
||||
item-84 at level 2: paragraph: No
|
||||
item-77 at level 1: section: group textbox
|
||||
item-78 at level 2: paragraph: Case closed.
|
||||
item-79 at level 2: paragraph:
|
||||
item-80 at level 2: paragraph: The Health Bureau will carry out ... ters for Disease Control if necessary.
|
||||
item-81 at level 1: paragraph:
|
||||
item-82 at level 1: section: group textbox
|
||||
item-83 at level 2: paragraph: No
|
||||
item-84 at level 1: paragraph:
|
||||
item-85 at level 1: paragraph:
|
||||
item-86 at level 1: paragraph:
|
||||
item-87 at level 1: paragraph:
|
433
tests/data/groundtruth/docling_v2/textbox.docx.json
vendored
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "textbox",
|
||||
"origin": {
|
||||
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
@ -36,10 +36,10 @@
|
||||
"$ref": "#/texts/7"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/8"
|
||||
"$ref": "#/groups/2"
|
||||
},
|
||||
{
|
||||
"$ref": "#/groups/2"
|
||||
"$ref": "#/texts/9"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/10"
|
||||
@ -50,17 +50,14 @@
|
||||
{
|
||||
"$ref": "#/texts/12"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/13"
|
||||
},
|
||||
{
|
||||
"$ref": "#/groups/3"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/15"
|
||||
"$ref": "#/texts/14"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/16"
|
||||
"$ref": "#/texts/15"
|
||||
},
|
||||
{
|
||||
"$ref": "#/groups/4"
|
||||
@ -68,6 +65,9 @@
|
||||
{
|
||||
"$ref": "#/groups/6"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/20"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/21"
|
||||
},
|
||||
@ -80,9 +80,6 @@
|
||||
{
|
||||
"$ref": "#/texts/24"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/25"
|
||||
},
|
||||
{
|
||||
"$ref": "#/groups/7"
|
||||
},
|
||||
@ -90,11 +87,14 @@
|
||||
"$ref": "#/groups/9"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/32"
|
||||
"$ref": "#/texts/31"
|
||||
},
|
||||
{
|
||||
"$ref": "#/groups/10"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/33"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/34"
|
||||
},
|
||||
@ -114,10 +114,10 @@
|
||||
"$ref": "#/texts/39"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/40"
|
||||
"$ref": "#/groups/11"
|
||||
},
|
||||
{
|
||||
"$ref": "#/groups/11"
|
||||
"$ref": "#/texts/44"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/45"
|
||||
@ -125,56 +125,53 @@
|
||||
{
|
||||
"$ref": "#/texts/46"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/47"
|
||||
},
|
||||
{
|
||||
"$ref": "#/groups/13"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/50"
|
||||
"$ref": "#/texts/49"
|
||||
},
|
||||
{
|
||||
"$ref": "#/groups/14"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/53"
|
||||
"$ref": "#/texts/52"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/54"
|
||||
"$ref": "#/texts/53"
|
||||
},
|
||||
{
|
||||
"$ref": "#/groups/15"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/56"
|
||||
"$ref": "#/texts/55"
|
||||
},
|
||||
{
|
||||
"$ref": "#/groups/16"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/58"
|
||||
"$ref": "#/texts/57"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/59"
|
||||
"$ref": "#/texts/58"
|
||||
},
|
||||
{
|
||||
"$ref": "#/groups/17"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/63"
|
||||
"$ref": "#/texts/62"
|
||||
},
|
||||
{
|
||||
"$ref": "#/groups/18"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/64"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/65"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/66"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/67"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -223,7 +220,7 @@
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/9"
|
||||
"$ref": "#/texts/8"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -237,7 +234,7 @@
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/14"
|
||||
"$ref": "#/texts/13"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -254,7 +251,7 @@
|
||||
"$ref": "#/groups/5"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/19"
|
||||
"$ref": "#/texts/18"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -268,10 +265,10 @@
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/17"
|
||||
"$ref": "#/texts/16"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/18"
|
||||
"$ref": "#/texts/17"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -285,7 +282,7 @@
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/20"
|
||||
"$ref": "#/texts/19"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -299,16 +296,16 @@
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/26"
|
||||
"$ref": "#/texts/25"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/27"
|
||||
"$ref": "#/texts/26"
|
||||
},
|
||||
{
|
||||
"$ref": "#/groups/8"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/30"
|
||||
"$ref": "#/texts/29"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -322,10 +319,10 @@
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/28"
|
||||
"$ref": "#/texts/27"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/29"
|
||||
"$ref": "#/texts/28"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -339,7 +336,7 @@
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/31"
|
||||
"$ref": "#/texts/30"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -353,7 +350,7 @@
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/33"
|
||||
"$ref": "#/texts/32"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -370,7 +367,7 @@
|
||||
"$ref": "#/groups/12"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/44"
|
||||
"$ref": "#/texts/43"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -383,14 +380,14 @@
|
||||
"$ref": "#/groups/11"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/40"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/41"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/42"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/43"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -404,10 +401,10 @@
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/48"
|
||||
"$ref": "#/texts/47"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/49"
|
||||
"$ref": "#/texts/48"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -421,10 +418,10 @@
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/51"
|
||||
"$ref": "#/texts/50"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/52"
|
||||
"$ref": "#/texts/51"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -438,7 +435,7 @@
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/55"
|
||||
"$ref": "#/texts/54"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -452,7 +449,7 @@
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/57"
|
||||
"$ref": "#/texts/56"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -465,14 +462,14 @@
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/59"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/60"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/61"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/62"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -486,7 +483,7 @@
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/64"
|
||||
"$ref": "#/texts/63"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -510,7 +507,8 @@
|
||||
"bold": true,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -528,7 +526,8 @@
|
||||
"bold": true,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -558,7 +557,8 @@
|
||||
"bold": true,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -588,7 +588,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -600,12 +601,10 @@
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": "",
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/7",
|
||||
@ -621,18 +620,6 @@
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/8",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/9",
|
||||
"parent": {
|
||||
"$ref": "#/groups/2"
|
||||
},
|
||||
@ -646,9 +633,22 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/9",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/10",
|
||||
"parent": {
|
||||
@ -687,18 +687,6 @@
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/13",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/14",
|
||||
"parent": {
|
||||
"$ref": "#/groups/3"
|
||||
},
|
||||
@ -712,9 +700,22 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/14",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/15",
|
||||
"parent": {
|
||||
@ -729,18 +730,6 @@
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/16",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/17",
|
||||
"parent": {
|
||||
"$ref": "#/groups/5"
|
||||
},
|
||||
@ -754,13 +743,14 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/18",
|
||||
"self_ref": "#/texts/17",
|
||||
"parent": {
|
||||
"$ref": "#/groups/5"
|
||||
},
|
||||
@ -774,13 +764,14 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/19",
|
||||
"self_ref": "#/texts/18",
|
||||
"parent": {
|
||||
"$ref": "#/groups/4"
|
||||
},
|
||||
@ -792,7 +783,7 @@
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/20",
|
||||
"self_ref": "#/texts/19",
|
||||
"parent": {
|
||||
"$ref": "#/groups/6"
|
||||
},
|
||||
@ -805,6 +796,18 @@
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/20",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/21",
|
||||
"parent": {
|
||||
@ -855,18 +858,6 @@
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/25",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/26",
|
||||
"parent": {
|
||||
"$ref": "#/groups/7"
|
||||
},
|
||||
@ -880,11 +871,12 @@
|
||||
"bold": true,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/27",
|
||||
"self_ref": "#/texts/26",
|
||||
"parent": {
|
||||
"$ref": "#/groups/7"
|
||||
},
|
||||
@ -898,11 +890,12 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/28",
|
||||
"self_ref": "#/texts/27",
|
||||
"parent": {
|
||||
"$ref": "#/groups/8"
|
||||
},
|
||||
@ -916,13 +909,14 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/29",
|
||||
"self_ref": "#/texts/28",
|
||||
"parent": {
|
||||
"$ref": "#/groups/8"
|
||||
},
|
||||
@ -936,13 +930,14 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/30",
|
||||
"self_ref": "#/texts/29",
|
||||
"parent": {
|
||||
"$ref": "#/groups/7"
|
||||
},
|
||||
@ -954,7 +949,7 @@
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/31",
|
||||
"self_ref": "#/texts/30",
|
||||
"parent": {
|
||||
"$ref": "#/groups/9"
|
||||
},
|
||||
@ -968,7 +963,7 @@
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/32",
|
||||
"self_ref": "#/texts/31",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
@ -980,7 +975,7 @@
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/33",
|
||||
"self_ref": "#/texts/32",
|
||||
"parent": {
|
||||
"$ref": "#/groups/10"
|
||||
},
|
||||
@ -994,9 +989,22 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/33",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/34",
|
||||
"parent": {
|
||||
@ -1071,18 +1079,6 @@
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/40",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/41",
|
||||
"parent": {
|
||||
"$ref": "#/groups/12"
|
||||
},
|
||||
@ -1096,11 +1092,12 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/42",
|
||||
"self_ref": "#/texts/41",
|
||||
"parent": {
|
||||
"$ref": "#/groups/12"
|
||||
},
|
||||
@ -1114,11 +1111,12 @@
|
||||
"bold": true,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/43",
|
||||
"self_ref": "#/texts/42",
|
||||
"parent": {
|
||||
"$ref": "#/groups/12"
|
||||
},
|
||||
@ -1132,13 +1130,26 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/43",
|
||||
"parent": {
|
||||
"$ref": "#/groups/11"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/44",
|
||||
"parent": {
|
||||
"$ref": "#/groups/11"
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
@ -1173,18 +1184,6 @@
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/47",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/48",
|
||||
"parent": {
|
||||
"$ref": "#/groups/13"
|
||||
},
|
||||
@ -1198,11 +1197,12 @@
|
||||
"bold": true,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/49",
|
||||
"self_ref": "#/texts/48",
|
||||
"parent": {
|
||||
"$ref": "#/groups/13"
|
||||
},
|
||||
@ -1214,7 +1214,7 @@
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/50",
|
||||
"self_ref": "#/texts/49",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
@ -1226,7 +1226,7 @@
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/51",
|
||||
"self_ref": "#/texts/50",
|
||||
"parent": {
|
||||
"$ref": "#/groups/14"
|
||||
},
|
||||
@ -1240,11 +1240,12 @@
|
||||
"bold": true,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/52",
|
||||
"self_ref": "#/texts/51",
|
||||
"parent": {
|
||||
"$ref": "#/groups/14"
|
||||
},
|
||||
@ -1258,9 +1259,22 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/52",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/53",
|
||||
"parent": {
|
||||
@ -1275,18 +1289,6 @@
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/54",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/55",
|
||||
"parent": {
|
||||
"$ref": "#/groups/15"
|
||||
},
|
||||
@ -1300,11 +1302,12 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/56",
|
||||
"self_ref": "#/texts/55",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
@ -1316,7 +1319,7 @@
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/57",
|
||||
"self_ref": "#/texts/56",
|
||||
"parent": {
|
||||
"$ref": "#/groups/16"
|
||||
},
|
||||
@ -1330,9 +1333,22 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/57",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/58",
|
||||
"parent": {
|
||||
@ -1347,18 +1363,6 @@
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/59",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/60",
|
||||
"parent": {
|
||||
"$ref": "#/groups/17"
|
||||
},
|
||||
@ -1372,11 +1376,12 @@
|
||||
"bold": true,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/61",
|
||||
"self_ref": "#/texts/60",
|
||||
"parent": {
|
||||
"$ref": "#/groups/17"
|
||||
},
|
||||
@ -1388,7 +1393,7 @@
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/62",
|
||||
"self_ref": "#/texts/61",
|
||||
"parent": {
|
||||
"$ref": "#/groups/17"
|
||||
},
|
||||
@ -1402,11 +1407,12 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/63",
|
||||
"self_ref": "#/texts/62",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
@ -1418,7 +1424,7 @@
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/64",
|
||||
"self_ref": "#/texts/63",
|
||||
"parent": {
|
||||
"$ref": "#/groups/18"
|
||||
},
|
||||
@ -1432,9 +1438,22 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/64",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/65",
|
||||
"parent": {
|
||||
@ -1458,18 +1477,6 @@
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/67",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
}
|
||||
],
|
||||
"pictures": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "unit_test_01",
|
||||
"origin": {
|
||||
"mimetype": "text/html",
|
||||
|
@ -17,14 +17,16 @@ item-0 at level 0: unspecified: group _root_
|
||||
item-16 at level 2: list_item: Italic bullet 1
|
||||
item-17 at level 2: list_item: Bold bullet 2
|
||||
item-18 at level 2: list_item: Underline bullet 3
|
||||
item-19 at level 2: inline: group group
|
||||
item-20 at level 3: list_item: Some
|
||||
item-21 at level 3: list_item: italic
|
||||
item-22 at level 3: list_item: bold
|
||||
item-23 at level 3: list_item: underline
|
||||
item-24 at level 2: list: group list
|
||||
item-25 at level 3: inline: group group
|
||||
item-26 at level 4: list_item: Nested
|
||||
item-27 at level 4: list_item: italic
|
||||
item-28 at level 4: list_item: bold
|
||||
item-29 at level 1: paragraph:
|
||||
item-19 at level 2: list_item:
|
||||
item-20 at level 3: inline: group group
|
||||
item-21 at level 4: text: Some
|
||||
item-22 at level 4: text: italic
|
||||
item-23 at level 4: text: bold
|
||||
item-24 at level 4: text: underline
|
||||
item-25 at level 2: list: group list
|
||||
item-26 at level 3: list_item:
|
||||
item-27 at level 4: inline: group group
|
||||
item-28 at level 5: text: Nested
|
||||
item-29 at level 5: text: italic
|
||||
item-30 at level 5: text: bold
|
||||
item-31 at level 1: paragraph:
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "unit_test_formatting",
|
||||
"origin": {
|
||||
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
@ -42,7 +42,7 @@
|
||||
"$ref": "#/groups/1"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/23"
|
||||
"$ref": "#/texts/25"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -98,7 +98,7 @@
|
||||
"$ref": "#/texts/15"
|
||||
},
|
||||
{
|
||||
"$ref": "#/groups/2"
|
||||
"$ref": "#/texts/16"
|
||||
},
|
||||
{
|
||||
"$ref": "#/groups/3"
|
||||
@ -111,12 +111,9 @@
|
||||
{
|
||||
"self_ref": "#/groups/2",
|
||||
"parent": {
|
||||
"$ref": "#/groups/1"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/16"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/17"
|
||||
},
|
||||
@ -125,6 +122,9 @@
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/19"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/20"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -138,7 +138,7 @@
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/groups/4"
|
||||
"$ref": "#/texts/21"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -148,17 +148,17 @@
|
||||
{
|
||||
"self_ref": "#/groups/4",
|
||||
"parent": {
|
||||
"$ref": "#/groups/3"
|
||||
"$ref": "#/texts/21"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/20"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/21"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/22"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/23"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/24"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
@ -182,7 +182,8 @@
|
||||
"bold": false,
|
||||
"italic": true,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -200,7 +201,8 @@
|
||||
"bold": true,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -218,7 +220,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": true,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -236,7 +239,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"hyperlink": "https:/github.com/DS4SD/docling"
|
||||
},
|
||||
@ -255,7 +259,8 @@
|
||||
"bold": true,
|
||||
"italic": true,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"hyperlink": "https:/github.com/DS4SD/docling"
|
||||
},
|
||||
@ -274,7 +279,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -292,7 +298,8 @@
|
||||
"bold": false,
|
||||
"italic": true,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -310,7 +317,8 @@
|
||||
"bold": true,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -328,7 +336,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": true,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -346,7 +355,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -364,7 +374,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"hyperlink": "https:/github.com/DS4SD/docling"
|
||||
},
|
||||
@ -383,7 +394,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -413,7 +425,8 @@
|
||||
"bold": false,
|
||||
"italic": true,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -433,7 +446,8 @@
|
||||
"bold": true,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -453,7 +467,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": true,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -461,20 +476,18 @@
|
||||
{
|
||||
"self_ref": "#/texts/16",
|
||||
"parent": {
|
||||
"$ref": "#/groups/2"
|
||||
"$ref": "#/groups/1"
|
||||
},
|
||||
"children": [],
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/groups/2"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"prov": [],
|
||||
"orig": "Some",
|
||||
"text": "Some",
|
||||
"formatting": {
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
},
|
||||
"orig": "",
|
||||
"text": "",
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
@ -485,18 +498,17 @@
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"label": "text",
|
||||
"prov": [],
|
||||
"orig": "italic",
|
||||
"text": "italic",
|
||||
"orig": "Some",
|
||||
"text": "Some",
|
||||
"formatting": {
|
||||
"bold": false,
|
||||
"italic": true,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/18",
|
||||
@ -505,18 +517,17 @@
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"label": "text",
|
||||
"prov": [],
|
||||
"orig": "bold",
|
||||
"text": "bold",
|
||||
"orig": "italic",
|
||||
"text": "italic",
|
||||
"formatting": {
|
||||
"bold": true,
|
||||
"italic": false,
|
||||
"bold": false,
|
||||
"italic": true,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/19",
|
||||
@ -525,7 +536,26 @@
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"label": "text",
|
||||
"prov": [],
|
||||
"orig": "bold",
|
||||
"text": "bold",
|
||||
"formatting": {
|
||||
"bold": true,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/20",
|
||||
"parent": {
|
||||
"$ref": "#/groups/2"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "text",
|
||||
"prov": [],
|
||||
"orig": "underline",
|
||||
"text": "underline",
|
||||
@ -533,48 +563,25 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": true,
|
||||
"strikethrough": false
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/20",
|
||||
"parent": {
|
||||
"$ref": "#/groups/4"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"prov": [],
|
||||
"orig": "Nested",
|
||||
"text": "Nested",
|
||||
"formatting": {
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/21",
|
||||
"parent": {
|
||||
"$ref": "#/groups/4"
|
||||
"$ref": "#/groups/3"
|
||||
},
|
||||
"children": [],
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/groups/4"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"prov": [],
|
||||
"orig": "italic",
|
||||
"text": "italic",
|
||||
"formatting": {
|
||||
"bold": false,
|
||||
"italic": true,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
},
|
||||
"orig": "",
|
||||
"text": "",
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
@ -585,7 +592,45 @@
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"label": "text",
|
||||
"prov": [],
|
||||
"orig": "Nested",
|
||||
"text": "Nested",
|
||||
"formatting": {
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/23",
|
||||
"parent": {
|
||||
"$ref": "#/groups/4"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "text",
|
||||
"prov": [],
|
||||
"orig": "italic",
|
||||
"text": "italic",
|
||||
"formatting": {
|
||||
"bold": false,
|
||||
"italic": true,
|
||||
"underline": false,
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/24",
|
||||
"parent": {
|
||||
"$ref": "#/groups/4"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "text",
|
||||
"prov": [],
|
||||
"orig": "bold",
|
||||
"text": "bold",
|
||||
@ -593,13 +638,12 @@
|
||||
"bold": true,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/23",
|
||||
"self_ref": "#/texts/25",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "unit_test_headers",
|
||||
"origin": {
|
||||
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
@ -138,7 +138,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -168,7 +169,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -239,7 +241,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -269,7 +272,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -343,7 +347,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -373,7 +378,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -447,7 +453,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -477,7 +484,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -566,7 +574,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -596,7 +605,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -667,7 +677,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -697,7 +708,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -771,7 +783,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -801,7 +814,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "unit_test_headers_numbered",
|
||||
"origin": {
|
||||
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
@ -214,7 +214,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -244,7 +245,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -315,7 +317,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -345,7 +348,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -419,7 +423,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -449,7 +454,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -523,7 +529,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -553,7 +560,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -620,7 +628,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -650,7 +659,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -721,7 +731,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -751,7 +762,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -825,7 +837,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -855,7 +868,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "unit_test_lists",
|
||||
"origin": {
|
||||
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
@ -370,7 +370,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -400,7 +401,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -450,7 +452,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -470,7 +473,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -490,7 +494,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -542,7 +547,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -562,7 +568,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -582,7 +589,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -634,7 +642,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -654,7 +663,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -674,7 +684,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -694,7 +705,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -714,7 +726,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -734,7 +747,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -786,7 +800,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -806,7 +821,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -826,7 +842,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -878,7 +895,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -898,7 +916,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -918,7 +937,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -938,7 +958,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -996,7 +1017,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -1016,7 +1038,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -1036,7 +1059,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -1056,7 +1080,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -1076,7 +1101,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -1096,7 +1122,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "wiki_duck",
|
||||
"origin": {
|
||||
"mimetype": "text/html",
|
||||
@ -8489,7 +8489,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/1",
|
||||
@ -8648,7 +8649,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
16
tests/data/groundtruth/docling_v2/word_image_anchors.docx.itxt
vendored
Normal file
@ -0,0 +1,16 @@
|
||||
item-0 at level 0: unspecified: group _root_
|
||||
item-1 at level 1: paragraph: Transcript
|
||||
item-2 at level 1: paragraph: February 20, 2025, 8:32PM
|
||||
item-3 at level 1: picture
|
||||
item-4 at level 1: inline: group group
|
||||
item-5 at level 2: paragraph: This is test 1
|
||||
item-6 at level 2: paragraph: 0:08
|
||||
Correct, he is not.
|
||||
item-7 at level 1: paragraph:
|
||||
item-8 at level 1: picture
|
||||
item-9 at level 1: inline: group group
|
||||
item-10 at level 2: paragraph: This is test 2
|
||||
item-11 at level 2: paragraph: 0:16
|
||||
Yeah, exactly.
|
||||
item-12 at level 1: paragraph:
|
||||
item-13 at level 1: paragraph:
|
292
tests/data/groundtruth/docling_v2/word_image_anchors.docx.json
vendored
Normal file
@ -0,0 +1,292 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.4.0",
|
||||
"name": "word_image_anchors",
|
||||
"origin": {
|
||||
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
"binary_hash": 2428692234257307633,
|
||||
"filename": "word_image_anchors.docx"
|
||||
},
|
||||
"furniture": {
|
||||
"self_ref": "#/furniture",
|
||||
"children": [],
|
||||
"content_layer": "furniture",
|
||||
"name": "_root_",
|
||||
"label": "unspecified"
|
||||
},
|
||||
"body": {
|
||||
"self_ref": "#/body",
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/0"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/1"
|
||||
},
|
||||
{
|
||||
"$ref": "#/pictures/0"
|
||||
},
|
||||
{
|
||||
"$ref": "#/groups/0"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/4"
|
||||
},
|
||||
{
|
||||
"$ref": "#/pictures/1"
|
||||
},
|
||||
{
|
||||
"$ref": "#/groups/1"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/7"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/8"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"name": "_root_",
|
||||
"label": "unspecified"
|
||||
},
|
||||
"groups": [
|
||||
{
|
||||
"self_ref": "#/groups/0",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/2"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/3"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"name": "group",
|
||||
"label": "inline"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/groups/1",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/5"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/6"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"name": "group",
|
||||
"label": "inline"
|
||||
}
|
||||
],
|
||||
"texts": [
|
||||
{
|
||||
"self_ref": "#/texts/0",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "Transcript",
|
||||
"text": "Transcript",
|
||||
"formatting": {
|
||||
"bold": true,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/1",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "February 20, 2025, 8:32PM",
|
||||
"text": "February 20, 2025, 8:32PM",
|
||||
"formatting": {
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/2",
|
||||
"parent": {
|
||||
"$ref": "#/groups/0"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "This is test 1",
|
||||
"text": "This is test 1",
|
||||
"formatting": {
|
||||
"bold": true,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/3",
|
||||
"parent": {
|
||||
"$ref": "#/groups/0"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "0:08\nCorrect, he is not.",
|
||||
"text": "0:08\nCorrect, he is not.",
|
||||
"formatting": {
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/4",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/5",
|
||||
"parent": {
|
||||
"$ref": "#/groups/1"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "This is test 2",
|
||||
"text": "This is test 2",
|
||||
"formatting": {
|
||||
"bold": true,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/6",
|
||||
"parent": {
|
||||
"$ref": "#/groups/1"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "0:16\nYeah, exactly.",
|
||||
"text": "0:16\nYeah, exactly.",
|
||||
"formatting": {
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/7",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/8",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
}
|
||||
],
|
||||
"pictures": [
|
||||
{
|
||||
"self_ref": "#/pictures/0",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "picture",
|
||||
"prov": [],
|
||||
"captions": [],
|
||||
"references": [],
|
||||
"footnotes": [],
|
||||
"image": {
|
||||
"mimetype": "image/png",
|
||||
"dpi": 72,
|
||||
"size": {
|
||||
"width": 100.0,
|
||||
"height": 100.0
|
||||
},
|
||||
"uri": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAGQAAABkCAYAAABw4pVUAAAAz0lEQVR4nO3bUW0CURRF0TukQvDSauBr0mACE1VBAzYQg5Lpdw0wO2EtA+cl+/6+GQAAAAAAAAAAAADe1DIR53X9mcNcdhnf5nm93Y8T8DElyzyuv/evlx/CMqeJOOz9AP4TJEaQGEFiBIkRJEaQGEFiBIkRJEaQGEFiBIkRJEaQGEFiBIkRJEaQGEFiBIkRJEaQGEFiBIkRJEaQGEFiBIkRJEaQGEFiBIkRJEaQGEFiWp8+t/k8f6/bDrvPl28CAAAAAAAAAAAAAAAAzLv5A5bTEG2TIIlOAAAAAElFTkSuQmCC"
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/pictures/1",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "picture",
|
||||
"prov": [],
|
||||
"captions": [],
|
||||
"references": [],
|
||||
"footnotes": [],
|
||||
"image": {
|
||||
"mimetype": "image/png",
|
||||
"dpi": 72,
|
||||
"size": {
|
||||
"width": 100.0,
|
||||
"height": 100.0
|
||||
},
|
||||
"uri": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAGQAAABkCAYAAABw4pVUAAAJIElEQVR4nO2dbWxb1RnH/8+1c5O4bITEwJrRF1ZAI6gtL9oK29oxihAdTQOVoGhbKyS0MDWZJk1CQ+q0aR/4xLYvJNGabdK07MukrSUNaxEvg7aUlteuLUoHrUTbseylSRSgpLGd3Ac9596kSWzHvva1fXzv/UmW4jaxj5+/z73nPOec50/QnM3t5xdbUWOlZeBGgK8jNpYC3AxQHOAGEMXAXKN+mSgF5nGAxgAeBmiIyToH0GnDwklj0jqxq/fK/0BjCJrR2jn8ZcPCXSBaC9DtAC/39h3oDMBHwHzQMvD3ga74P6ERWgjS1jG8BjAeALgVQEuZ334QoAHA2t3fHX8dQRWktX0obpi1jzDjewSshgYwcIwIf7KSiT8M9DYPB0KQts7RlWDuANCuSw/NAAPoBVF3f1fjCZQRKq8QeBzgragqqA+Ep8olDJXj0kSm+XNi6kQVw8RdnEz+otSXspIK0rZ9eDuIngTQAH8wBuYd/T3xnqoSRIauERi/ZuYN8CFEtG8K1o9LMWT2XJBN20e+TwZ1gdmEnyFKssWde3qafuvpy3r5Ym0dI78B8BiCxc7+7qYfaCXIxvbRpZEa7gOwDsHkwFSKtj7b23iu4oLYs2z6M4BlCDZnAd5S7Gy/KEHu3z5yDxN2AVhUzOv4iE+JsfmZnqbnyy7Iph+O3kcWD2g8264UzAa17nm68W+F/DEV0TOeC8XIChPj3kJ6ChV4z3gpvEzl5FOA17u9pxhuR1PODTwUIzeLJFZ2zEokiDO0Dfpoyg3LnJh5L4gz6QvqPKMY1jmx804QSYcEcAbuJY85MSz+pq7WuGEc831uqtQQJS1Yq3MlJHP2EMnahmJ4ALOpYpkDI9d6hl9T6JVAYqnWiAq5ZKlNCDW1p3y0uKQLY1YqcX22lcesPUSWXUMxSkKDE9v8e4izM+R4adoToiBalWnjROYeonaHhJSULDFO6yFh76hsL0nvIfYmtpBykCHWlGFk9X8d0uqrbqjBj7YtQlODq3QbLAtIphgffcL44N+TeO1oEgfeSkJT2Eolrpo94orO/l/ZawuuvBjFYBhAXS2px9VxE2tWmdjWZmHvgQnsemECmkEq5sAvp/9hztdPNj7DZxAB8SsMfLc1hscfvQz1dXp93+bH3Ji98KTLLvRSEDGAO1abaH8wBp2QmNuLfmk9RM5n+BvDAL6y0sTa23RLzV2K/ax7iDosUzUcOprEmydS6udoBFixJIovLYng2msiMGuyX5YW1RNuXBHFwbd1utGr2D8xI4ik2MFlP7lUFBcnGK+8kZh5/uJh+2e5ibc/FMs6OpN7yjVXR6AZLaKBpOZVq9WZPp/w+vEkXjqcwOQUqoppDeyvkTpg6R+GzltIpeQQVBXhaOD0azntGgyYgQ//p2P3sTUw5By490ePK8u1X4zANDPf2D+6YOGtd+3BgF7wctHCkEP58BG3ttTgG7eZat6RKa0iYrwzqKMggGgRlQoJVGWX20xcvyyKDetqcfsqE7F6yiiGCPH7v45DV0SLqJSr0CCX6Jq776hVj3yQZKMMi/v2XFTDZX3h66JO7RD4kQvjjKMnU3j2lQm898EkdEe0kB7SDJ9yWYzwtVtMLGuOqEnkvoMJ3XtIs2FX1fEvEQNYujiCrZti+NVPPq9m8vpCcUOVOAoAREDzlRGVVpGRmJ5wQ9SpN4Vq49Cs5KJQZwI3LJcEYxRLFkcyDnsFyXFta4vh/OgF/Ou/mk0QiWLRmeJfVcbFeclF4blX7ecy+vrOxno0Xp5ZlSVfiOBba2rxx37NhsDMNe4WrKuEFw8nsHd/Qg13MxGJAC0r5qxea4OhyuL5kGPvpTD2cfZLsfQemUxqBVHKsGsU+o9TZyeRWCDjK72kvlazCTHzuAx7x+BDWlZEEVtgQ8PUFHAxodtghsZk2F
uRUnal5tabanD557LfIkUM6UV6wcPSQ4bgMzbeWYcNa+vUWnsmZJR/bkizIa+ChqJS15ZYs2tpHsj+qju/eim5KMGXeYg8FpqHCOMTjBOn9BvLiBZRKTKMKuTrt5jq4RbpHYOnJ/H8oblzGD2g04ZUfEaAODs0pd+E0EG0iEr57Sl/zg/Tesbpc5P43V/G9UuZOIgWhl0LXcpv+5ePLzAGXp7Az57+RON1ETojWjhTVT4CwDcbHZIpVjfuMx9O4cjxJPa/mdR8HWRGA2crKfNBED0MjTj+fgqP/tSXc9bMiAbT+7LEJaDS7Qk6lqOBEsQp9zBY6UYFmMHpkhuzhldi2RBSGS7FfpYg1u4KtSYEl2I/J2eyqWPkH34+RaUj4lmyp7vp5unnc2aEYmZSkVYFGJoX8zmCiLOMY2YSUh7YiXlmQZzz0r1lakwI0Du/KlB6Eouou5wtCjSUHus0QezaG+SqkmZIIVBf/tWACE8V9B4h+ZMlxhkFEeXEc8nFy4e4QGKbzWQs60KIGGApz6UQrxlzYgtXgqi7P/MOz5sTdJh3LOT0lnN3w/2do3vDyqTemYk909X47YV+J+farbiRSRFgj9oUXIiSKpY5yCmIpIXFjcyzhgUUtrgzH5u9vHY3ONZwOz1pWTDZma+9nqsdcm0dI/tDhwTXHOjvbvpmvr/sav+PWMPZbmQheXLWiRlKIojt08dbbDufkDwsj7a49TZ0vUNOPJXEGi5M0+c0BdtciKdhQVsWxX1MrOFCUbLb5hXqZVjwHlLx6RNruPDylWYseW+hHoZCUZt67W8Brw9v9DPWq+uLcfkUQnNiv5kTzya07y4eT88hSMOY0R6I3BdRUj6rl2IInh8MkRSBuJFJZhM+hYj2yWfMNx3i6rVRQpQBFtGTPrJOGpP1jP6eeE+p3qCkR6ek4WKA5YflYCbuks9SSjGEsh2/tZ17xOaHXeV2Kg/1yYaEbGvgnr8byoxjqSTOMu06GMdkQTIQvbJvqlxCTFOxgCg3H7P2EfHP0GWDNwPHZK+tbO9caN27lGjxDbX9M8SyQbkElNsUYNA+n2HtLiQZ6EtB0syQLdxl10KX8tteV92WE8d8RM70yTGyfJZVAy0I5iHlt6XisxQZlrq2TlnbZrt4Jzc4JQrtqnhS+0uVm5IKR1JUh4akXIWqkGDhpJwDt4+B68tnvr6L5zB8YjIAAAAASUVORK5CYII="
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"tables": [],
|
||||
"key_value_items": [],
|
||||
"form_items": [],
|
||||
"pages": {}
|
||||
}
|
13
tests/data/groundtruth/docling_v2/word_image_anchors.docx.md
vendored
Normal file
@ -0,0 +1,13 @@
|
||||
**Transcript**
|
||||
|
||||
February 20, 2025, 8:32PM
|
||||
|
||||
<!-- image -->
|
||||
|
||||
**This is test 1** 0:08
|
||||
Correct, he is not.
|
||||
|
||||
<!-- image -->
|
||||
|
||||
**This is test 2** 0:16
|
||||
Yeah, exactly.
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "word_sample",
|
||||
"origin": {
|
||||
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
@ -106,7 +106,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -149,7 +150,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -167,7 +169,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -217,7 +220,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -235,7 +239,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -255,7 +260,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -275,7 +281,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -295,7 +302,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -313,7 +321,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -333,7 +342,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -353,7 +363,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -373,7 +384,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -426,7 +438,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -444,7 +457,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -462,7 +476,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -492,7 +507,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -510,7 +526,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -530,7 +547,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -550,7 +568,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
},
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
@ -897,7 +916,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "word_tables",
|
||||
"origin": {
|
||||
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
@ -119,7 +119,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -149,7 +150,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -179,7 +181,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -209,7 +212,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -239,7 +243,8 @@
|
||||
"bold": false,
|
||||
"italic": false,
|
||||
"underline": false,
|
||||
"strikethrough": false
|
||||
"strikethrough": false,
|
||||
"script": "baseline"
|
||||
}
|
||||
},
|
||||
{
|
||||
@ -510,7 +515,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/1",
|
||||
@ -729,7 +735,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/2",
|
||||
@ -1020,7 +1027,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/3",
|
||||
@ -1387,7 +1395,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
},
|
||||
{
|
||||
"self_ref": "#/tables/4",
|
||||
@ -2398,7 +2407,8 @@
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
},
|
||||
"annotations": []
|
||||
}
|
||||
],
|
||||
"key_value_items": [],
|
||||
|
12
tests/data/md/inline_and_formatting.md
vendored
@ -11,12 +11,22 @@ Create your feature branch: `git checkout -b feature/AmazingFeature`.
|
||||
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
|
||||
4. Push to the branch (`git push origin feature/AmazingFeature`)
|
||||
5. Open a Pull Request
|
||||
6. **Whole list item has same formatting**
|
||||
7. List item has *mixed or partial* formatting
|
||||
|
||||
## *Second* section <!-- inline groups in headings not yet supported by serializers -->
|
||||
# *Whole heading is italic*
|
||||
|
||||
<<<<<<< HEAD
|
||||
- **First**: Lorem ipsum.
|
||||
- **Second**: Dolor `sit` amet.
|
||||
|
||||
| **Bold Heading** | *Italic Heading* |
|
||||
|------------------|------------------|
|
||||
| data a | data b |
|
||||
=======
|
||||
Some *`formatted_code`*
|
||||
|
||||
## *Partially formatted* heading to_escape `not_to_escape`
|
||||
|
||||
[$$E=mc^2$$](https://en.wikipedia.org/wiki/Albert_Einstein)
|
||||
>>>>>>> origin/main
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "webp-test",
|
||||
"origin": {
|
||||
"mimetype": "application/pdf",
|
||||
|
@ -1,6 +1,6 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.3.0",
|
||||
"version": "1.4.0",
|
||||
"name": "ocr_test",
|
||||
"origin": {
|
||||
"mimetype": "application/pdf",
|
||||
|
Some files were not shown because too many files have changed in this diff