Merge remote-tracking branch 'origin/main' into md_table_improvements

Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
Michael Honaker 2025-06-25 12:09:38 -04:00
commit 1f47908bc7
106 changed files with 4226 additions and 2401 deletions

@ -22,8 +22,8 @@ jobs:
python-version: ['3.9', '3.10', '3.11', '3.12', '3.13']
steps:
- uses: actions/checkout@v4
- name: Install tesseract
run: sudo apt-get update && sudo apt-get install -y tesseract-ocr tesseract-ocr-eng tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-spa tesseract-ocr-script-latn libleptonica-dev libtesseract-dev pkg-config
- name: Install tesseract and ffmpeg
run: sudo apt-get update && sudo apt-get install -y ffmpeg tesseract-ocr tesseract-ocr-eng tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-spa tesseract-ocr-script-latn libleptonica-dev libtesseract-dev pkg-config
- name: Set TESSDATA_PREFIX
run: |
echo "TESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$)" >> "$GITHUB_ENV"
@ -60,7 +60,7 @@ jobs:
run: |
for file in docs/examples/*.py; do
# Skip the examples listed in the pattern below
if [[ "$(basename "$file")" =~ ^(batch_convert|compare_vlm_models|minimal|minimal_vlm_pipeline|export_multimodal|custom_convert|develop_picture_enrichment|rapidocr_with_custom_models|offline_convert|pictures_description|pictures_description_api|vlm_pipeline_api_model).py ]]; then
if [[ "$(basename "$file")" =~ ^(batch_convert|compare_vlm_models|minimal|minimal_vlm_pipeline|minimal_asr_pipeline|export_multimodal|custom_convert|develop_picture_enrichment|rapidocr_with_custom_models|offline_convert|pictures_description|pictures_description_api|vlm_pipeline_api_model).py ]]; then
echo "Skipping $file"
continue
fi

@ -1,3 +1,23 @@
## [v2.38.0](https://github.com/docling-project/docling/releases/tag/v2.38.0) - 2025-06-23
### Feature
* Support audio input ([#1763](https://github.com/docling-project/docling/issues/1763)) ([`1557e7c`](https://github.com/docling-project/docling/commit/1557e7ce3e036fb51eb118296f5cbff3b6dfbfa7))
* **markdown:** Add formatting & improve inline support ([#1804](https://github.com/docling-project/docling/issues/1804)) ([`861abcd`](https://github.com/docling-project/docling/commit/861abcdcb0d406342b9566f81203b87cf32b7ad0))
* Maximum image size for Vlm models ([#1802](https://github.com/docling-project/docling/issues/1802)) ([`215b540`](https://github.com/docling-project/docling/commit/215b540f6c078a72464310ef22975ebb6cde4f0a))
### Fix
* **docx:** Ensure list items have a list parent ([#1827](https://github.com/docling-project/docling/issues/1827)) ([`d26dac6`](https://github.com/docling-project/docling/commit/d26dac61a86b0af5b16686f78956ba047bcbddba))
* **msword_backend:** Identify text in the same line after an image #1425 ([#1610](https://github.com/docling-project/docling/issues/1610)) ([`1350a8d`](https://github.com/docling-project/docling/commit/1350a8d3e5ea3c4b4d506757758880c8f78efd8c))
* Ensure uninitialized pages are removed before assembling document ([#1812](https://github.com/docling-project/docling/issues/1812)) ([`dd7f64f`](https://github.com/docling-project/docling/commit/dd7f64ff28226cd9964fc4d8ba807b2c8a6358ef))
* Formula conversion with page_range param set ([#1791](https://github.com/docling-project/docling/issues/1791)) ([`dbab30e`](https://github.com/docling-project/docling/commit/dbab30e92cc1d130ce7f9335ab9c46aa7a30930d))
### Documentation
* Update readme and add ASR example ([#1836](https://github.com/docling-project/docling/issues/1836)) ([`f3ae302`](https://github.com/docling-project/docling/commit/f3ae3029b8a6d6f0109383fbc82ebf9da3942afd))
* Support running examples from root or subfolder ([#1816](https://github.com/docling-project/docling/issues/1816)) ([`64ac043`](https://github.com/docling-project/docling/commit/64ac043786efdece0c61827051a5b41dddf6c5d7))
## [v2.37.0](https://github.com/docling-project/docling/releases/tag/v2.37.0) - 2025-06-16
### Feature

@ -28,14 +28,15 @@ Docling simplifies document processing, parsing diverse formats — including ad
## Features
* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, images (PNG, TIFF, JPEG, ...), and more
* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, [DocTags](https://arxiv.org/abs/2503.11576) and lossless JSON
* 🔒 Local execution capabilities for sensitive data and air-gapped environments
* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 Extensive OCR support for scanned PDFs and images
* 🥚 Support of several Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview))
* 👓 Support of several Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview))
* 🎙️ Support for Audio with Automatic Speech Recognition (ASR) models
* 💻 Simple and convenient CLI
### Coming soon

@ -2,9 +2,10 @@ import logging
import re
import warnings
from copy import deepcopy
from enum import Enum
from io import BytesIO
from pathlib import Path
from typing import List, Optional, Set, Union
from typing import List, Literal, Optional, Set, Union
import marko
import marko.element
@ -21,7 +22,8 @@ from docling_core.types.doc import (
)
from docling_core.types.doc.document import Formatting, OrderedList, UnorderedList
from marko import Markdown
from pydantic import AnyUrl, TypeAdapter
from pydantic import AnyUrl, BaseModel, Field, TypeAdapter
from typing_extensions import Annotated
from docling.backend.abstract_backend import DeclarativeDocumentBackend
from docling.backend.html_backend import HTMLDocumentBackend
@ -35,6 +37,31 @@ _START_MARKER = f"#_#_{_MARKER_BODY}_START_#_#"
_STOP_MARKER = f"#_#_{_MARKER_BODY}_STOP_#_#"
class _PendingCreationType(str, Enum):
"""Kinds of document items whose creation can be deferred."""
HEADING = "heading"
LIST_ITEM = "list_item"
class _HeadingCreationPayload(BaseModel):
kind: Literal["heading"] = "heading"
level: int
class _ListItemCreationPayload(BaseModel):
kind: Literal["list_item"] = "list_item"
_CreationPayload = Annotated[
Union[
_HeadingCreationPayload,
_ListItemCreationPayload,
],
Field(discriminator="kind"),
]
class MarkdownDocumentBackend(DeclarativeDocumentBackend):
def _shorten_underscore_sequences(self, markdown_text: str, max_length: int = 10):
# This regex will match any sequence of underscores
@ -155,6 +182,52 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
doc.add_table(data=table_data)
return
def _create_list_item(
self,
doc: DoclingDocument,
parent_item: Optional[NodeItem],
text: str,
formatting: Optional[Formatting] = None,
hyperlink: Optional[Union[AnyUrl, Path]] = None,
):
if not isinstance(parent_item, (OrderedList, UnorderedList)):
_log.warning("ListItem would not have had a list parent, adding one.")
parent_item = doc.add_unordered_list(parent=parent_item)
item = doc.add_list_item(
text=text,
enumerated=(isinstance(parent_item, OrderedList)),
parent=parent_item,
formatting=formatting,
hyperlink=hyperlink,
)
return item
def _create_heading_item(
self,
doc: DoclingDocument,
parent_item: Optional[NodeItem],
text: str,
level: int,
formatting: Optional[Formatting] = None,
hyperlink: Optional[Union[AnyUrl, Path]] = None,
):
if level == 1:
item = doc.add_title(
text=text,
parent=parent_item,
formatting=formatting,
hyperlink=hyperlink,
)
else:
item = doc.add_heading(
text=text,
level=level - 1,
parent=parent_item,
formatting=formatting,
hyperlink=hyperlink,
)
return item
def _iterate_elements( # noqa: C901
self,
*,
@ -162,6 +235,9 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
depth: int,
doc: DoclingDocument,
visited: Set[marko.element.Element],
creation_stack: list[
_CreationPayload
], # stack for lazy item creation triggered deep in marko's AST (on RawText)
parent_item: Optional[NodeItem] = None,
formatting: Optional[Formatting] = None,
hyperlink: Optional[Union[AnyUrl, Path]] = None,
@ -177,28 +253,17 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
f" - Heading level {element.level}, content: {element.children[0].children}" # type: ignore
)
if len(element.children) == 1:
child = element.children[0]
snippet_text = str(child.children) # type: ignore
visited.add(child)
else:
snippet_text = "" # inline group will be created
if element.level == 1:
parent_item = doc.add_title(
text=snippet_text,
parent=parent_item,
if len(element.children) > 1: # inline group will be created further down
parent_item = self._create_heading_item(
doc=doc,
parent_item=parent_item,
text="",
level=element.level,
formatting=formatting,
hyperlink=hyperlink,
)
else:
parent_item = doc.add_heading(
text=snippet_text,
level=element.level - 1,
parent=parent_item,
formatting=formatting,
hyperlink=hyperlink,
)
creation_stack.append(_HeadingCreationPayload(level=element.level))
elif isinstance(element, marko.block.List):
has_non_empty_list_items = False
@ -224,22 +289,16 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
self._close_table(doc)
_log.debug(" - List item")
if len(child.children) == 1:
snippet_text = str(child.children[0].children) # type: ignore
visited.add(child)
else:
snippet_text = "" # inline group will be created
is_numbered = isinstance(parent_item, OrderedList)
if not isinstance(parent_item, (OrderedList, UnorderedList)):
_log.warning("ListItem would have not had a list parent, adding one.")
parent_item = doc.add_unordered_list(parent=parent_item)
parent_item = doc.add_list_item(
enumerated=is_numbered,
parent=parent_item,
text=snippet_text,
if len(child.children) > 1: # inline group will be created further down
parent_item = self._create_list_item(
doc=doc,
parent_item=parent_item,
text="",
formatting=formatting,
hyperlink=hyperlink,
)
else:
creation_stack.append(_ListItemCreationPayload())
elif isinstance(element, marko.inline.Image):
self._close_table(doc)
@ -285,6 +344,31 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
self.md_table_buffer.append(snippet_text)
elif snippet_text:
self._close_table(doc)
if creation_stack:
while len(creation_stack) > 0:
to_create = creation_stack.pop()
if isinstance(to_create, _ListItemCreationPayload):
parent_item = self._create_list_item(
doc=doc,
parent_item=parent_item,
text=snippet_text,
formatting=formatting,
hyperlink=hyperlink,
)
elif isinstance(to_create, _HeadingCreationPayload):
# The heading is not kept as parent_item: the logic for tracking
# that correctly is not implemented yet (marko does not capture
# section components as heading children)
self._create_heading_item(
doc=doc,
parent_item=parent_item,
text=snippet_text,
level=to_create.level,
formatting=formatting,
hyperlink=hyperlink,
)
else:
doc.add_text(
label=DocItemLabel.TEXT,
parent=parent_item,
@ -353,7 +437,6 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
parent_item = doc.add_inline_group(parent=parent_item)
processed_block_types = (
# marko.block.Heading,
marko.block.CodeBlock,
marko.block.FencedCode,
marko.inline.RawText,
@ -369,6 +452,7 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
depth=depth + 1,
doc=doc,
visited=visited,
creation_stack=creation_stack,
parent_item=parent_item,
formatting=formatting,
hyperlink=hyperlink,
@ -412,6 +496,7 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
doc=doc,
parent_item=None,
visited=set(),
creation_stack=[],
)
self._close_table(doc=doc) # handle any last hanging table
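
The `creation_stack` change above defers creating a heading or list item until the traversal reaches the text that belongs to it. A minimal, stdlib-only sketch of that idea follows. It is not the backend's code: `Node`, `HeadingPayload`, `ListItemPayload` and `walk` are hypothetical stand-ins for marko's AST and the pydantic payloads, and only the single-child (deferred) case is modelled, not the inline-group case.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class HeadingPayload:
    """Hypothetical stand-in for _HeadingCreationPayload."""
    level: int


@dataclass
class ListItemPayload:
    """Hypothetical stand-in for _ListItemCreationPayload."""


Payload = Union[HeadingPayload, ListItemPayload]


@dataclass
class Node:
    """Toy AST node; kind is "heading", "list_item", "text", or anything else."""
    kind: str
    text: str = ""
    level: int = 0
    children: Optional[List["Node"]] = None


def walk(node: Node, stack: List[Payload], out: List[str]) -> None:
    if node.kind == "heading":
        # Do not create the item yet: remember what has to be created.
        stack.append(HeadingPayload(level=node.level))
    elif node.kind == "list_item":
        stack.append(ListItemPayload())
    elif node.kind == "text" and node.text:
        if stack:
            # Text reached: create every pending item with this text.
            while stack:
                payload = stack.pop()
                if isinstance(payload, HeadingPayload):
                    out.append(f"heading(level={payload.level}): {node.text}")
                else:
                    out.append(f"list_item: {node.text}")
        else:
            out.append(f"text: {node.text}")
    for child in node.children or []:
        walk(child, stack, out)
```

Because the stack is passed down through the recursion, a payload pushed at a block element is still available several levels deeper, where the raw text is finally seen.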

@ -14,7 +14,7 @@ from docling_core.types.doc import (
TableCell,
TableData,
)
from docling_core.types.doc.document import Formatting
from docling_core.types.doc.document import Formatting, OrderedList, UnorderedList
from docx import Document
from docx.document import Document as DocxDocument
from docx.oxml.table import CT_Tc
@ -84,7 +84,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
self.valid = True
except Exception as e:
raise RuntimeError(
f"MsPowerpointDocumentBackend could not load document with hash {self.document_hash}"
f"MsWordDocumentBackend could not load document with hash {self.document_hash}"
) from e
@override
@ -251,9 +251,15 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
self._handle_tables(element, docx_obj, doc)
except Exception:
_log.debug("could not parse a table, broken docx table")
# Check for Image
elif drawing_blip:
self._handle_pictures(docx_obj, drawing_blip, doc)
# Check for Text after the Image
if (
tag_name in ["p"]
and element.find(".//w:t", namespaces=namespaces) is not None
):
self._handle_text_elements(element, docx_obj, doc)
# Check for the sdt containers, like table of contents
elif tag_name in ["sdt"]:
sdt_content = element.find(".//w:sdtContent", namespaces=namespaces)
@ -268,6 +274,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
self._handle_text_elements(element, docx_obj, doc)
else:
_log.debug(f"Ignoring element in DOCX with tag: {tag_name}")
return doc
def _str_to_int(
@ -390,7 +397,11 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
if isinstance(c, Hyperlink):
text = c.text
hyperlink = Path(c.address)
format = self._get_format_from_run(c.runs[0])
format = (
self._get_format_from_run(c.runs[0])
if c.runs
else None
)
elif isinstance(c, Run):
text = c.text
hyperlink = None
@ -578,7 +589,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
all_paragraphs = []
# Sort paragraphs within each container, then process containers
for container_id, paragraphs in container_paragraphs.items():
for paragraphs in container_paragraphs.values():
# Sort by vertical position within each container
sorted_container_paragraphs = sorted(
paragraphs,
@ -689,14 +700,13 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
doc: DoclingDocument,
) -> None:
paragraph = Paragraph(element, docx_obj)
paragraph_elements = self._get_paragraph_elements(paragraph)
text, equations = self._handle_equations_in_text(
element=element, text=paragraph.text
)
if text is None:
return
paragraph_elements = self._get_paragraph_elements(paragraph)
text = text.strip()
# Common styles for bullet and numbered lists.
@ -912,6 +922,44 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
)
return
def _add_formatted_list_item(
self,
doc: DoclingDocument,
elements: list,
marker: str,
enumerated: bool,
level: int,
) -> None:
# This should not happen by construction
if not isinstance(self.parents[level], (OrderedList, UnorderedList)):
return
if len(elements) == 1:
text, format, hyperlink = elements[0]
doc.add_list_item(
marker=marker,
enumerated=enumerated,
parent=self.parents[level],
text=text,
formatting=format,
hyperlink=hyperlink,
)
else:
new_item = doc.add_list_item(
marker=marker,
enumerated=enumerated,
parent=self.parents[level],
text="",
)
new_parent = doc.add_group(label=GroupLabel.INLINE, parent=new_item)
for text, format, hyperlink in elements:
doc.add_text(
label=DocItemLabel.TEXT,
parent=new_parent,
text=text,
formatting=format,
hyperlink=hyperlink,
)
def _add_list_item(
self,
*,
@ -921,6 +969,9 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
elements: list,
is_numbered: bool = False,
) -> None:
# TODO: this method is always called with is_numbered. Numbered lists should be properly addressed.
if not elements:
return None
enum_marker = ""
level = self._get_level()
@ -937,21 +988,9 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
if is_numbered:
enum_marker = str(self.listIter) + "."
is_numbered = True
new_parent = self._create_or_reuse_parent(
doc=doc,
prev_parent=self.parents[level],
paragraph_elements=elements,
self._add_formatted_list_item(
doc, elements, enum_marker, is_numbered, level
)
for text, format, hyperlink in elements:
doc.add_list_item(
marker=enum_marker,
enumerated=is_numbered,
parent=new_parent,
text=text,
formatting=format,
hyperlink=hyperlink,
)
elif (
self._prev_numid() == numid
and self.level_at_new_list is not None
@ -981,20 +1020,12 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
if is_numbered:
enum_marker = str(self.listIter) + "."
is_numbered = True
new_parent = self._create_or_reuse_parent(
doc=doc,
prev_parent=self.parents[self.level_at_new_list + ilevel],
paragraph_elements=elements,
)
for text, format, hyperlink in elements:
doc.add_list_item(
marker=enum_marker,
enumerated=is_numbered,
parent=new_parent,
text=text,
formatting=format,
hyperlink=hyperlink,
self._add_formatted_list_item(
doc,
elements,
enum_marker,
is_numbered,
self.level_at_new_list + ilevel,
)
elif (
self._prev_numid() == numid
@ -1002,7 +1033,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
and prev_indent is not None
and ilevel < prev_indent
): # Close list
for k, v in self.parents.items():
for k in self.parents:
if k > self.level_at_new_list + ilevel:
self.parents[k] = None
@ -1011,19 +1042,12 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
if is_numbered:
enum_marker = str(self.listIter) + "."
is_numbered = True
new_parent = self._create_or_reuse_parent(
doc=doc,
prev_parent=self.parents[self.level_at_new_list + ilevel],
paragraph_elements=elements,
)
for text, format, hyperlink in elements:
doc.add_list_item(
marker=enum_marker,
enumerated=is_numbered,
parent=new_parent,
text=text,
formatting=format,
hyperlink=hyperlink,
self._add_formatted_list_item(
doc,
elements,
enum_marker,
is_numbered,
self.level_at_new_list + ilevel,
)
self.listIter = 0
@ -1033,21 +1057,10 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
if is_numbered:
enum_marker = str(self.listIter) + "."
is_numbered = True
new_parent = self._create_or_reuse_parent(
doc=doc,
prev_parent=self.parents[level - 1],
paragraph_elements=elements,
)
for text, format, hyperlink in elements:
# Add the list item to the parent group
doc.add_list_item(
marker=enum_marker,
enumerated=is_numbered,
parent=new_parent,
text=text,
formatting=format,
hyperlink=hyperlink,
self._add_formatted_list_item(
doc, elements, enum_marker, is_numbered, level - 1
)
return
def _handle_tables(

@ -0,0 +1,51 @@
import logging
from io import BytesIO
from pathlib import Path
from typing import Set, Union
from docling.backend.abstract_backend import AbstractDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import InputDocument
_log = logging.getLogger(__name__)
class NoOpBackend(AbstractDocumentBackend):
"""
A no-op backend that only validates input existence.
Used e.g. for audio files where actual processing is handled by the ASR pipeline.
"""
def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
super().__init__(in_doc, path_or_stream)
_log.debug(f"NoOpBackend initialized for: {path_or_stream}")
# Validate input
try:
if isinstance(self.path_or_stream, BytesIO):
# Check if stream has content
self.valid = len(self.path_or_stream.getvalue()) > 0
_log.debug(
f"BytesIO stream length: {len(self.path_or_stream.getvalue())}"
)
elif isinstance(self.path_or_stream, Path):
# Check if file exists
self.valid = self.path_or_stream.exists()
_log.debug(f"File exists: {self.valid}")
else:
self.valid = False
except Exception as e:
_log.error(f"NoOpBackend validation failed: {e}")
self.valid = False
def is_valid(self) -> bool:
return self.valid
@classmethod
def supports_pagination(cls) -> bool:
return False
@classmethod
def supported_formats(cls) -> Set[InputFormat]:
return set(InputFormat)
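
The validation performed in `NoOpBackend.__init__` only checks that the input is present. A standalone sketch of that check, assuming nothing beyond the standard library (the function name `input_is_present` is hypothetical and not part of docling):

```python
from io import BytesIO
from pathlib import Path
from typing import Union


def input_is_present(path_or_stream: Union[BytesIO, Path]) -> bool:
    """Return True if the stream is non-empty or the file exists."""
    if isinstance(path_or_stream, BytesIO):
        # getvalue() returns the whole buffer regardless of the read position.
        return len(path_or_stream.getvalue()) > 0
    if isinstance(path_or_stream, Path):
        return path_or_stream.exists()
    # Any other input type is treated as invalid.
    return False
```

Note that this says nothing about whether the bytes are decodable audio; in the design above that is left to the ASR pipeline.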

@ -29,6 +29,15 @@ from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBacke
from docling.backend.pdf_backend import PdfDocumentBackend
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.asr_model_specs import (
WHISPER_BASE,
WHISPER_LARGE,
WHISPER_MEDIUM,
WHISPER_SMALL,
WHISPER_TINY,
WHISPER_TURBO,
AsrModelType,
)
from docling.datamodel.base_models import (
ConversionStatus,
FormatToExtensions,
@ -37,12 +46,14 @@ from docling.datamodel.base_models import (
)
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import (
AsrPipelineOptions,
EasyOcrOptions,
OcrOptions,
PaginatedPipelineOptions,
PdfBackend,
PdfPipeline,
PdfPipelineOptions,
PipelineOptions,
ProcessingPipeline,
TableFormerMode,
VlmPipelineOptions,
)
@ -54,8 +65,14 @@ from docling.datamodel.vlm_model_specs import (
SMOLDOCLING_TRANSFORMERS,
VlmModelType,
)
from docling.document_converter import DocumentConverter, FormatOption, PdfFormatOption
from docling.document_converter import (
AudioFormatOption,
DocumentConverter,
FormatOption,
PdfFormatOption,
)
from docling.models.factories import get_ocr_factory
from docling.pipeline.asr_pipeline import AsrPipeline
from docling.pipeline.vlm_pipeline import VlmPipeline
warnings.filterwarnings(action="ignore", category=UserWarning, module="pydantic|torch")
@ -296,13 +313,17 @@ def convert( # noqa: C901
),
] = ImageRefMode.EMBEDDED,
pipeline: Annotated[
PdfPipeline,
ProcessingPipeline,
typer.Option(..., help="Choose the pipeline to process PDF, image, or audio files."),
] = PdfPipeline.STANDARD,
] = ProcessingPipeline.STANDARD,
vlm_model: Annotated[
VlmModelType,
typer.Option(..., help="Choose the VLM model to use with PDF or image files."),
] = VlmModelType.SMOLDOCLING,
asr_model: Annotated[
AsrModelType,
typer.Option(..., help="Choose the ASR model to use with audio/video files."),
] = AsrModelType.WHISPER_TINY,
ocr: Annotated[
bool,
typer.Option(
@ -450,12 +471,14 @@ def convert( # noqa: C901
),
] = None,
):
log_format = "%(asctime)s\t%(levelname)s\t%(name)s: %(message)s"
if verbose == 0:
logging.basicConfig(level=logging.WARNING)
logging.basicConfig(level=logging.WARNING, format=log_format)
elif verbose == 1:
logging.basicConfig(level=logging.INFO)
logging.basicConfig(level=logging.INFO, format=log_format)
else:
logging.basicConfig(level=logging.DEBUG)
logging.basicConfig(level=logging.DEBUG, format=log_format)
settings.debug.visualize_cells = debug_visualize_cells
settings.debug.visualize_layout = debug_visualize_layout
@ -530,9 +553,12 @@ def convert( # noqa: C901
ocr_options.lang = ocr_lang_list
accelerator_options = AcceleratorOptions(num_threads=num_threads, device=device)
pipeline_options: PaginatedPipelineOptions
# pipeline_options: PaginatedPipelineOptions
pipeline_options: PipelineOptions
if pipeline == PdfPipeline.STANDARD:
format_options: Dict[InputFormat, FormatOption] = {}
if pipeline == ProcessingPipeline.STANDARD:
pipeline_options = PdfPipelineOptions(
allow_external_plugins=allow_external_plugins,
enable_remote_services=enable_remote_services,
@ -574,7 +600,13 @@ def convert( # noqa: C901
pipeline_options=pipeline_options,
backend=backend, # pdf_backend
)
elif pipeline == PdfPipeline.VLM:
format_options = {
InputFormat.PDF: pdf_format_option,
InputFormat.IMAGE: pdf_format_option,
}
elif pipeline == ProcessingPipeline.VLM:
pipeline_options = VlmPipelineOptions(
enable_remote_services=enable_remote_services,
)
@ -600,13 +632,48 @@ def convert( # noqa: C901
pipeline_cls=VlmPipeline, pipeline_options=pipeline_options
)
if artifacts_path is not None:
pipeline_options.artifacts_path = artifacts_path
format_options: Dict[InputFormat, FormatOption] = {
format_options = {
InputFormat.PDF: pdf_format_option,
InputFormat.IMAGE: pdf_format_option,
}
elif pipeline == ProcessingPipeline.ASR:
pipeline_options = AsrPipelineOptions(
# enable_remote_services=enable_remote_services,
# artifacts_path = artifacts_path
)
if asr_model == AsrModelType.WHISPER_TINY:
pipeline_options.asr_options = WHISPER_TINY
elif asr_model == AsrModelType.WHISPER_SMALL:
pipeline_options.asr_options = WHISPER_SMALL
elif asr_model == AsrModelType.WHISPER_MEDIUM:
pipeline_options.asr_options = WHISPER_MEDIUM
elif asr_model == AsrModelType.WHISPER_BASE:
pipeline_options.asr_options = WHISPER_BASE
elif asr_model == AsrModelType.WHISPER_LARGE:
pipeline_options.asr_options = WHISPER_LARGE
elif asr_model == AsrModelType.WHISPER_TURBO:
pipeline_options.asr_options = WHISPER_TURBO
else:
_log.error(f"{asr_model} is not known")
raise ValueError(f"{asr_model} is not known")
_log.info(f"pipeline_options: {pipeline_options}")
audio_format_option = AudioFormatOption(
pipeline_cls=AsrPipeline,
pipeline_options=pipeline_options,
)
format_options = {
InputFormat.AUDIO: audio_format_option,
}
if artifacts_path is not None:
pipeline_options.artifacts_path = artifacts_path
# audio_pipeline_options.artifacts_path = artifacts_path
doc_converter = DocumentConverter(
allowed_formats=from_formats,
format_options=format_options,
@ -614,6 +681,7 @@ def convert( # noqa: C901
start_time = time.time()
_log.info(f"paths: {input_doc_paths}")
conv_results = doc_converter.convert_all(
input_doc_paths, headers=parsed_headers, raises_on_error=abort_on_error
)
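
The `if`/`elif` chain that maps `asr_model` to a Whisper preset in the ASR branch above could also be written as a lookup table. The sketch below shows the shape of that alternative; it is not what the commit does. The enum mirrors `AsrModelType` from the diff, while `PRESETS` and `select_preset` are hypothetical, and plain strings stand in for the `InlineAsrNativeWhisperOptions` preset objects.

```python
from enum import Enum
from typing import Dict


class AsrModelType(str, Enum):
    WHISPER_TINY = "whisper_tiny"
    WHISPER_SMALL = "whisper_small"
    WHISPER_MEDIUM = "whisper_medium"
    WHISPER_BASE = "whisper_base"
    WHISPER_LARGE = "whisper_large"
    WHISPER_TURBO = "whisper_turbo"


# Stand-ins for the preset objects: here just the Whisper model size name.
PRESETS: Dict[AsrModelType, str] = {
    model: model.value.split("_", 1)[1] for model in AsrModelType
}


def select_preset(asr_model: AsrModelType) -> str:
    """Return the preset for asr_model, or raise ValueError if unknown."""
    try:
        return PRESETS[asr_model]
    except KeyError:
        raise ValueError(f"{asr_model} is not known") from None
```

A table keeps the mapping in one place, so adding a model means adding one entry rather than another branch.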

@ -0,0 +1,92 @@
import logging
from enum import Enum
from pydantic import (
AnyUrl,
)
from docling.datamodel.accelerator_options import AcceleratorDevice
from docling.datamodel.pipeline_options_asr_model import (
# AsrResponseFormat,
# ApiAsrOptions,
InferenceAsrFramework,
InlineAsrNativeWhisperOptions,
TransformersModelType,
)
_log = logging.getLogger(__name__)
WHISPER_TINY = InlineAsrNativeWhisperOptions(
repo_id="tiny",
inference_framework=InferenceAsrFramework.WHISPER,
verbose=True,
timestamps=True,
word_timestamps=True,
temperature=0.0,
max_new_tokens=256,
max_time_chunk=30.0,
)
WHISPER_SMALL = InlineAsrNativeWhisperOptions(
repo_id="small",
inference_framework=InferenceAsrFramework.WHISPER,
verbose=True,
timestamps=True,
word_timestamps=True,
temperature=0.0,
max_new_tokens=256,
max_time_chunk=30.0,
)
WHISPER_MEDIUM = InlineAsrNativeWhisperOptions(
repo_id="medium",
inference_framework=InferenceAsrFramework.WHISPER,
verbose=True,
timestamps=True,
word_timestamps=True,
temperature=0.0,
max_new_tokens=256,
max_time_chunk=30.0,
)
WHISPER_BASE = InlineAsrNativeWhisperOptions(
repo_id="base",
inference_framework=InferenceAsrFramework.WHISPER,
verbose=True,
timestamps=True,
word_timestamps=True,
temperature=0.0,
max_new_tokens=256,
max_time_chunk=30.0,
)
WHISPER_LARGE = InlineAsrNativeWhisperOptions(
repo_id="large",
inference_framework=InferenceAsrFramework.WHISPER,
verbose=True,
timestamps=True,
word_timestamps=True,
temperature=0.0,
max_new_tokens=256,
max_time_chunk=30.0,
)
WHISPER_TURBO = InlineAsrNativeWhisperOptions(
repo_id="turbo",
inference_framework=InferenceAsrFramework.WHISPER,
verbose=True,
timestamps=True,
word_timestamps=True,
temperature=0.0,
max_new_tokens=256,
max_time_chunk=30.0,
)
class AsrModelType(str, Enum):
WHISPER_TINY = "whisper_tiny"
WHISPER_SMALL = "whisper_small"
WHISPER_MEDIUM = "whisper_medium"
WHISPER_BASE = "whisper_base"
WHISPER_LARGE = "whisper_large"
WHISPER_TURBO = "whisper_turbo"

@ -49,6 +49,7 @@ class InputFormat(str, Enum):
XML_USPTO = "xml_uspto"
XML_JATS = "xml_jats"
JSON_DOCLING = "json_docling"
AUDIO = "audio"
class OutputFormat(str, Enum):
@ -73,6 +74,7 @@ FormatToExtensions: Dict[InputFormat, List[str]] = {
InputFormat.XLSX: ["xlsx", "xlsm"],
InputFormat.XML_USPTO: ["xml", "txt"],
InputFormat.JSON_DOCLING: ["json"],
InputFormat.AUDIO: ["wav", "mp3"],
}
FormatToMimeType: Dict[InputFormat, List[str]] = {
@ -104,6 +106,7 @@ FormatToMimeType: Dict[InputFormat, List[str]] = {
],
InputFormat.XML_USPTO: ["application/xml", "text/plain"],
InputFormat.JSON_DOCLING: ["application/json"],
InputFormat.AUDIO: ["audio/x-wav", "audio/mpeg", "audio/wav", "audio/mp3"],
}
MimeTypeToFormat: dict[str, list[InputFormat]] = {
@ -298,7 +301,7 @@ class OpenAiChatMessage(BaseModel):
class OpenAiResponseChoice(BaseModel):
index: int
message: OpenAiChatMessage
finish_reason: str
finish_reason: Optional[str]
class OpenAiResponseUsage(BaseModel):
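
The new `InputFormat.AUDIO` entries above register `wav` and `mp3` as audio extensions. A small sketch of how an extension table of this shape can be queried follows. The table is an excerpt copied from the hunk, with plain strings in place of the `InputFormat` enum, and `formats_for_extension` is a hypothetical helper, not a docling function.

```python
from pathlib import Path
from typing import Dict, List

# Excerpt of FormatToExtensions from the hunk above, keyed by format value.
FORMAT_TO_EXTENSIONS: Dict[str, List[str]] = {
    "xlsx": ["xlsx", "xlsm"],
    "xml_uspto": ["xml", "txt"],
    "json_docling": ["json"],
    "audio": ["wav", "mp3"],
}


def formats_for_extension(filename: str) -> List[str]:
    """Return every format whose extension list contains the file's suffix."""
    ext = Path(filename).suffix.lower().lstrip(".")
    return [fmt for fmt, exts in FORMAT_TO_EXTENSIONS.items() if ext in exts]
```

The helper returns a list because an extension is not guaranteed to be unique to one format.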

@ -249,7 +249,7 @@ class _DocumentConversionInput(BaseModel):
backend: Type[AbstractDocumentBackend]
if format not in format_options.keys():
_log.error(
f"Input document {obj.name} does not match any allowed format."
f"Input document {obj.name} with format {format} does not match any allowed format: ({format_options.keys()})"
)
backend = _DummyBackend
else:
@ -318,6 +318,8 @@ class _DocumentConversionInput(BaseModel):
mime = mime or _DocumentConversionInput._detect_csv(content)
mime = mime or "text/plain"
formats = MimeTypeToFormat.get(mime, [])
_log.info(f"detected formats: {formats}")
if formats:
if len(formats) == 1 and mime != "text/plain":
return formats[0]

@ -11,8 +11,13 @@ from pydantic import (
)
from typing_extensions import deprecated
from docling.datamodel import asr_model_specs
# Import the following for backwards compatibility
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.pipeline_options_asr_model import (
InlineAsrOptions,
)
from docling.datamodel.pipeline_options_vlm_model import (
ApiVlmOptions,
InferenceFramework,
@ -202,7 +207,7 @@ smolvlm_picture_description = PictureDescriptionVlmOptions(
# GraniteVision
granite_picture_description = PictureDescriptionVlmOptions(
repo_id="ibm-granite/granite-vision-3.1-2b-preview",
repo_id="ibm-granite/granite-vision-3.2-2b-preview",
prompt="What is shown in this image?",
)
@ -260,6 +265,11 @@ class VlmPipelineOptions(PaginatedPipelineOptions):
)
class AsrPipelineOptions(PipelineOptions):
asr_options: Union[InlineAsrOptions] = asr_model_specs.WHISPER_TINY
artifacts_path: Optional[Union[Path, str]] = None
class PdfPipelineOptions(PaginatedPipelineOptions):
"""Options for the PDF pipeline."""
@ -297,6 +307,7 @@ class PdfPipelineOptions(PaginatedPipelineOptions):
)
class PdfPipeline(str, Enum):
class ProcessingPipeline(str, Enum):
STANDARD = "standard"
VLM = "vlm"
ASR = "asr"

@ -0,0 +1,57 @@
from enum import Enum
from typing import Any, Dict, List, Literal, Optional, Union
from pydantic import AnyUrl, BaseModel
from typing_extensions import deprecated
from docling.datamodel.accelerator_options import AcceleratorDevice
from docling.datamodel.pipeline_options_vlm_model import (
# InferenceFramework,
TransformersModelType,
)
class BaseAsrOptions(BaseModel):
kind: str
# prompt: str
class InferenceAsrFramework(str, Enum):
# MLX = "mlx" # disabled for now
# TRANSFORMERS = "transformers" # disabled for now
WHISPER = "whisper"
class InlineAsrOptions(BaseAsrOptions):
kind: Literal["inline_model_options"] = "inline_model_options"
repo_id: str
verbose: bool = False
timestamps: bool = True
temperature: float = 0.0
max_new_tokens: int = 256
max_time_chunk: float = 30.0
torch_dtype: Optional[str] = None
supported_devices: List[AcceleratorDevice] = [
AcceleratorDevice.CPU,
AcceleratorDevice.CUDA,
AcceleratorDevice.MPS,
]
@property
def repo_cache_folder(self) -> str:
return self.repo_id.replace("/", "--")
class InlineAsrNativeWhisperOptions(InlineAsrOptions):
inference_framework: InferenceAsrFramework = InferenceAsrFramework.WHISPER
language: str = "en"
supported_devices: List[AcceleratorDevice] = [
AcceleratorDevice.CPU,
AcceleratorDevice.CUDA,
]
word_timestamps: bool = True

@ -19,6 +19,7 @@ from docling.backend.md_backend import MarkdownDocumentBackend
from docling.backend.msexcel_backend import MsExcelDocumentBackend
from docling.backend.mspowerpoint_backend import MsPowerpointDocumentBackend
from docling.backend.msword_backend import MsWordDocumentBackend
from docling.backend.noop_backend import NoOpBackend
from docling.backend.xml.jats_backend import JatsDocumentBackend
from docling.backend.xml.uspto_backend import PatentUsptoDocumentBackend
from docling.datamodel.base_models import (
@ -41,6 +42,7 @@ from docling.datamodel.settings import (
settings,
)
from docling.exceptions import ConversionError
from docling.pipeline.asr_pipeline import AsrPipeline
from docling.pipeline.base_pipeline import BasePipeline
from docling.pipeline.simple_pipeline import SimplePipeline
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
@ -118,6 +120,11 @@ class PdfFormatOption(FormatOption):
backend: Type[AbstractDocumentBackend] = DoclingParseV4DocumentBackend
class AudioFormatOption(FormatOption):
pipeline_cls: Type = AsrPipeline
backend: Type[AbstractDocumentBackend] = NoOpBackend
def _get_default_option(format: InputFormat) -> FormatOption:
format_to_default_options = {
InputFormat.CSV: FormatOption(
@ -156,6 +163,7 @@ def _get_default_option(format: InputFormat) -> FormatOption:
InputFormat.JSON_DOCLING: FormatOption(
pipeline_cls=SimplePipeline, backend=DoclingJSONBackend
),
InputFormat.AUDIO: FormatOption(pipeline_cls=AsrPipeline, backend=NoOpBackend),
}
if (options := format_to_default_options.get(format)) is not None:
return options

View File

@ -124,7 +124,7 @@ class ReadingOrderModel:
page_no = page.page_no + 1
size = page.size
assert size is not None
assert size is not None, "Page size is not initialized."
out_doc.add_page(page_no=page_no, size=size)

View File

@ -0,0 +1,253 @@
import logging
import os
import re
from io import BytesIO
from pathlib import Path
from typing import List, Optional, Union, cast

from docling_core.types.doc import DoclingDocument, DocumentOrigin

# import whisper # type: ignore
# import librosa
# import numpy as np
# import soundfile as sf # type: ignore
from docling_core.types.doc.labels import DocItemLabel
from pydantic import BaseModel, Field, validator

from docling.backend.abstract_backend import AbstractDocumentBackend
from docling.backend.noop_backend import NoOpBackend

# from pydub import AudioSegment # type: ignore
# from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline
from docling.datamodel.accelerator_options import (
    AcceleratorOptions,
)
from docling.datamodel.base_models import (
    ConversionStatus,
    FormatToMimeType,
)
from docling.datamodel.document import ConversionResult, InputDocument
from docling.datamodel.pipeline_options import (
    AsrPipelineOptions,
)
from docling.datamodel.pipeline_options_asr_model import (
    InlineAsrNativeWhisperOptions,
    # AsrResponseFormat,
    InlineAsrOptions,
)
from docling.datamodel.pipeline_options_vlm_model import (
    InferenceFramework,
)
from docling.datamodel.settings import settings
from docling.pipeline.base_pipeline import BasePipeline
from docling.utils.accelerator_utils import decide_device
from docling.utils.profiling import ProfilingScope, TimeRecorder

_log = logging.getLogger(__name__)
class _ConversationWord(BaseModel):
    text: str
    start_time: Optional[float] = Field(
        None, description="Start time in seconds from video start"
    )
    end_time: Optional[float] = Field(
        None, ge=0, description="End time in seconds from video start"
    )


class _ConversationItem(BaseModel):
    text: str
    start_time: Optional[float] = Field(
        None, description="Start time in seconds from video start"
    )
    end_time: Optional[float] = Field(
        None, ge=0, description="End time in seconds from video start"
    )
    speaker_id: Optional[int] = Field(None, description="Numeric speaker identifier")
    speaker: Optional[str] = Field(
        None, description="Speaker name, defaults to speaker-{speaker_id}"
    )
    words: Optional[list[_ConversationWord]] = Field(
        None, description="Individual words with time-stamps"
    )

    def __lt__(self, other):
        if not isinstance(other, _ConversationItem):
            return NotImplemented
        return self.start_time < other.start_time

    def __eq__(self, other):
        if not isinstance(other, _ConversationItem):
            return NotImplemented
        return self.start_time == other.start_time

    def to_string(self) -> str:
        """Format the conversation entry as a string"""
        result = ""
        if (self.start_time is not None) and (self.end_time is not None):
            result += f"[time: {self.start_time}-{self.end_time}] "

        if self.speaker is not None:
            result += f"[speaker:{self.speaker}] "

        result += self.text
        return result
class _NativeWhisperModel:
    def __init__(
        self,
        enabled: bool,
        artifacts_path: Optional[Path],
        accelerator_options: AcceleratorOptions,
        asr_options: InlineAsrNativeWhisperOptions,
    ):
        """
        Transcriber using native Whisper.
        """
        self.enabled = enabled

        _log.info(f"artifacts-path: {artifacts_path}")
        _log.info(f"accelerator_options: {accelerator_options}")

        if self.enabled:
            try:
                import whisper  # type: ignore
            except ImportError:
                raise ImportError(
                    "whisper is not installed. Please install it via `pip install openai-whisper` or do `uv sync --extra asr`."
                )
            self.asr_options = asr_options
            self.max_tokens = asr_options.max_new_tokens
            self.temperature = asr_options.temperature

            self.device = decide_device(
                accelerator_options.device,
                supported_devices=asr_options.supported_devices,
            )
            _log.info(f"Available device for Whisper: {self.device}")

            self.model_name = asr_options.repo_id
            _log.info(f"loading _NativeWhisperModel({self.model_name})")
            if artifacts_path is not None:
                _log.info(f"loading {self.model_name} from {artifacts_path}")
                self.model = whisper.load_model(
                    name=self.model_name,
                    device=self.device,
                    download_root=str(artifacts_path),
                )
            else:
                self.model = whisper.load_model(
                    name=self.model_name, device=self.device
                )

            self.verbose = asr_options.verbose
            self.timestamps = asr_options.timestamps
            self.word_timestamps = asr_options.word_timestamps

    def run(self, conv_res: ConversionResult) -> ConversionResult:
        audio_path: Path = Path(conv_res.input.file).resolve()

        try:
            conversation = self.transcribe(audio_path)

            # Ensure we have a proper DoclingDocument
            origin = DocumentOrigin(
                filename=conv_res.input.file.name or "audio.wav",
                mimetype="audio/x-wav",
                binary_hash=conv_res.input.document_hash,
            )
            conv_res.document = DoclingDocument(
                name=conv_res.input.file.stem or "audio.wav", origin=origin
            )

            for citem in conversation:
                conv_res.document.add_text(
                    label=DocItemLabel.TEXT, text=citem.to_string()
                )

            conv_res.status = ConversionStatus.SUCCESS
            return conv_res

        except Exception as exc:
            _log.error(f"Audio transcription has an error: {exc}")

        conv_res.status = ConversionStatus.FAILURE
        return conv_res

    def transcribe(self, fpath: Path) -> list[_ConversationItem]:
        result = self.model.transcribe(
            str(fpath), verbose=self.verbose, word_timestamps=self.word_timestamps
        )

        convo: list[_ConversationItem] = []
        for segment in result["segments"]:
            item = _ConversationItem(
                start_time=segment["start"],
                end_time=segment["end"],
                text=segment["text"],
                words=[],
            )
            if "words" in segment and self.word_timestamps:
                item.words = []
                for word in segment["words"]:
                    item.words.append(
                        _ConversationWord(
                            start_time=word["start"],
                            end_time=word["end"],
                            text=word["word"],
                        )
                    )
            convo.append(item)

        return convo
class AsrPipeline(BasePipeline):
    def __init__(self, pipeline_options: AsrPipelineOptions):
        super().__init__(pipeline_options)
        self.keep_backend = True

        self.pipeline_options: AsrPipelineOptions = pipeline_options

        artifacts_path: Optional[Path] = None
        if pipeline_options.artifacts_path is not None:
            artifacts_path = Path(pipeline_options.artifacts_path).expanduser()
        elif settings.artifacts_path is not None:
            artifacts_path = Path(settings.artifacts_path).expanduser()

        if artifacts_path is not None and not artifacts_path.is_dir():
            raise RuntimeError(
                f"The value of {artifacts_path=} is not valid. "
                "When defined, it must point to a folder containing all models required by the pipeline."
            )

        if isinstance(self.pipeline_options.asr_options, InlineAsrNativeWhisperOptions):
            asr_options: InlineAsrNativeWhisperOptions = (
                self.pipeline_options.asr_options
            )
            self._model = _NativeWhisperModel(
                enabled=True,  # must be always enabled for this pipeline to make sense.
                artifacts_path=artifacts_path,
                accelerator_options=pipeline_options.accelerator_options,
                asr_options=asr_options,
            )
        else:
            _log.error(f"No model support for {self.pipeline_options.asr_options}")

    def _determine_status(self, conv_res: ConversionResult) -> ConversionStatus:
        status = ConversionStatus.SUCCESS
        return status

    @classmethod
    def get_default_options(cls) -> AsrPipelineOptions:
        return AsrPipelineOptions()

    def _build_document(self, conv_res: ConversionResult) -> ConversionResult:
        _log.info(f"start _build_document in AsrPipeline: {conv_res.input.file}")
        with TimeRecorder(conv_res, "doc_build", scope=ProfilingScope.DOCUMENT):
            self._model.run(conv_res=conv_res)

        return conv_res

    @classmethod
    def is_backend_supported(cls, backend: AbstractDocumentBackend):
        return isinstance(backend, NoOpBackend)
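Review note: the segment-to-text mapping in `transcribe` and `to_string` can be tried without Whisper or docling. The sketch below is a stdlib-only approximation; `Word`, `Item`, and `items_from_result` are hypothetical names, and the `sample` dictionary is hand-written illustration data that only imitates the keys the diff reads (`segments`, `start`, `end`, `text`, `words`, `word`). It is not output from a real model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Word:
    """Hypothetical stand-in for _ConversationWord."""

    text: str
    start_time: Optional[float] = None
    end_time: Optional[float] = None


@dataclass
class Item:
    """Hypothetical stand-in for _ConversationItem."""

    text: str
    start_time: Optional[float] = None
    end_time: Optional[float] = None
    speaker: Optional[str] = None
    words: List[Word] = field(default_factory=list)

    def to_string(self) -> str:
        # Same layout as the diff: optional time prefix, optional speaker, text.
        prefix = ""
        if self.start_time is not None and self.end_time is not None:
            prefix += f"[time: {self.start_time}-{self.end_time}] "
        if self.speaker is not None:
            prefix += f"[speaker:{self.speaker}] "
        return prefix + self.text


def items_from_result(result: Dict[str, Any], word_timestamps: bool = True) -> List[Item]:
    """Convert a transcription result dictionary into conversation items."""
    items: List[Item] = []
    for segment in result["segments"]:
        item = Item(
            text=segment["text"],
            start_time=segment["start"],
            end_time=segment["end"],
        )
        if word_timestamps and "words" in segment:
            item.words = [
                Word(text=w["word"], start_time=w["start"], end_time=w["end"])
                for w in segment["words"]
            ]
        items.append(item)
    return items


# Hand-written sample data (an assumption, not real model output).
sample = {
    "segments": [
        {
            "start": 0.0,
            "end": 4.0,
            "text": "Shakespeare on Scenery by Oscar Wilde",
            "words": [{"word": "Shakespeare", "start": 0.0, "end": 0.8}],
        }
    ]
}
lines = [item.to_string() for item in items_from_result(sample)]
print(lines[0])
```

Each resulting line is what the pipeline would add to the document as a `TEXT` item, so the time prefix ends up in the exported Markdown.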

View File

@ -193,6 +193,17 @@ class PaginatedPipeline(BasePipeline): # TODO this is a bad name.
)
raise e
# Filter out uninitialized pages (those with size=None) that may remain
# after timeout or processing failures to prevent assertion errors downstream
initial_page_count = len(conv_res.pages)
conv_res.pages = [page for page in conv_res.pages if page.size is not None]
if len(conv_res.pages) < initial_page_count:
_log.info(
f"Filtered out {initial_page_count - len(conv_res.pages)} uninitialized pages "
f"due to timeout or processing failures"
)
return conv_res
def _unload(self, conv_res: ConversionResult) -> ConversionResult:
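Review note: the page filter added in this hunk is easy to check in isolation. A minimal sketch, assuming a hypothetical `PageSketch` type that stands in for docling's page model and a hypothetical helper `drop_uninitialized`:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PageSketch:
    """Hypothetical stand-in for a pipeline page; size is None until initialized."""

    page_no: int
    size: Optional[Tuple[float, float]] = None


def drop_uninitialized(pages: List[PageSketch]) -> Tuple[List[PageSketch], int]:
    """Keep only pages with a size and report how many were dropped."""
    kept = [page for page in pages if page.size is not None]
    return kept, len(pages) - len(kept)


pages = [
    PageSketch(0, (612.0, 792.0)),
    PageSketch(1),  # e.g. never processed because of a timeout
    PageSketch(2, (612.0, 792.0)),
]
kept, dropped = drop_uninitialized(pages)
print([page.page_no for page in kept], dropped)
```

Filtering before the reading-order step means the `assert size is not None` check downstream only ever sees initialized pages.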

View File

@ -121,14 +121,15 @@ def export_documents(
def main():
logging.basicConfig(level=logging.INFO)
data_folder = Path(__file__).parent / "../../tests/data"
input_doc_paths = [
Path("./tests/data/pdf/2206.01062.pdf"),
Path("./tests/data/pdf/2203.01017v2.pdf"),
Path("./tests/data/pdf/2305.03393v1.pdf"),
Path("./tests/data/pdf/redp5110_sampled.pdf"),
data_folder / "pdf/2206.01062.pdf",
data_folder / "pdf/2203.01017v2.pdf",
data_folder / "pdf/2305.03393v1.pdf",
data_folder / "pdf/redp5110_sampled.pdf",
]
# buf = BytesIO(Path("./test/data/2206.01062.pdf").open("rb").read())
# buf = BytesIO((data_folder / "pdf/2206.01062.pdf").open("rb").read())
# docs = [DocumentStream(name="my_doc.pdf", stream=buf)]
# input = DocumentConversionInput.from_streams(docs)
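Review note: the switch from `Path("./tests/data/...")` to `Path(__file__).parent / "../../tests/data"` makes the examples independent of the current working directory. The sketch below shows the resolution with pure path arithmetic; `data_folder_for` and the `/repo/...` script location are hypothetical, and `posixpath.normpath` is used only to make the result deterministic here (unlike `Path.resolve()`, it does not follow symlinks).

```python
import posixpath
from pathlib import PurePosixPath


def data_folder_for(script_path: str) -> PurePosixPath:
    """Return the tests/data folder two levels above the script's directory."""
    script_dir = PurePosixPath(script_path).parent
    # pathlib keeps ".." segments as-is, so collapse them explicitly.
    return PurePosixPath(posixpath.normpath(str(script_dir / "../../tests/data")))


folder = data_folder_for("/repo/docs/examples/example.py")
print(folder / "pdf/2206.01062.pdf")
```

With the old relative form, the same example would only find its input when launched from the repository root.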

View File

@ -16,7 +16,8 @@ _log = logging.getLogger(__name__)
def main():
logging.basicConfig(level=logging.INFO)
input_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
data_folder = Path(__file__).parent / "../../tests/data"
input_doc_path = data_folder / "pdf/2206.01062.pdf"
###########################################################################

View File

@ -71,7 +71,8 @@ class ExampleFormulaUnderstandingPipeline(StandardPdfPipeline):
def main():
logging.basicConfig(level=logging.INFO)
input_doc_path = Path("./tests/data/pdf/2203.01017v2.pdf")
data_folder = Path(__file__).parent / "../../tests/data"
input_doc_path = data_folder / "pdf/2203.01017v2.pdf"
pipeline_options = ExampleFormulaUnderstandingPipelineOptions()
pipeline_options.do_formula_understanding = True

View File

@ -76,7 +76,8 @@ class ExamplePictureClassifierPipeline(StandardPdfPipeline):
def main():
logging.basicConfig(level=logging.INFO)
input_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
data_folder = Path(__file__).parent / "../../tests/data"
input_doc_path = data_folder / "pdf/2206.01062.pdf"
pipeline_options = ExamplePictureClassifierPipelineOptions()
pipeline_options.images_scale = 2.0

View File

@ -16,7 +16,8 @@ IMAGE_RESOLUTION_SCALE = 2.0
def main():
logging.basicConfig(level=logging.INFO)
input_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
data_folder = Path(__file__).parent / "../../tests/data"
input_doc_path = data_folder / "pdf/2206.01062.pdf"
output_dir = Path("scratch")
# Important: For operating with page images, we must keep them, otherwise the DocumentConverter

View File

@ -19,7 +19,8 @@ IMAGE_RESOLUTION_SCALE = 2.0
def main():
logging.basicConfig(level=logging.INFO)
input_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
data_folder = Path(__file__).parent / "../../tests/data"
input_doc_path = data_folder / "pdf/2206.01062.pdf"
output_dir = Path("scratch")
# Important: For operating with page images, we must keep them, otherwise the DocumentConverter

View File

@ -12,7 +12,8 @@ _log = logging.getLogger(__name__)
def main():
logging.basicConfig(level=logging.INFO)
input_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
data_folder = Path(__file__).parent / "../../tests/data"
input_doc_path = data_folder / "pdf/2206.01062.pdf"
output_dir = Path("scratch")
doc_converter = DocumentConverter()

View File

@ -9,7 +9,8 @@ from docling.document_converter import DocumentConverter, PdfFormatOption
def main():
input_doc = Path("./tests/data/pdf/2206.01062.pdf")
data_folder = Path(__file__).parent / "../../tests/data"
input_doc_path = data_folder / "pdf/2206.01062.pdf"
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
@ -32,7 +33,7 @@ def main():
}
)
doc = converter.convert(input_doc).document
doc = converter.convert(input_doc_path).document
md = doc.export_to_markdown()
print(md)

56
docs/examples/minimal_asr_pipeline.py vendored Normal file
View File

@ -0,0 +1,56 @@
from pathlib import Path

from docling_core.types.doc import DoclingDocument

from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import ConversionStatus, InputFormat
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline


def get_asr_converter():
    """Create a DocumentConverter configured for ASR with whisper_turbo model."""
    pipeline_options = AsrPipelineOptions()
    pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

    converter = DocumentConverter(
        format_options={
            InputFormat.AUDIO: AudioFormatOption(
                pipeline_cls=AsrPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )
    return converter


def asr_pipeline_conversion(audio_path: Path) -> DoclingDocument:
    """ASR pipeline conversion using whisper_turbo"""
    # Check if the test audio file exists
    assert audio_path.exists(), f"Test audio file not found: {audio_path}"

    converter = get_asr_converter()

    # Convert the audio file
    result: ConversionResult = converter.convert(audio_path)

    # Verify conversion was successful
    assert result.status == ConversionStatus.SUCCESS, (
        f"Conversion failed with status: {result.status}"
    )
    return result.document


if __name__ == "__main__":
    audio_path = Path("tests/data/audio/sample_10s.mp3")

    doc = asr_pipeline_conversion(audio_path=audio_path)
    print(doc.export_to_markdown())

    # Expected output:
    #
    # [time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde
    #
    # [time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain.
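Review note: downstream consumers may want the timestamps back as numbers. The sketch below parses lines in the `[time: start-end] [speaker:name] text` layout produced by `to_string`; `TranscriptLine` and `parse_transcript_line` are hypothetical helpers that are not part of docling, and the pattern assumes plain decimal timestamps (no exponent notation).

```python
import re
from typing import NamedTuple, Optional

# Time prefix, then an optional speaker tag, then the free text.
_LINE = re.compile(
    r"^\[time: (?P<start>\d+(?:\.\d+)?)-(?P<end>\d+(?:\.\d+)?)\]\s*"
    r"(?:\[speaker:(?P<speaker>[^\]]+)\]\s*)?"
    r"(?P<text>.*)$"
)


class TranscriptLine(NamedTuple):
    start: float
    end: float
    speaker: Optional[str]
    text: str


def parse_transcript_line(line: str) -> Optional[TranscriptLine]:
    """Parse one exported transcript line; return None if it has no time prefix."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    return TranscriptLine(
        start=float(match["start"]),
        end=float(match["end"]),
        speaker=match["speaker"],  # None when the speaker tag is absent
        text=match["text"],
    )


parsed = parse_transcript_line("[time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde")
print(parsed)
```

Lines without a time prefix yield `None`, so the helper can be mapped over a whole Markdown export and the misses filtered out.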

View File

@ -96,7 +96,8 @@ def watsonx_vlm_options():
def main():
logging.basicConfig(level=logging.INFO)
input_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
data_folder = Path(__file__).parent / "../../tests/data"
input_doc_path = data_folder / "pdf/2206.01062.pdf"
pipeline_options = PdfPipelineOptions(
enable_remote_services=True # <-- this is required!

View File

@ -10,7 +10,8 @@ from docling.document_converter import DocumentConverter, PdfFormatOption
def main():
input_doc = Path("./tests/data/pdf/2206.01062.pdf")
data_folder = Path(__file__).parent / "../../tests/data"
input_doc_path = data_folder / "pdf/2206.01062.pdf"
# Explicitly set the accelerator
# accelerator_options = AcceleratorOptions(
@ -47,7 +48,7 @@ def main():
settings.debug.profile_pipeline_timings = True
# Convert the document
conversion_result = converter.convert(input_doc)
conversion_result = converter.convert(input_doc_path)
doc = conversion_result.document
# List with total time per document

View File

@ -9,7 +9,8 @@ from docling.document_converter import DocumentConverter, PdfFormatOption
def main():
input_doc = Path("./tests/data/pdf/2206.01062.pdf")
data_folder = Path(__file__).parent / "../../tests/data"
input_doc_path = data_folder / "pdf/2206.01062.pdf"
# Set lang=["auto"] with a tesseract OCR engine: TesseractOcrOptions, TesseractCliOcrOptions
# ocr_options = TesseractOcrOptions(lang=["auto"])
@ -27,7 +28,7 @@ def main():
}
)
doc = converter.convert(input_doc).document
doc = converter.convert(input_doc_path).document
md = doc.export_to_markdown()
print(md)

View File

@ -30,7 +30,8 @@ def translate(text: str, src: str = "en", dest: str = "de"):
def main():
logging.basicConfig(level=logging.INFO)
input_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
data_folder = Path(__file__).parent / "../../tests/data"
input_doc_path = data_folder / "pdf/2206.01062.pdf"
output_dir = Path("scratch")
# Important: For operating with page images, we must keep them, otherwise the DocumentConverter

View File

@ -95,8 +95,8 @@ def watsonx_vlm_options(model: str, prompt: str):
def main():
logging.basicConfig(level=logging.INFO)
# input_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
input_doc_path = Path("./tests/data/pdf/2305.03393v1-pg9.pdf")
data_folder = Path(__file__).parent / "../../tests/data"
input_doc_path = data_folder / "pdf/2305.03393v1-pg9.pdf"
pipeline_options = VlmPipelineOptions(
enable_remote_services=True # <-- this is required!

7
docs/index.md vendored
View File

@ -20,14 +20,15 @@ Docling simplifies document processing, parsing diverse formats — including ad
## Features
* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, images (PNG, TIFF, JPEG, ...), and more
* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, [DocTags](https://arxiv.org/abs/2503.11576) and lossless JSON
* 🔒 Local execution capabilities for sensitive data and air-gapped environments
* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 Extensive OCR support for scanned PDFs and images
* 🥚 Support of several Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview)) 🔥
* 👓 Support of several Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview))
* 🎙️ Support for Audio with Automatic Speech Recognition (ASR) models
* 💻 Simple and convenient CLI
### Coming soon

View File

@ -80,6 +80,7 @@ nav:
- "VLM pipeline with SmolDocling": examples/minimal_vlm_pipeline.py
- "VLM pipeline with remote model": examples/vlm_pipeline_api_model.py
- "VLM comparison": examples/compare_vlm_models.py
- "ASR pipeline with Whisper": examples/minimal_asr_pipeline.py
- "Figure export": examples/export_figures.py
- "Table export": examples/export_tables.py
- "Multimodal export": examples/export_multimodal.py

View File

@ -1,6 +1,6 @@
[project]
name = "docling"
version = "2.37.0" # DO NOT EDIT, updated automatically
version = "2.38.0" # DO NOT EDIT, updated automatically
description = "SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications."
license = "MIT"
keywords = [
@ -99,6 +99,9 @@ rapidocr = [
# 'onnxruntime (>=1.7.0,<2.0.0) ; python_version >= "3.10"',
# 'onnxruntime (>=1.7.0,<1.20.0) ; python_version < "3.10"',
]
asr = [
"openai-whisper>=20240930",
]
[dependency-groups]
dev = [
@ -145,6 +148,9 @@ constraints = [
package = true
default-groups = "all"
[tool.uv.sources]
openai-whisper = { git = "https://github.com/openai/whisper.git", rev = "dd985ac4b90cafeef8712f2998d62c59c3e62d22" }
[tool.setuptools.packages.find]
include = ["docling*"]
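Review note: the new `asr` extra pairs with the guarded `import whisper` inside `_NativeWhisperModel.__init__`, so users without the extra only fail when they actually build the ASR pipeline. A minimal sketch of that lazy-import pattern, using a hypothetical helper `require_optional` (not part of docling) and the stdlib `importlib`:

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency, or raise ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(f"{module_name} is not installed. {install_hint}") from exc


# A module that exists imports normally.
json_module = require_optional("json", "It ships with Python.")

# A missing module produces the actionable message instead of a bare error.
try:
    require_optional("surely_missing_module_xyz", "Install the matching extra first.")
except ImportError as err:
    message = str(err)
    print(message)
```

Chaining with `from exc` keeps the original import failure visible in the traceback, which helps when the package is installed but one of its own dependencies is broken.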

BIN
tests/data/audio/sample_10s.mp3 vendored Normal file

Binary file not shown.

BIN
tests/data/docx/word_image_anchors.docx vendored Normal file

Binary file not shown.

View File

@ -2705,7 +2705,7 @@
"b": 102.78223000000003,
"coord_origin": "TOPLEFT"
},
"confidence": 0.9373534917831421,
"confidence": 0.9373533725738525,
"cells": [
{
"index": 0,
@ -2745,7 +2745,7 @@
"b": 102.78223000000003,
"coord_origin": "TOPLEFT"
},
"confidence": 0.8858680725097656,
"confidence": 0.8858679533004761,
"cells": [
{
"index": 1,
@ -13641,7 +13641,7 @@
"b": 102.78223000000003,
"coord_origin": "TOPLEFT"
},
"confidence": 0.9373534917831421,
"confidence": 0.9373533725738525,
"cells": [
{
"index": 0,
@ -13687,7 +13687,7 @@
"b": 102.78223000000003,
"coord_origin": "TOPLEFT"
},
"confidence": 0.8858680725097656,
"confidence": 0.8858679533004761,
"cells": [
{
"index": 1,
@ -26499,7 +26499,7 @@
"b": 102.78223000000003,
"coord_origin": "TOPLEFT"
},
"confidence": 0.9373534917831421,
"confidence": 0.9373533725738525,
"cells": [
{
"index": 0,
@ -26545,7 +26545,7 @@
"b": 102.78223000000003,
"coord_origin": "TOPLEFT"
},
"confidence": 0.8858680725097656,
"confidence": 0.8858679533004761,
"cells": [
{
"index": 1,

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "2203.01017v2",
"origin": {
"mimetype": "application/pdf",
@ -17863,7 +17863,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/1",
@ -18753,7 +18754,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/2",
@ -20117,7 +20119,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/3",
@ -22266,7 +22269,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/4",
@ -22927,7 +22931,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/5",
@ -24050,7 +24055,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/6",
@ -26307,7 +26313,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/7",
@ -27600,7 +27607,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/8",
@ -27635,7 +27643,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/9",
@ -27670,7 +27679,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/10",
@ -27705,7 +27715,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/11",
@ -27740,7 +27751,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/12",
@ -27783,7 +27795,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/13",
@ -27818,7 +27831,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/14",
@ -27853,7 +27867,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/15",
@ -27888,7 +27903,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/16",
@ -27931,7 +27947,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/17",
@ -27966,7 +27983,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/18",
@ -28001,7 +28019,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/19",
@ -28036,7 +28055,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/20",
@ -28071,7 +28091,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/21",
@ -28106,7 +28127,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/22",
@ -28141,7 +28163,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/23",
@ -28176,7 +28199,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/24",
@ -28211,7 +28235,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/25",
@ -28246,7 +28271,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/26",
@ -28281,7 +28307,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/27",
@ -28324,7 +28351,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/28",
@ -28359,7 +28387,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/29",
@ -28394,7 +28423,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/30",
@ -28429,7 +28459,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/31",
@ -28464,7 +28495,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/32",
@ -28499,7 +28531,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/33",
@ -28542,7 +28575,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/34",
@ -28577,7 +28611,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/35",
@ -28612,7 +28647,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/36",
@ -28647,7 +28683,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
},
{
"self_ref": "#/tables/37",
@ -28682,7 +28719,8 @@
"num_rows": 0,
"num_cols": 0,
"grid": []
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "2206.01062",
"origin": {
"mimetype": "application/pdf",
@ -23491,7 +23491,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/1",
@ -26654,7 +26655,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/2",
@ -29187,7 +29189,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/3",
@ -31574,7 +31577,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/4",
@ -34177,7 +34181,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "2305.03393v1-pg9",
"origin": {
"mimetype": "application/pdf",
@ -2104,7 +2104,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -2705,7 +2705,7 @@
"b": 102.78223000000003,
"coord_origin": "TOPLEFT"
},
"confidence": 0.9373534917831421,
"confidence": 0.9373533725738525,
"cells": [
{
"index": 0,
@ -2745,7 +2745,7 @@
"b": 102.78223000000003,
"coord_origin": "TOPLEFT"
},
"confidence": 0.8858680725097656,
"confidence": 0.8858679533004761,
"cells": [
{
"index": 1,
@ -13641,7 +13641,7 @@
"b": 102.78223000000003,
"coord_origin": "TOPLEFT"
},
"confidence": 0.9373534917831421,
"confidence": 0.9373533725738525,
"cells": [
{
"index": 0,
@ -13687,7 +13687,7 @@
"b": 102.78223000000003,
"coord_origin": "TOPLEFT"
},
"confidence": 0.8858680725097656,
"confidence": 0.8858679533004761,
"cells": [
{
"index": 1,
@ -26499,7 +26499,7 @@
"b": 102.78223000000003,
"coord_origin": "TOPLEFT"
},
"confidence": 0.9373534917831421,
"confidence": 0.9373533725738525,
"cells": [
{
"index": 0,
@ -26545,7 +26545,7 @@
"b": 102.78223000000003,
"coord_origin": "TOPLEFT"
},
"confidence": 0.8858680725097656,
"confidence": 0.8858679533004761,
"cells": [
{
"index": 1,

View File

@ -60,6 +60,8 @@
<page_header><loc_159><loc_59><loc_366><loc_64>Optimized Table Tokenization for Table Structure Recognition</page_header>
<page_header><loc_389><loc_59><loc_393><loc_64>7</page_header>
<picture><loc_135><loc_103><loc_367><loc_177><caption><loc_110><loc_79><loc_393><loc_98>Fig. 3. OTSL description of table structure: A - table example; B - graphical representation of table structure; C - mapping structure on a grid; D - OTSL structure encoding; E - explanation on cell encoding</caption></picture>
<unordered_list><list_item><loc_273><loc_172><loc_349><loc_176>4 - 2d merges: "C", "L", "U", "X"</list_item>
</unordered_list>
<section_header_level_1><loc_110><loc_193><loc_202><loc_198>4.2 Language Syntax</section_header_level_1>
<text><loc_110><loc_205><loc_297><loc_211>The OTSL representation follows these syntax rules:</text>
<unordered_list><list_item><loc_114><loc_219><loc_393><loc_232>1. Left-looking cell rule : The left neighbour of an "L" cell must be either another "L" cell or a "C" cell.</list_item>

File diff suppressed because it is too large

View File

@ -84,6 +84,8 @@ Fig. 3. OTSL description of table structure: A - table example; B - graphical re
<!-- image -->
- 4 - 2d merges: "C", "L", "U", "X"
## 4.2 Language Syntax
The OTSL representation follows these syntax rules:

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "amt_handbook_sample",
"origin": {
"mimetype": "application/pdf",

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "code_and_formula",
"origin": {
"mimetype": "application/pdf",

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "csv-comma-in-cell",
"origin": {
"mimetype": "text/csv",
@ -538,7 +538,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "csv-comma",
"origin": {
"mimetype": "text/csv",
@ -1788,7 +1788,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "csv-inconsistent-header",
"origin": {
"mimetype": "text/csv",
@ -526,7 +526,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "csv-pipe",
"origin": {
"mimetype": "text/csv",
@ -1788,7 +1788,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "csv-semicolon",
"origin": {
"mimetype": "text/csv",
@ -1788,7 +1788,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "csv-tab",
"origin": {
"mimetype": "text/csv",
@ -1788,7 +1788,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "csv-too-few-columns",
"origin": {
"mimetype": "text/csv",
@ -526,7 +526,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "csv-too-many-columns",
"origin": {
"mimetype": "text/csv",
@ -610,7 +610,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "equations",
"origin": {
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@ -250,7 +250,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -280,7 +281,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -322,7 +324,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -436,7 +439,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -466,7 +470,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -520,7 +525,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -634,7 +640,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "example_01",
"origin": {
"mimetype": "text/html",

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "example_02",
"origin": {
"mimetype": "text/html",

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "example_03",
"origin": {
"mimetype": "text/html",
@ -637,7 +637,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "example_04",
"origin": {
"mimetype": "text/html",
@ -325,7 +325,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "example_05",
"origin": {
"mimetype": "text/html",
@ -325,7 +325,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "example_06",
"origin": {
"mimetype": "text/html",

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "example_07",
"origin": {
"mimetype": "text/html",

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "example_08",
"origin": {
"mimetype": "text/html",
@ -661,7 +661,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/1",
@ -1330,7 +1331,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/2",
@ -1999,7 +2001,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -11,10 +11,12 @@ Create your feature branch: `git checkout -b feature/AmazingFeature` .
3. Commit your changes ( `git commit -m 'Add some AmazingFeature'` )
4. Push to the branch ( `git push origin feature/AmazingFeature` )
5. Open a Pull Request
6. **Whole list item has same formatting**
7. List item has *mixed or partial* formatting
##
*# Whole heading is italic*
*Second* section
&lt;&lt;&lt;&lt;&lt;&lt;&lt; HEAD
- **First** : Lorem ipsum.
- **Second** : Dolor `sit` amet.
@ -22,3 +24,13 @@ Create your feature branch: `git checkout -b feature/AmazingFeature` .
| Bold Heading | Italic Heading |
|----------------|------------------|
| data a | data b |
Some *`formatted_code`*
##
*Partially formatted* heading to\_escape `not_to_escape`
[$$E=mc^2$$](https://en.wikipedia.org/wiki/Albert_Einstein)
origin/main

View File

@ -5,10 +5,14 @@ body:
- $ref: '#/groups/0'
- $ref: '#/groups/1'
- $ref: '#/groups/2'
- $ref: '#/texts/27'
- $ref: '#/texts/32'
- $ref: '#/texts/33'
- $ref: '#/groups/8'
- $ref: '#/groups/11'
- $ref: '#/tables/0'
- $ref: '#/groups/11'
- $ref: '#/texts/44'
- $ref: '#/texts/48'
- $ref: '#/texts/49'
content_layer: body
label: unspecified
name: _root_
@ -49,6 +53,8 @@ groups:
- $ref: '#/texts/18'
- $ref: '#/texts/22'
- $ref: '#/texts/26'
- $ref: '#/texts/27'
- $ref: '#/texts/28'
content_layer: body
label: ordered_list
name: list
@ -96,17 +102,18 @@ groups:
$ref: '#/texts/22'
self_ref: '#/groups/6'
- children:
- $ref: '#/texts/28'
- $ref: '#/texts/29'
- $ref: '#/texts/30'
- $ref: '#/texts/31'
content_layer: body
label: inline
name: group
parent:
$ref: '#/texts/27'
$ref: '#/texts/28'
self_ref: '#/groups/7'
- children:
- $ref: '#/texts/30'
- $ref: '#/texts/33'
- $ref: '#/texts/34'
- $ref: '#/texts/37'
content_layer: body
label: list
name: list
@ -114,36 +121,48 @@ groups:
$ref: '#/body'
self_ref: '#/groups/8'
- children:
- $ref: '#/texts/31'
- $ref: '#/texts/32'
content_layer: body
label: inline
name: group
parent:
$ref: '#/texts/30'
self_ref: '#/groups/9'
- children:
- $ref: '#/texts/34'
- $ref: '#/texts/35'
- $ref: '#/texts/36'
- $ref: '#/texts/37'
content_layer: body
label: inline
name: group
parent:
$ref: '#/texts/33'
$ref: '#/texts/34'
self_ref: '#/groups/9'
- children:
- $ref: '#/texts/38'
- $ref: '#/texts/39'
- $ref: '#/texts/40'
- $ref: '#/texts/41'
content_layer: body
label: inline
name: group
parent:
$ref: '#/texts/37'
self_ref: '#/groups/10'
- children: []
- children:
- $ref: '#/texts/42'
- $ref: '#/texts/43'
content_layer: body
label: inline
name: group
parent:
$ref: '#/body'
self_ref: '#/groups/11'
- children:
- $ref: '#/texts/45'
- $ref: '#/texts/46'
- $ref: '#/texts/47'
content_layer: body
label: inline
name: group
parent:
$ref: '#/texts/44'
self_ref: '#/groups/12'
key_value_items: []
name: inline_and_formatting
origin:
binary_hash: 15980020574215496313
binary_hash: 1036526097556828366
filename: inline_and_formatting.md
mimetype: text/markdown
pages: {}
@ -613,18 +632,47 @@ texts:
self_ref: '#/texts/26'
text: Open a Pull Request
word_items_ids: []
- children: []
content_layer: body
enumerated: true
formatting:
bold: true
italic: false
script: baseline
strikethrough: false
underline: false
label: list_item
marker: '-'
orig: Whole list item has same formatting
parent:
$ref: '#/groups/2'
prov: []
self_ref: '#/texts/27'
text: Whole list item has same formatting
word_items_ids: []
- children:
- $ref: '#/groups/7'
content_layer: body
label: section_header
level: 1
enumerated: true
label: list_item
marker: '-'
orig: ''
parent:
$ref: '#/body'
$ref: '#/groups/2'
prov: []
self_ref: '#/texts/27'
self_ref: '#/texts/28'
text: ''
word_items_ids: []
- children: []
content_layer: body
label: text
orig: List item has
parent:
$ref: '#/groups/7'
prov: []
self_ref: '#/texts/29'
text: List item has
word_items_ids: []
- children: []
content_layer: body
formatting:
@ -634,22 +682,48 @@ texts:
strikethrough: false
underline: false
label: text
orig: Second
orig: mixed or partial
parent:
$ref: '#/groups/7'
prov: []
self_ref: '#/texts/28'
text: Second
self_ref: '#/texts/30'
text: mixed or partial
word_items_ids: []
- children: []
content_layer: body
label: text
orig: section
orig: formatting
parent:
$ref: '#/groups/7'
prov: []
self_ref: '#/texts/29'
text: section
self_ref: '#/texts/31'
text: formatting
word_items_ids: []
- children: []
content_layer: body
formatting:
bold: false
italic: true
script: baseline
strikethrough: false
underline: false
label: title
orig: Whole heading is italic
parent:
$ref: '#/body'
prov: []
self_ref: '#/texts/32'
text: Whole heading is italic
word_items_ids: []
- children: []
content_layer: body
label: text
orig: <<<<<<< HEAD
parent:
$ref: '#/body'
prov: []
self_ref: '#/texts/33'
text: <<<<<<< HEAD
word_items_ids: []
- children:
- $ref: '#/groups/9'
@ -661,7 +735,7 @@ texts:
parent:
$ref: '#/groups/8'
prov: []
self_ref: '#/texts/30'
self_ref: '#/texts/34'
text: ''
word_items_ids: []
- children: []
@ -677,7 +751,7 @@ texts:
parent:
$ref: '#/groups/9'
prov: []
self_ref: '#/texts/31'
self_ref: '#/texts/35'
text: First
word_items_ids: []
- children: []
@ -687,7 +761,7 @@ texts:
parent:
$ref: '#/groups/9'
prov: []
self_ref: '#/texts/32'
self_ref: '#/texts/36'
text: ': Lorem ipsum.'
word_items_ids: []
- children:
@ -700,7 +774,7 @@ texts:
parent:
$ref: '#/groups/8'
prov: []
self_ref: '#/texts/33'
self_ref: '#/texts/37'
text: ''
word_items_ids: []
- children: []
@ -716,7 +790,7 @@ texts:
parent:
$ref: '#/groups/10'
prov: []
self_ref: '#/texts/34'
self_ref: '#/texts/38'
text: Second
word_items_ids: []
- children: []
@ -726,7 +800,7 @@ texts:
parent:
$ref: '#/groups/10'
prov: []
self_ref: '#/texts/35'
self_ref: '#/texts/39'
text: ': Dolor'
word_items_ids: []
- captions: []
@ -740,7 +814,7 @@ texts:
$ref: '#/groups/10'
prov: []
references: []
self_ref: '#/texts/36'
self_ref: '#/texts/40'
text: sit
word_items_ids: []
- children: []
@ -750,7 +824,110 @@ texts:
parent:
$ref: '#/groups/10'
prov: []
self_ref: '#/texts/37'
self_ref: '#/texts/41'
text: amet.
word_items_ids: []
- children: []
content_layer: body
label: text
orig: Some
parent:
$ref: '#/groups/11'
prov: []
self_ref: '#/texts/42'
text: Some
word_items_ids: []
- captions: []
children: []
code_language: unknown
content_layer: body
footnotes: []
formatting:
bold: false
italic: true
script: baseline
strikethrough: false
underline: false
label: code
orig: formatted_code
parent:
$ref: '#/groups/11'
prov: []
references: []
self_ref: '#/texts/43'
text: formatted_code
word_items_ids: []
- children:
- $ref: '#/groups/12'
content_layer: body
label: section_header
level: 1
orig: ''
parent:
$ref: '#/body'
prov: []
self_ref: '#/texts/44'
text: ''
word_items_ids: []
- children: []
content_layer: body
formatting:
bold: false
italic: true
script: baseline
strikethrough: false
underline: false
label: text
orig: Partially formatted
parent:
$ref: '#/groups/12'
prov: []
self_ref: '#/texts/45'
text: Partially formatted
word_items_ids: []
- children: []
content_layer: body
label: text
orig: heading to_escape
parent:
$ref: '#/groups/12'
prov: []
self_ref: '#/texts/46'
text: heading to_escape
word_items_ids: []
- captions: []
children: []
code_language: unknown
content_layer: body
footnotes: []
label: code
orig: not_to_escape
parent:
$ref: '#/groups/12'
prov: []
references: []
self_ref: '#/texts/47'
text: not_to_escape
word_items_ids: []
- children: []
content_layer: body
hyperlink: https://en.wikipedia.org/wiki/Albert_Einstein
label: text
orig: $$E=mc^2$$
parent:
$ref: '#/body'
prov: []
self_ref: '#/texts/48'
text: $$E=mc^2$$
word_items_ids: []
- children: []
content_layer: body
label: text
orig: origin/main
parent:
$ref: '#/body'
prov: []
self_ref: '#/texts/49'
text: origin/main
word_items_ids: []
version: 1.4.0
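
Review note: most of the churn in this ground-truth file is `self_ref` renumbering (for example `#/texts/27` becoming `#/texts/28`), which is hard to verify by eye. The sketch below is one way to sanity-check that every `children` entry resolves and that the child's `parent` points back to the item listing it. It assumes the fixture has already been loaded into a plain `dict` (for example with a YAML or JSON loader). The function name `find_ref_problems` is hypothetical and not part of docling.

```python
def find_ref_problems(doc: dict) -> list:
    """Return human-readable problems with parent/children references.

    Assumes the layout visible in this diff: a top-level "body" item plus
    lists such as "groups", "texts", "tables", each item carrying
    "self_ref", an optional "parent" {"$ref": ...} and "children" [{"$ref": ...}].
    """
    index = {}
    body = doc.get("body")
    if body is not None:
        # The body item may omit self_ref in a trimmed fixture; default to "#/body".
        index[body.get("self_ref", "#/body")] = body
    for key in ("groups", "texts", "tables", "pictures"):
        for item in doc.get(key, []):
            index[item["self_ref"]] = item

    problems = []
    for ref, item in index.items():
        for child in item.get("children", []):
            child_ref = child["$ref"]
            target = index.get(child_ref)
            if target is None:
                problems.append(f"{ref}: child {child_ref} does not exist")
                continue
            parent_ref = (target.get("parent") or {}).get("$ref")
            if parent_ref != ref:
                problems.append(
                    f"{child_ref}: parent is {parent_ref}, expected {ref}"
                )
    return problems
```

An empty list means the references checked here are consistent. It does not confirm that the text content or reading order is correct.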

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "ipa20180000016.xml",
"origin": {
"mimetype": "application/xml",
@ -6005,7 +6005,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "ipa20200022300.xml",
"origin": {
"mimetype": "application/xml",

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "lorem_ipsum",
"origin": {
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@ -66,7 +66,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -96,7 +97,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -126,7 +128,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -156,7 +159,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -186,7 +190,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
}
],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "multi_page",
"origin": {
"mimetype": "application/pdf",

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "pa20010031492.xml",
"origin": {
"mimetype": "application/xml",
@ -2127,7 +2127,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "pftaps057006474.txt",
"origin": {
"mimetype": "text/plain",

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "pg06442728.xml",
"origin": {
"mimetype": "application/xml",

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "picture_classification",
"origin": {
"mimetype": "application/pdf",

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "powerpoint_bad_text",
"origin": {
"mimetype": "application/vnd.ms-powerpoint",

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "powerpoint_sample",
"origin": {
"mimetype": "application/vnd.ms-powerpoint",
@ -2199,7 +2199,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "powerpoint_with_image",
"origin": {
"mimetype": "application/vnd.ms-powerpoint",

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "redp5110_sampled",
"origin": {
"mimetype": "application/pdf",
@ -12471,7 +12471,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/1",
@ -13096,7 +13097,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/2",
@ -15356,7 +15358,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/3",
@ -15713,7 +15716,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/4",
@ -16918,7 +16922,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "right_to_left_01",
"origin": {
"mimetype": "application/pdf",

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "right_to_left_02",
"origin": {
"mimetype": "application/pdf",

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "right_to_left_03",
"origin": {
"mimetype": "application/pdf",

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "sample_sales_data",
"origin": {
"mimetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
@ -2136,7 +2136,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "tablecell",
"origin": {
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@ -78,7 +78,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -98,7 +99,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -130,7 +132,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -172,7 +175,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
}
],
@ -419,7 +423,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],
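
Review note: across these JSON fixtures the 1.3.0 to 1.4.0 bump shows up as three mechanical changes: the `version` string, a new `"script": "baseline"` key inside each `formatting` object, and a new `"annotations": []` key on each table. The sketch below applies those same changes to an already-parsed dict, which may help when regenerating or reviewing similar fixtures by hand. It is an illustration based only on what this diff shows. The helper name `upgrade_to_1_4_0` is hypothetical, and this is not how the library itself produces the new files.

```python
import copy


def upgrade_to_1_4_0(doc: dict) -> dict:
    """Return a copy of a 1.3.0-style document dict with the defaults seen in this diff."""
    out = copy.deepcopy(doc)  # leave the caller's dict untouched
    out["version"] = "1.4.0"

    # Formatting objects gain a "script" key; items without formatting are left alone.
    for text in out.get("texts", []):
        formatting = text.get("formatting")
        if isinstance(formatting, dict):
            formatting.setdefault("script", "baseline")

    # Tables gain an empty "annotations" list.
    for table in out.get("tables", []):
        table.setdefault("annotations", [])

    return out
```

`setdefault` keeps any value that is already present, so running the helper on a file that is already at 1.4.0 leaves it unchanged.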

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "test-01",
"origin": {
"mimetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
@ -681,7 +681,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/1",
@ -1599,7 +1600,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/2",
@ -2005,7 +2007,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/3",
@ -2411,7 +2414,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/4",
@ -2893,7 +2897,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/5",
@ -3375,7 +3380,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "test_emf_docx",
"origin": {
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@ -60,7 +60,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -78,7 +79,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -96,7 +98,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -114,7 +117,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
}
],

View File

@ -11,83 +11,82 @@ item-0 at level 0: unspecified: group _root_
Blisters
Headache
Sore throat
item-9 at level 1: list_item:
item-9 at level 1: paragraph:
item-10 at level 1: paragraph:
item-11 at level 1: paragraph:
item-12 at level 1: section: group textbox
item-13 at level 2: paragraph: If a caregiver suspects that wit ... the same suggested reportable symptoms
item-11 at level 1: section: group textbox
item-12 at level 2: paragraph: If a caregiver suspects that wit ... the same suggested reportable symptoms
item-13 at level 1: paragraph:
item-14 at level 1: paragraph:
item-15 at level 1: paragraph:
item-16 at level 1: paragraph:
item-17 at level 1: paragraph:
item-18 at level 1: section: group textbox
item-19 at level 2: paragraph: Yes
item-17 at level 1: section: group textbox
item-18 at level 2: paragraph: Yes
item-19 at level 1: paragraph:
item-20 at level 1: paragraph:
item-21 at level 1: paragraph:
item-22 at level 1: section: group textbox
item-23 at level 2: list: group list
item-24 at level 3: list_item: A report must be submitted withi ... saster Prevention Information Network.
item-25 at level 3: list_item: A report must also be submitted ... d Infectious Disease Reporting System.
item-26 at level 2: paragraph:
item-27 at level 1: list: group list
item-28 at level 2: list_item:
item-21 at level 1: section: group textbox
item-22 at level 2: list: group list
item-23 at level 3: list_item: A report must be submitted withi ... saster Prevention Information Network.
item-24 at level 3: list_item: A report must also be submitted ... d Infectious Disease Reporting System.
item-25 at level 2: paragraph:
item-26 at level 1: list: group list
item-27 at level 2: list_item:
item-28 at level 1: paragraph:
item-29 at level 1: paragraph:
item-30 at level 1: paragraph:
item-31 at level 1: paragraph:
item-32 at level 1: paragraph:
item-33 at level 1: paragraph:
item-34 at level 1: section: group textbox
item-35 at level 2: paragraph: Health Bureau:
item-36 at level 2: paragraph: Upon receiving a report from the ... rt to the Centers for Disease Control.
item-37 at level 2: list: group list
item-38 at level 3: list_item: If necessary, provide health edu ... vidual to undergo specimen collection.
item-39 at level 3: list_item: Implement appropriate epidemic p ... the Communicable Disease Control Act.
item-40 at level 2: paragraph:
item-41 at level 1: list: group list
item-42 at level 2: list_item:
item-43 at level 1: paragraph:
item-44 at level 1: section: group textbox
item-45 at level 2: paragraph: Department of Education:
item-33 at level 1: section: group textbox
item-34 at level 2: paragraph: Health Bureau:
item-35 at level 2: paragraph: Upon receiving a report from the ... rt to the Centers for Disease Control.
item-36 at level 2: list: group list
item-37 at level 3: list_item: If necessary, provide health edu ... vidual to undergo specimen collection.
item-38 at level 3: list_item: Implement appropriate epidemic p ... the Communicable Disease Control Act.
item-39 at level 2: paragraph:
item-40 at level 1: list: group list
item-41 at level 2: list_item:
item-42 at level 1: paragraph:
item-43 at level 1: section: group textbox
item-44 at level 2: paragraph: Department of Education:
Collabo ... vention measures at all school levels.
item-45 at level 1: paragraph:
item-46 at level 1: paragraph:
item-47 at level 1: paragraph:
item-48 at level 1: paragraph:
item-49 at level 1: paragraph:
item-50 at level 1: paragraph:
item-51 at level 1: paragraph:
item-52 at level 1: paragraph:
item-53 at level 1: section: group textbox
item-54 at level 2: inline: group group
item-55 at level 3: paragraph: The Health Bureau will handle
item-56 at level 3: paragraph: reporting and specimen collection
item-57 at level 3: paragraph: .
item-58 at level 2: paragraph:
item-52 at level 1: section: group textbox
item-53 at level 2: inline: group group
item-54 at level 3: paragraph: The Health Bureau will handle
item-55 at level 3: paragraph: reporting and specimen collection
item-56 at level 3: paragraph: .
item-57 at level 2: paragraph:
item-58 at level 1: paragraph:
item-59 at level 1: paragraph:
item-60 at level 1: paragraph:
item-61 at level 1: paragraph:
item-62 at level 1: section: group textbox
item-63 at level 2: paragraph: Whether the epidemic has eased.
item-64 at level 2: paragraph:
item-65 at level 1: paragraph:
item-66 at level 1: section: group textbox
item-67 at level 2: paragraph: Whether the test results are pos ... legally designated infectious disease.
item-68 at level 2: paragraph: No
item-61 at level 1: section: group textbox
item-62 at level 2: paragraph: Whether the epidemic has eased.
item-63 at level 2: paragraph:
item-64 at level 1: paragraph:
item-65 at level 1: section: group textbox
item-66 at level 2: paragraph: Whether the test results are pos ... legally designated infectious disease.
item-67 at level 2: paragraph: No
item-68 at level 1: paragraph:
item-69 at level 1: paragraph:
item-70 at level 1: paragraph:
item-71 at level 1: section: group textbox
item-72 at level 2: paragraph: Yes
item-73 at level 1: paragraph:
item-74 at level 1: section: group textbox
item-75 at level 2: paragraph: Yes
item-70 at level 1: section: group textbox
item-71 at level 2: paragraph: Yes
item-72 at level 1: paragraph:
item-73 at level 1: section: group textbox
item-74 at level 2: paragraph: Yes
item-75 at level 1: paragraph:
item-76 at level 1: paragraph:
item-77 at level 1: paragraph:
item-78 at level 1: section: group textbox
item-79 at level 2: paragraph: Case closed.
item-80 at level 2: paragraph:
item-81 at level 2: paragraph: The Health Bureau will carry out ... ters for Disease Control if necessary.
item-82 at level 1: paragraph:
item-83 at level 1: section: group textbox
item-84 at level 2: paragraph: No
item-77 at level 1: section: group textbox
item-78 at level 2: paragraph: Case closed.
item-79 at level 2: paragraph:
item-80 at level 2: paragraph: The Health Bureau will carry out ... ters for Disease Control if necessary.
item-81 at level 1: paragraph:
item-82 at level 1: section: group textbox
item-83 at level 2: paragraph: No
item-84 at level 1: paragraph:
item-85 at level 1: paragraph:
item-86 at level 1: paragraph:
item-87 at level 1: paragraph:

View File

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "textbox",
"origin": {
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@ -36,10 +36,10 @@
"$ref": "#/texts/7"
},
{
"$ref": "#/texts/8"
"$ref": "#/groups/2"
},
{
"$ref": "#/groups/2"
"$ref": "#/texts/9"
},
{
"$ref": "#/texts/10"
@ -50,17 +50,14 @@
{
"$ref": "#/texts/12"
},
{
"$ref": "#/texts/13"
},
{
"$ref": "#/groups/3"
},
{
"$ref": "#/texts/15"
"$ref": "#/texts/14"
},
{
"$ref": "#/texts/16"
"$ref": "#/texts/15"
},
{
"$ref": "#/groups/4"
@ -68,6 +65,9 @@
{
"$ref": "#/groups/6"
},
{
"$ref": "#/texts/20"
},
{
"$ref": "#/texts/21"
},
@ -80,9 +80,6 @@
{
"$ref": "#/texts/24"
},
{
"$ref": "#/texts/25"
},
{
"$ref": "#/groups/7"
},
@ -90,11 +87,14 @@
"$ref": "#/groups/9"
},
{
"$ref": "#/texts/32"
"$ref": "#/texts/31"
},
{
"$ref": "#/groups/10"
},
{
"$ref": "#/texts/33"
},
{
"$ref": "#/texts/34"
},
@ -114,10 +114,10 @@
"$ref": "#/texts/39"
},
{
"$ref": "#/texts/40"
"$ref": "#/groups/11"
},
{
"$ref": "#/groups/11"
"$ref": "#/texts/44"
},
{
"$ref": "#/texts/45"
@ -125,56 +125,53 @@
{
"$ref": "#/texts/46"
},
{
"$ref": "#/texts/47"
},
{
"$ref": "#/groups/13"
},
{
"$ref": "#/texts/50"
"$ref": "#/texts/49"
},
{
"$ref": "#/groups/14"
},
{
"$ref": "#/texts/53"
"$ref": "#/texts/52"
},
{
"$ref": "#/texts/54"
"$ref": "#/texts/53"
},
{
"$ref": "#/groups/15"
},
{
"$ref": "#/texts/56"
"$ref": "#/texts/55"
},
{
"$ref": "#/groups/16"
},
{
"$ref": "#/texts/58"
"$ref": "#/texts/57"
},
{
"$ref": "#/texts/59"
"$ref": "#/texts/58"
},
{
"$ref": "#/groups/17"
},
{
"$ref": "#/texts/63"
"$ref": "#/texts/62"
},
{
"$ref": "#/groups/18"
},
{
"$ref": "#/texts/64"
},
{
"$ref": "#/texts/65"
},
{
"$ref": "#/texts/66"
},
{
"$ref": "#/texts/67"
}
],
"content_layer": "body",
@ -223,7 +220,7 @@
},
"children": [
{
"$ref": "#/texts/9"
"$ref": "#/texts/8"
}
],
"content_layer": "body",
@ -237,7 +234,7 @@
},
"children": [
{
"$ref": "#/texts/14"
"$ref": "#/texts/13"
}
],
"content_layer": "body",
@ -254,7 +251,7 @@
"$ref": "#/groups/5"
},
{
"$ref": "#/texts/19"
"$ref": "#/texts/18"
}
],
"content_layer": "body",
@ -268,10 +265,10 @@
},
"children": [
{
"$ref": "#/texts/17"
"$ref": "#/texts/16"
},
{
"$ref": "#/texts/18"
"$ref": "#/texts/17"
}
],
"content_layer": "body",
@ -285,7 +282,7 @@
},
"children": [
{
"$ref": "#/texts/20"
"$ref": "#/texts/19"
}
],
"content_layer": "body",
@ -299,16 +296,16 @@
},
"children": [
{
"$ref": "#/texts/26"
"$ref": "#/texts/25"
},
{
"$ref": "#/texts/27"
"$ref": "#/texts/26"
},
{
"$ref": "#/groups/8"
},
{
"$ref": "#/texts/30"
"$ref": "#/texts/29"
}
],
"content_layer": "body",
@ -322,10 +319,10 @@
},
"children": [
{
"$ref": "#/texts/28"
"$ref": "#/texts/27"
},
{
"$ref": "#/texts/29"
"$ref": "#/texts/28"
}
],
"content_layer": "body",
@ -339,7 +336,7 @@
},
"children": [
{
"$ref": "#/texts/31"
"$ref": "#/texts/30"
}
],
"content_layer": "body",
@ -353,7 +350,7 @@
},
"children": [
{
"$ref": "#/texts/33"
"$ref": "#/texts/32"
}
],
"content_layer": "body",
@ -370,7 +367,7 @@
"$ref": "#/groups/12"
},
{
"$ref": "#/texts/44"
"$ref": "#/texts/43"
}
],
"content_layer": "body",
@ -383,14 +380,14 @@
"$ref": "#/groups/11"
},
"children": [
{
"$ref": "#/texts/40"
},
{
"$ref": "#/texts/41"
},
{
"$ref": "#/texts/42"
},
{
"$ref": "#/texts/43"
}
],
"content_layer": "body",
@ -404,10 +401,10 @@
},
"children": [
{
"$ref": "#/texts/48"
"$ref": "#/texts/47"
},
{
"$ref": "#/texts/49"
"$ref": "#/texts/48"
}
],
"content_layer": "body",
@ -421,10 +418,10 @@
},
"children": [
{
"$ref": "#/texts/51"
"$ref": "#/texts/50"
},
{
"$ref": "#/texts/52"
"$ref": "#/texts/51"
}
],
"content_layer": "body",
@ -438,7 +435,7 @@
},
"children": [
{
"$ref": "#/texts/55"
"$ref": "#/texts/54"
}
],
"content_layer": "body",
@ -452,7 +449,7 @@
},
"children": [
{
"$ref": "#/texts/57"
"$ref": "#/texts/56"
}
],
"content_layer": "body",
@ -465,14 +462,14 @@
"$ref": "#/body"
},
"children": [
{
"$ref": "#/texts/59"
},
{
"$ref": "#/texts/60"
},
{
"$ref": "#/texts/61"
},
{
"$ref": "#/texts/62"
}
],
"content_layer": "body",
@ -486,7 +483,7 @@
},
"children": [
{
"$ref": "#/texts/64"
"$ref": "#/texts/63"
}
],
"content_layer": "body",
@ -510,7 +507,8 @@
"bold": true,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -528,7 +526,8 @@
"bold": true,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -558,7 +557,8 @@
"bold": true,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -588,7 +588,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -600,12 +601,10 @@
},
"children": [],
"content_layer": "body",
"label": "list_item",
"label": "paragraph",
"prov": [],
"orig": "",
"text": "",
"enumerated": false,
"marker": "-"
"text": ""
},
{
"self_ref": "#/texts/7",
@ -621,18 +620,6 @@
},
{
"self_ref": "#/texts/8",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/9",
"parent": {
"$ref": "#/groups/2"
},
@ -646,9 +633,22 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/9",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/10",
"parent": {
@ -687,18 +687,6 @@
},
{
"self_ref": "#/texts/13",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/14",
"parent": {
"$ref": "#/groups/3"
},
@ -712,9 +700,22 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/14",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/15",
"parent": {
@ -729,18 +730,6 @@
},
{
"self_ref": "#/texts/16",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/17",
"parent": {
"$ref": "#/groups/5"
},
@ -754,13 +743,14 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
},
{
"self_ref": "#/texts/18",
"self_ref": "#/texts/17",
"parent": {
"$ref": "#/groups/5"
},
@ -774,13 +764,14 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
},
{
"self_ref": "#/texts/19",
"self_ref": "#/texts/18",
"parent": {
"$ref": "#/groups/4"
},
@ -792,7 +783,7 @@
"text": ""
},
{
"self_ref": "#/texts/20",
"self_ref": "#/texts/19",
"parent": {
"$ref": "#/groups/6"
},
@ -805,6 +796,18 @@
"enumerated": false,
"marker": "-"
},
{
"self_ref": "#/texts/20",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/21",
"parent": {
@ -855,18 +858,6 @@
},
{
"self_ref": "#/texts/25",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/26",
"parent": {
"$ref": "#/groups/7"
},
@ -880,11 +871,12 @@
"bold": true,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/27",
"self_ref": "#/texts/26",
"parent": {
"$ref": "#/groups/7"
},
@ -898,11 +890,12 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/28",
"self_ref": "#/texts/27",
"parent": {
"$ref": "#/groups/8"
},
@ -916,13 +909,14 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
},
{
"self_ref": "#/texts/29",
"self_ref": "#/texts/28",
"parent": {
"$ref": "#/groups/8"
},
@ -936,13 +930,14 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
},
{
"self_ref": "#/texts/30",
"self_ref": "#/texts/29",
"parent": {
"$ref": "#/groups/7"
},
@ -954,7 +949,7 @@
"text": ""
},
{
"self_ref": "#/texts/31",
"self_ref": "#/texts/30",
"parent": {
"$ref": "#/groups/9"
},
@ -968,7 +963,7 @@
"marker": "-"
},
{
"self_ref": "#/texts/32",
"self_ref": "#/texts/31",
"parent": {
"$ref": "#/body"
},
@ -980,7 +975,7 @@
"text": ""
},
{
"self_ref": "#/texts/33",
"self_ref": "#/texts/32",
"parent": {
"$ref": "#/groups/10"
},
@ -994,9 +989,22 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/33",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/34",
"parent": {
@ -1071,18 +1079,6 @@
},
{
"self_ref": "#/texts/40",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/41",
"parent": {
"$ref": "#/groups/12"
},
@ -1096,11 +1092,12 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/42",
"self_ref": "#/texts/41",
"parent": {
"$ref": "#/groups/12"
},
@ -1114,11 +1111,12 @@
"bold": true,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/43",
"self_ref": "#/texts/42",
"parent": {
"$ref": "#/groups/12"
},
@ -1132,13 +1130,26 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/43",
"parent": {
"$ref": "#/groups/11"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/44",
"parent": {
"$ref": "#/groups/11"
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
@ -1173,18 +1184,6 @@
},
{
"self_ref": "#/texts/47",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/48",
"parent": {
"$ref": "#/groups/13"
},
@ -1198,11 +1197,12 @@
"bold": true,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/49",
"self_ref": "#/texts/48",
"parent": {
"$ref": "#/groups/13"
},
@ -1214,7 +1214,7 @@
"text": ""
},
{
"self_ref": "#/texts/50",
"self_ref": "#/texts/49",
"parent": {
"$ref": "#/body"
},
@ -1226,7 +1226,7 @@
"text": ""
},
{
"self_ref": "#/texts/51",
"self_ref": "#/texts/50",
"parent": {
"$ref": "#/groups/14"
},
@ -1240,11 +1240,12 @@
"bold": true,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/52",
"self_ref": "#/texts/51",
"parent": {
"$ref": "#/groups/14"
},
@ -1258,9 +1259,22 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/52",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/53",
"parent": {
@ -1275,18 +1289,6 @@
},
{
"self_ref": "#/texts/54",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/55",
"parent": {
"$ref": "#/groups/15"
},
@ -1300,11 +1302,12 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/56",
"self_ref": "#/texts/55",
"parent": {
"$ref": "#/body"
},
@ -1316,7 +1319,7 @@
"text": ""
},
{
"self_ref": "#/texts/57",
"self_ref": "#/texts/56",
"parent": {
"$ref": "#/groups/16"
},
@ -1330,9 +1333,22 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/57",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/58",
"parent": {
@ -1347,18 +1363,6 @@
},
{
"self_ref": "#/texts/59",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/60",
"parent": {
"$ref": "#/groups/17"
},
@ -1372,11 +1376,12 @@
"bold": true,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/61",
"self_ref": "#/texts/60",
"parent": {
"$ref": "#/groups/17"
},
@ -1388,7 +1393,7 @@
"text": ""
},
{
"self_ref": "#/texts/62",
"self_ref": "#/texts/61",
"parent": {
"$ref": "#/groups/17"
},
@ -1402,11 +1407,12 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/63",
"self_ref": "#/texts/62",
"parent": {
"$ref": "#/body"
},
@ -1418,7 +1424,7 @@
"text": ""
},
{
"self_ref": "#/texts/64",
"self_ref": "#/texts/63",
"parent": {
"$ref": "#/groups/18"
},
@ -1432,9 +1438,22 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/64",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/65",
"parent": {
@ -1458,18 +1477,6 @@
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/67",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
}
],
"pictures": [],

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "unit_test_01",
"origin": {
"mimetype": "text/html",

@ -17,14 +17,16 @@ item-0 at level 0: unspecified: group _root_
item-16 at level 2: list_item: Italic bullet 1
item-17 at level 2: list_item: Bold bullet 2
item-18 at level 2: list_item: Underline bullet 3
item-19 at level 2: inline: group group
item-20 at level 3: list_item: Some
item-21 at level 3: list_item: italic
item-22 at level 3: list_item: bold
item-23 at level 3: list_item: underline
item-24 at level 2: list: group list
item-25 at level 3: inline: group group
item-26 at level 4: list_item: Nested
item-27 at level 4: list_item: italic
item-28 at level 4: list_item: bold
item-29 at level 1: paragraph:
item-19 at level 2: list_item:
item-20 at level 3: inline: group group
item-21 at level 4: text: Some
item-22 at level 4: text: italic
item-23 at level 4: text: bold
item-24 at level 4: text: underline
item-25 at level 2: list: group list
item-26 at level 3: list_item:
item-27 at level 4: inline: group group
item-28 at level 5: text: Nested
item-29 at level 5: text: italic
item-30 at level 5: text: bold
item-31 at level 1: paragraph:

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "unit_test_formatting",
"origin": {
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@ -42,7 +42,7 @@
"$ref": "#/groups/1"
},
{
"$ref": "#/texts/23"
"$ref": "#/texts/25"
}
],
"content_layer": "body",
@ -98,7 +98,7 @@
"$ref": "#/texts/15"
},
{
"$ref": "#/groups/2"
"$ref": "#/texts/16"
},
{
"$ref": "#/groups/3"
@ -111,12 +111,9 @@
{
"self_ref": "#/groups/2",
"parent": {
"$ref": "#/groups/1"
},
"children": [
{
"$ref": "#/texts/16"
},
"children": [
{
"$ref": "#/texts/17"
},
@ -125,6 +122,9 @@
},
{
"$ref": "#/texts/19"
},
{
"$ref": "#/texts/20"
}
],
"content_layer": "body",
@ -138,7 +138,7 @@
},
"children": [
{
"$ref": "#/groups/4"
"$ref": "#/texts/21"
}
],
"content_layer": "body",
@ -148,17 +148,17 @@
{
"self_ref": "#/groups/4",
"parent": {
"$ref": "#/groups/3"
"$ref": "#/texts/21"
},
"children": [
{
"$ref": "#/texts/20"
},
{
"$ref": "#/texts/21"
},
{
"$ref": "#/texts/22"
},
{
"$ref": "#/texts/23"
},
{
"$ref": "#/texts/24"
}
],
"content_layer": "body",
@ -182,7 +182,8 @@
"bold": false,
"italic": true,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -200,7 +201,8 @@
"bold": true,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -218,7 +220,8 @@
"bold": false,
"italic": false,
"underline": true,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -236,7 +239,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"hyperlink": "https:/github.com/DS4SD/docling"
},
@ -255,7 +259,8 @@
"bold": true,
"italic": true,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"hyperlink": "https:/github.com/DS4SD/docling"
},
@ -274,7 +279,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -292,7 +298,8 @@
"bold": false,
"italic": true,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -310,7 +317,8 @@
"bold": true,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -328,7 +336,8 @@
"bold": false,
"italic": false,
"underline": true,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -346,7 +355,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -364,7 +374,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"hyperlink": "https:/github.com/DS4SD/docling"
},
@ -383,7 +394,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -413,7 +425,8 @@
"bold": false,
"italic": true,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -433,7 +446,8 @@
"bold": true,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -453,7 +467,8 @@
"bold": false,
"italic": false,
"underline": true,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -461,20 +476,18 @@
{
"self_ref": "#/texts/16",
"parent": {
"$ref": "#/groups/2"
"$ref": "#/groups/1"
},
"children": [],
"children": [
{
"$ref": "#/groups/2"
}
],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "Some",
"text": "Some",
"formatting": {
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
},
"orig": "",
"text": "",
"enumerated": false,
"marker": "-"
},
@ -485,18 +498,17 @@
},
"children": [],
"content_layer": "body",
"label": "list_item",
"label": "text",
"prov": [],
"orig": "italic",
"text": "italic",
"orig": "Some",
"text": "Some",
"formatting": {
"bold": false,
"italic": true,
"italic": false,
"underline": false,
"strikethrough": false
},
"enumerated": false,
"marker": "-"
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/18",
@ -505,18 +517,17 @@
},
"children": [],
"content_layer": "body",
"label": "list_item",
"label": "text",
"prov": [],
"orig": "bold",
"text": "bold",
"orig": "italic",
"text": "italic",
"formatting": {
"bold": true,
"italic": false,
"bold": false,
"italic": true,
"underline": false,
"strikethrough": false
},
"enumerated": false,
"marker": "-"
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/19",
@ -525,7 +536,26 @@
},
"children": [],
"content_layer": "body",
"label": "list_item",
"label": "text",
"prov": [],
"orig": "bold",
"text": "bold",
"formatting": {
"bold": true,
"italic": false,
"underline": false,
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/20",
"parent": {
"$ref": "#/groups/2"
},
"children": [],
"content_layer": "body",
"label": "text",
"prov": [],
"orig": "underline",
"text": "underline",
@ -533,48 +563,25 @@
"bold": false,
"italic": false,
"underline": true,
"strikethrough": false
},
"enumerated": false,
"marker": "-"
},
{
"self_ref": "#/texts/20",
"parent": {
"$ref": "#/groups/4"
},
"children": [],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "Nested",
"text": "Nested",
"formatting": {
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
},
"enumerated": false,
"marker": "-"
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/21",
"parent": {
"$ref": "#/groups/4"
"$ref": "#/groups/3"
},
"children": [],
"children": [
{
"$ref": "#/groups/4"
}
],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "italic",
"text": "italic",
"formatting": {
"bold": false,
"italic": true,
"underline": false,
"strikethrough": false
},
"orig": "",
"text": "",
"enumerated": false,
"marker": "-"
},
@ -585,7 +592,45 @@
},
"children": [],
"content_layer": "body",
"label": "list_item",
"label": "text",
"prov": [],
"orig": "Nested",
"text": "Nested",
"formatting": {
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/23",
"parent": {
"$ref": "#/groups/4"
},
"children": [],
"content_layer": "body",
"label": "text",
"prov": [],
"orig": "italic",
"text": "italic",
"formatting": {
"bold": false,
"italic": true,
"underline": false,
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/24",
"parent": {
"$ref": "#/groups/4"
},
"children": [],
"content_layer": "body",
"label": "text",
"prov": [],
"orig": "bold",
"text": "bold",
@ -593,13 +638,12 @@
"bold": true,
"italic": false,
"underline": false,
"strikethrough": false
},
"enumerated": false,
"marker": "-"
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/23",
"self_ref": "#/texts/25",
"parent": {
"$ref": "#/body"
},

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "unit_test_headers",
"origin": {
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@ -138,7 +138,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -168,7 +169,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -239,7 +241,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -269,7 +272,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -343,7 +347,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -373,7 +378,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -447,7 +453,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -477,7 +484,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -566,7 +574,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -596,7 +605,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -667,7 +677,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -697,7 +708,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -771,7 +783,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -801,7 +814,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "unit_test_headers_numbered",
"origin": {
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@ -214,7 +214,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -244,7 +245,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -315,7 +317,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -345,7 +348,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -419,7 +423,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -449,7 +454,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -523,7 +529,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -553,7 +560,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -620,7 +628,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -650,7 +659,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -721,7 +731,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -751,7 +762,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -825,7 +837,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -855,7 +868,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "unit_test_lists",
"origin": {
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@ -370,7 +370,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -400,7 +401,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -450,7 +452,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -470,7 +473,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -490,7 +494,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -542,7 +547,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -562,7 +568,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -582,7 +589,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -634,7 +642,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -654,7 +663,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -674,7 +684,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -694,7 +705,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -714,7 +726,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -734,7 +747,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -786,7 +800,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -806,7 +821,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -826,7 +842,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -878,7 +895,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -898,7 +916,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -918,7 +937,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -938,7 +958,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -996,7 +1017,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -1016,7 +1038,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -1036,7 +1059,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -1056,7 +1080,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -1076,7 +1101,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -1096,7 +1122,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "wiki_duck",
"origin": {
"mimetype": "text/html",
@ -8489,7 +8489,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/1",
@ -8648,7 +8649,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

@ -0,0 +1,16 @@
item-0 at level 0: unspecified: group _root_
item-1 at level 1: paragraph: Transcript
item-2 at level 1: paragraph: February 20, 2025, 8:32PM
item-3 at level 1: picture
item-4 at level 1: inline: group group
item-5 at level 2: paragraph: This is test 1
item-6 at level 2: paragraph: 0:08
Correct, he is not.
item-7 at level 1: paragraph:
item-8 at level 1: picture
item-9 at level 1: inline: group group
item-10 at level 2: paragraph: This is test 2
item-11 at level 2: paragraph: 0:16
Yeah, exactly.
item-12 at level 1: paragraph:
item-13 at level 1: paragraph:

@ -0,0 +1,292 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"name": "word_image_anchors",
"origin": {
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"binary_hash": 2428692234257307633,
"filename": "word_image_anchors.docx"
},
"furniture": {
"self_ref": "#/furniture",
"children": [],
"content_layer": "furniture",
"name": "_root_",
"label": "unspecified"
},
"body": {
"self_ref": "#/body",
"children": [
{
"$ref": "#/texts/0"
},
{
"$ref": "#/texts/1"
},
{
"$ref": "#/pictures/0"
},
{
"$ref": "#/groups/0"
},
{
"$ref": "#/texts/4"
},
{
"$ref": "#/pictures/1"
},
{
"$ref": "#/groups/1"
},
{
"$ref": "#/texts/7"
},
{
"$ref": "#/texts/8"
}
],
"content_layer": "body",
"name": "_root_",
"label": "unspecified"
},
"groups": [
{
"self_ref": "#/groups/0",
"parent": {
"$ref": "#/body"
},
"children": [
{
"$ref": "#/texts/2"
},
{
"$ref": "#/texts/3"
}
],
"content_layer": "body",
"name": "group",
"label": "inline"
},
{
"self_ref": "#/groups/1",
"parent": {
"$ref": "#/body"
},
"children": [
{
"$ref": "#/texts/5"
},
{
"$ref": "#/texts/6"
}
],
"content_layer": "body",
"name": "group",
"label": "inline"
}
],
"texts": [
{
"self_ref": "#/texts/0",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "Transcript",
"text": "Transcript",
"formatting": {
"bold": true,
"italic": false,
"underline": false,
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/1",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "February 20, 2025, 8:32PM",
"text": "February 20, 2025, 8:32PM",
"formatting": {
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/2",
"parent": {
"$ref": "#/groups/0"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "This is test 1",
"text": "This is test 1",
"formatting": {
"bold": true,
"italic": false,
"underline": false,
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/3",
"parent": {
"$ref": "#/groups/0"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "0:08\nCorrect, he is not.",
"text": "0:08\nCorrect, he is not.",
"formatting": {
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/4",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/5",
"parent": {
"$ref": "#/groups/1"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "This is test 2",
"text": "This is test 2",
"formatting": {
"bold": true,
"italic": false,
"underline": false,
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/6",
"parent": {
"$ref": "#/groups/1"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "0:16\nYeah, exactly.",
"text": "0:16\nYeah, exactly.",
"formatting": {
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/7",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/8",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
}
],
"pictures": [
{
"self_ref": "#/pictures/0",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "picture",
"prov": [],
"captions": [],
"references": [],
"footnotes": [],
"image": {
"mimetype": "image/png",
"dpi": 72,
"size": {
"width": 100.0,
"height": 100.0
},
"uri": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAGQAAABkCAYAAABw4pVUAAAAz0lEQVR4nO3bUW0CURRF0TukQvDSauBr0mACE1VBAzYQg5Lpdw0wO2EtA+cl+/6+GQAAAAAAAAAAAADe1DIR53X9mcNcdhnf5nm93Y8T8DElyzyuv/evlx/CMqeJOOz9AP4TJEaQGEFiBIkRJEaQGEFiBIkRJEaQGEFiBIkRJEaQGEFiBIkRJEaQGEFiBIkRJEaQGEFiBIkRJEaQGEFiBIkRJEaQGEFiBIkRJEaQGEFiWp8+t/k8f6/bDrvPl28CAAAAAAAAAAAAAAAAzLv5A5bTEG2TIIlOAAAAAElFTkSuQmCC"
},
"annotations": []
},
{
"self_ref": "#/pictures/1",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "picture",
"prov": [],
"captions": [],
"references": [],
"footnotes": [],
"image": {
"mimetype": "image/png",
"dpi": 72,
"size": {
"width": 100.0,
"height": 100.0
},
"uri": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAGQAAABkCAYAAABw4pVUAAAJIElEQVR4nO2dbWxb1RnH/8+1c5O4bITEwJrRF1ZAI6gtL9oK29oxihAdTQOVoGhbKyS0MDWZJk1CQ+q0aR/4xLYvJNGabdK07MukrSUNaxEvg7aUlteuLUoHrUTbseylSRSgpLGd3Ac9596kSWzHvva1fXzv/UmW4jaxj5+/z73nPOec50/QnM3t5xdbUWOlZeBGgK8jNpYC3AxQHOAGEMXAXKN+mSgF5nGAxgAeBmiIyToH0GnDwklj0jqxq/fK/0BjCJrR2jn8ZcPCXSBaC9DtAC/39h3oDMBHwHzQMvD3ga74P6ERWgjS1jG8BjAeALgVQEuZ334QoAHA2t3fHX8dQRWktX0obpi1jzDjewSshgYwcIwIf7KSiT8M9DYPB0KQts7RlWDuANCuSw/NAAPoBVF3f1fjCZQRKq8QeBzgragqqA+Ep8olDJXj0kSm+XNi6kQVw8RdnEz+otSXspIK0rZ9eDuIngTQAH8wBuYd/T3xnqoSRIauERi/ZuYN8CFEtG8K1o9LMWT2XJBN20e+TwZ1gdmEnyFKssWde3qafuvpy3r5Ym0dI78B8BiCxc7+7qYfaCXIxvbRpZEa7gOwDsHkwFSKtj7b23iu4oLYs2z6M4BlCDZnAd5S7Gy/KEHu3z5yDxN2AVhUzOv4iE+JsfmZnqbnyy7Iph+O3kcWD2g8264UzAa17nm68W+F/DEV0TOeC8XIChPj3kJ6ChV4z3gpvEzl5FOA17u9pxhuR1PODTwUIzeLJFZ2zEokiDO0Dfpoyg3LnJh5L4gz6QvqPKMY1jmx804QSYcEcAbuJY85MSz+pq7WuGEc831uqtQQJS1Yq3MlJHP2EMnahmJ4ALOpYpkDI9d6hl9T6JVAYqnWiAq5ZKlNCDW1p3y0uKQLY1YqcX22lcesPUSWXUMxSkKDE9v8e4izM+R4adoToiBalWnjROYeonaHhJSULDFO6yFh76hsL0nvIfYmtpBykCHWlGFk9X8d0uqrbqjBj7YtQlODq3QbLAtIphgffcL44N+TeO1oEgfeSkJT2Eolrpo94orO/l/ZawuuvBjFYBhAXS2px9VxE2tWmdjWZmHvgQnsemECmkEq5sAvp/9hztdPNj7DZxAB8SsMfLc1hscfvQz1dXp93+bH3Ji98KTLLvRSEDGAO1abaH8wBp2QmNuLfmk9RM5n+BvDAL6y0sTa23RLzV2K/ax7iDosUzUcOprEmydS6udoBFixJIovLYng2msiMGuyX5YW1RNuXBHFwbd1utGr2D8xI4ik2MFlP7lUFBcnGK+8kZh5/uJh+2e5ibc/FMs6OpN7yjVXR6AZLaKBpOZVq9WZPp/w+vEkXjqcwOQUqoppDeyvkTpg6R+GzltIpeQQVBXhaOD0azntGgyYgQ//p2P3sTUw5By490ePK8u1X4zANDPf2D+6YOGtd+3BgF7wctHCkEP58BG3ttTgG7eZat6RKa0iYrwzqKMggGgRlQoJVGWX20xcvyyKDetqcfsqE7F6yiiGCPH7v45DV0SLqJSr0CCX6Jq776hVj3yQZKMMi/v2XFTDZX3h66JO7RD4kQvjjKMnU3j2lQm898EkdEe0kB7SDJ9yWYzwtVtMLGuOqEnkvoMJ3XtIs2FX1fEvEQNYujiCrZti+NVPPq9m8vpCcUOVOAoAREDzlRGVVpGRmJ5wQ9SpN4Vq49Cs5KJQZwI3LJcEYxRLFkcyDnsFyXFta4vh/OgF/Ou/mk0QiWLRmeJfVcbFeclF4blX7ecy+vrOxno0Xp5ZlSVfiOBba2rxx37NhsDMNe4WrKuEFw8nsHd/Qg13MxGJAC0r5qxea4OhyuL5kGPvpTD2cfZLsfQemUxqBVHKsGsU+o9TZyeRWCDjK72kvlazCTHzuAx7x+BDWlZEEVtgQ8PUFHAxodtghsZk2F
uRUnal5tabanD557LfIkUM6UV6wcPSQ4bgMzbeWYcNa+vUWnsmZJR/bkizIa+ChqJS15ZYs2tpHsj+qju/eim5KMGXeYg8FpqHCOMTjBOn9BvLiBZRKTKMKuTrt5jq4RbpHYOnJ/H8oblzGD2g04ZUfEaAODs0pd+E0EG0iEr57Sl/zg/Tesbpc5P43V/G9UuZOIgWhl0LXcpv+5ePLzAGXp7Az57+RON1ETojWjhTVT4CwDcbHZIpVjfuMx9O4cjxJPa/mdR8HWRGA2crKfNBED0MjTj+fgqP/tSXc9bMiAbT+7LEJaDS7Qk6lqOBEsQp9zBY6UYFmMHpkhuzhldi2RBSGS7FfpYg1u4KtSYEl2I/J2eyqWPkH34+RaUj4lmyp7vp5unnc2aEYmZSkVYFGJoX8zmCiLOMY2YSUh7YiXlmQZzz0r1lakwI0Du/KlB6Eouou5wtCjSUHus0QezaG+SqkmZIIVBf/tWACE8V9B4h+ZMlxhkFEeXEc8nFy4e4QGKbzWQs60KIGGApz6UQrxlzYgtXgqi7P/MOz5sTdJh3LOT0lnN3w/2do3vDyqTemYk909X47YV+J+farbiRSRFgj9oUXIiSKpY5yCmIpIXFjcyzhgUUtrgzH5u9vHY3ONZwOz1pWTDZma+9nqsdcm0dI/tDhwTXHOjvbvpmvr/sav+PWMPZbmQheXLWiRlKIojt08dbbDufkDwsj7a49TZ0vUNOPJXEGi5M0+c0BdtciKdhQVsWxX1MrOFCUbLb5hXqZVjwHlLx6RNruPDylWYseW+hHoZCUZt67W8Brw9v9DPWq+uLcfkUQnNiv5kTzya07y4eT88hSMOY0R6I3BdRUj6rl2IInh8MkRSBuJFJZhM+hYj2yWfMNx3i6rVRQpQBFtGTPrJOGpP1jP6eeE+p3qCkR6ek4WKA5YflYCbuks9SSjGEsh2/tZ17xOaHXeV2Kg/1yYaEbGvgnr8byoxjqSTOMu06GMdkQTIQvbJvqlxCTFOxgCg3H7P2EfHP0GWDNwPHZK+tbO9caN27lGjxDbX9M8SyQbkElNsUYNA+n2HtLiQZ6EtB0syQLdxl10KX8tteV92WE8d8RM70yTGyfJZVAy0I5iHlt6XisxQZlrq2TlnbZrt4Jzc4JQrtqnhS+0uVm5IKR1JUh4akXIWqkGDhpJwDt4+B68tnvr6L5zB8YjIAAAAASUVORK5CYII="
},
"annotations": []
}
],
"tables": [],
"key_value_items": [],
"form_items": [],
"pages": {}
}

@ -0,0 +1,13 @@
**Transcript**
February 20, 2025, 8:32PM
<!-- image -->
**This is test 1** 0:08
Correct, he is not.
<!-- image -->
**This is test 2** 0:16
Yeah, exactly.

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "word_sample",
"origin": {
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@ -106,7 +106,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -149,7 +150,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -167,7 +169,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -217,7 +220,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -235,7 +239,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -255,7 +260,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -275,7 +281,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -295,7 +302,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -313,7 +321,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -333,7 +342,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -353,7 +363,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -373,7 +384,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -426,7 +438,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -444,7 +457,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -462,7 +476,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -492,7 +507,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -510,7 +526,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -530,7 +547,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -550,7 +568,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
},
"enumerated": false,
"marker": "-"
@ -897,7 +916,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "word_tables",
"origin": {
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@ -119,7 +119,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -149,7 +150,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -179,7 +181,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -209,7 +212,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -239,7 +243,8 @@
"bold": false,
"italic": false,
"underline": false,
"strikethrough": false
"strikethrough": false,
"script": "baseline"
}
},
{
@ -510,7 +515,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/1",
@ -729,7 +735,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/2",
@ -1020,7 +1027,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/3",
@ -1387,7 +1395,8 @@
}
]
]
}
},
"annotations": []
},
{
"self_ref": "#/tables/4",
@ -2398,7 +2407,8 @@
}
]
]
}
},
"annotations": []
}
],
"key_value_items": [],

@ -11,12 +11,22 @@ Create your feature branch: `git checkout -b feature/AmazingFeature`.
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
6. **Whole list item has same formatting**
7. List item has *mixed or partial* formatting
## *Second* section <!-- inline groups in headings not yet supported by serializers -->
# *Whole heading is italic*
<<<<<<< HEAD
- **First**: Lorem ipsum.
- **Second**: Dolor `sit` amet.
| **Bold Heading** | *Italic Heading* |
|------------------|------------------|
| data a | data b |
=======
Some *`formatted_code`*
## *Partially formatted* heading to_escape `not_to_escape`
[$$E=mc^2$$](https://en.wikipedia.org/wiki/Albert_Einstein)
>>>>>>> origin/main

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "webp-test",
"origin": {
"mimetype": "application/pdf",

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.3.0",
"version": "1.4.0",
"name": "ocr_test",
"origin": {
"mimetype": "application/pdf",

Some files were not shown because too many files have changed in this diff.