Mirror of https://github.com/DS4SD/docling.git (synced 2025-07-26 20:14:47 +00:00)

Merge from main

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Commit 4cfb2cd0a9
.github/workflows/checks.yml (vendored) — 6 changes

@@ -22,8 +22,8 @@ jobs:
         python-version: ['3.9', '3.10', '3.11', '3.12', '3.13']
     steps:
       - uses: actions/checkout@v4
-      - name: Install tesseract
-        run: sudo apt-get update && sudo apt-get install -y tesseract-ocr tesseract-ocr-eng tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-spa tesseract-ocr-script-latn libleptonica-dev libtesseract-dev pkg-config
+      - name: Install tesseract and ffmpeg
+        run: sudo apt-get update && sudo apt-get install -y ffmpeg tesseract-ocr tesseract-ocr-eng tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-spa tesseract-ocr-script-latn libleptonica-dev libtesseract-dev pkg-config
       - name: Set TESSDATA_PREFIX
         run: |
           echo "TESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$)" >> "$GITHUB_ENV"
@@ -60,7 +60,7 @@ jobs:
         run: |
           for file in docs/examples/*.py; do
             # Skip batch_convert.py
-            if [[ "$(basename "$file")" =~ ^(batch_convert|compare_vlm_models|minimal|minimal_vlm_pipeline|export_multimodal|custom_convert|develop_picture_enrichment|rapidocr_with_custom_models|offline_convert|pictures_description|pictures_description_api|vlm_pipeline_api_model).py ]]; then
+            if [[ "$(basename "$file")" =~ ^(batch_convert|compare_vlm_models|minimal|minimal_vlm_pipeline|minimal_asr_pipeline|export_multimodal|custom_convert|develop_picture_enrichment|rapidocr_with_custom_models|offline_convert|pictures_description|pictures_description_api|vlm_pipeline_api_model).py ]]; then
               echo "Skipping $file"
               continue
             fi
CHANGELOG.md — 39 changes

@@ -1,3 +1,42 @@
+## [v2.39.0](https://github.com/docling-project/docling/releases/tag/v2.39.0) - 2025-06-27
+
+### Feature
+
+* Leverage new list modeling, capture default markers ([#1856](https://github.com/docling-project/docling/issues/1856)) ([`0533da1`](https://github.com/docling-project/docling/commit/0533da1923598e4a2d6392283f6de0f9c7002b01))
+
+### Fix
+
+* **markdown:** Make parsing of rich table cells valid ([#1821](https://github.com/docling-project/docling/issues/1821)) ([`e79e4f0`](https://github.com/docling-project/docling/commit/e79e4f0ab6c5b8276316e423b14c9821165049f2))
+
+## [v2.38.1](https://github.com/docling-project/docling/releases/tag/v2.38.1) - 2025-06-25
+
+### Fix
+
+* Updated granite vision model version for picture description ([#1852](https://github.com/docling-project/docling/issues/1852)) ([`d337825`](https://github.com/docling-project/docling/commit/d337825b8ef9ab3ec00c1496c340041e406bd271))
+* **markdown:** Fix single-formatted headings & list items ([#1820](https://github.com/docling-project/docling/issues/1820)) ([`7c5614a`](https://github.com/docling-project/docling/commit/7c5614a37a316950c9a1d123e4fd94e0e831aca0))
+* Fix response type of ollama ([#1850](https://github.com/docling-project/docling/issues/1850)) ([`41e8cae`](https://github.com/docling-project/docling/commit/41e8cae26b625b95ffab021fb4dc337249e8caad))
+* Handle missing runs to avoid out of range exception ([#1844](https://github.com/docling-project/docling/issues/1844)) ([`4002de1`](https://github.com/docling-project/docling/commit/4002de1f9220a6568ed87ba726254cde3ab1168a))
+
+## [v2.38.0](https://github.com/docling-project/docling/releases/tag/v2.38.0) - 2025-06-23
+
+### Feature
+
+* Support audio input ([#1763](https://github.com/docling-project/docling/issues/1763)) ([`1557e7c`](https://github.com/docling-project/docling/commit/1557e7ce3e036fb51eb118296f5cbff3b6dfbfa7))
+* **markdown:** Add formatting & improve inline support ([#1804](https://github.com/docling-project/docling/issues/1804)) ([`861abcd`](https://github.com/docling-project/docling/commit/861abcdcb0d406342b9566f81203b87cf32b7ad0))
+* Maximum image size for Vlm models ([#1802](https://github.com/docling-project/docling/issues/1802)) ([`215b540`](https://github.com/docling-project/docling/commit/215b540f6c078a72464310ef22975ebb6cde4f0a))
+
+### Fix
+
+* **docx:** Ensure list items have a list parent ([#1827](https://github.com/docling-project/docling/issues/1827)) ([`d26dac6`](https://github.com/docling-project/docling/commit/d26dac61a86b0af5b16686f78956ba047bcbddba))
+* **msword_backend:** Identify text in the same line after an image #1425 ([#1610](https://github.com/docling-project/docling/issues/1610)) ([`1350a8d`](https://github.com/docling-project/docling/commit/1350a8d3e5ea3c4b4d506757758880c8f78efd8c))
+* Ensure uninitialized pages are removed before assembling document ([#1812](https://github.com/docling-project/docling/issues/1812)) ([`dd7f64f`](https://github.com/docling-project/docling/commit/dd7f64ff28226cd9964fc4d8ba807b2c8a6358ef))
+* Formula conversion with page_range param set ([#1791](https://github.com/docling-project/docling/issues/1791)) ([`dbab30e`](https://github.com/docling-project/docling/commit/dbab30e92cc1d130ce7f9335ab9c46aa7a30930d))
+
+### Documentation
+
+* Update readme and add ASR example ([#1836](https://github.com/docling-project/docling/issues/1836)) ([`f3ae302`](https://github.com/docling-project/docling/commit/f3ae3029b8a6d6f0109383fbc82ebf9da3942afd))
+* Support running examples from root or subfolder ([#1816](https://github.com/docling-project/docling/issues/1816)) ([`64ac043`](https://github.com/docling-project/docling/commit/64ac043786efdece0c61827051a5b41dddf6c5d7))
+
 ## [v2.37.0](https://github.com/docling-project/docling/releases/tag/v2.37.0) - 2025-06-16
 
 ### Feature
README.md

@@ -28,14 +28,15 @@ Docling simplifies document processing, parsing diverse formats — including ad
 ## Features
 
-* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
+* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, images (PNG, TIFF, JPEG, ...), and more
 * 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
 * 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
-* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
+* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, [DocTags](https://arxiv.org/abs/2503.11576) and lossless JSON
 * 🔒 Local execution capabilities for sensitive data and air-gapped environments
 * 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
 * 🔍 Extensive OCR support for scanned PDFs and images
-* 🥚 Support of several Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview))
+* 👓 Support of several Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview))
+* 🎙️ Support for Audio with Automatic Speech Recognition (ASR) models
 * 💻 Simple and convenient CLI
 
 ### Coming soon
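The new audio bullets above are backed by the ASR pipeline added further down in this commit. A minimal usage sketch, assuming the default AUDIO format option registered later in this diff (AsrPipeline + NoOpBackend); the file name is hypothetical:

```python
from docling.document_converter import DocumentConverter

# With the default format options added in this commit, .wav/.mp3 inputs are
# routed to the ASR pipeline without extra configuration.
converter = DocumentConverter()
result = converter.convert("meeting_recording.mp3")  # hypothetical file
print(result.document.export_to_markdown())
```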
docling/backend/html_backend.py

@@ -17,6 +17,7 @@ from docling_core.types.doc import (
     TableData,
 )
 from docling_core.types.doc.document import ContentLayer
+from pydantic import BaseModel
 from typing_extensions import override
 
 from docling.backend.abstract_backend import DeclarativeDocumentBackend
@@ -48,6 +49,11 @@ TAGS_FOR_NODE_ITEMS: Final = [
 ]
 
 
+class _Context(BaseModel):
+    list_ordered_flag_by_ref: dict[str, bool] = {}
+    list_start_by_ref: dict[str, int] = {}
+
+
 class HTMLDocumentBackend(DeclarativeDocumentBackend):
     @override
     def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
@@ -59,6 +65,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
         self.max_levels = 10
         self.level = 0
         self.parents: dict[int, Optional[Union[DocItem, GroupItem]]] = {}
+        self.ctx = _Context()
         for i in range(self.max_levels):
             self.parents[i] = None
 
@@ -121,6 +128,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
             self.content_layer = (
                 ContentLayer.BODY if headers is None else ContentLayer.FURNITURE
             )
+            self.ctx = _Context()  # reset context
             self.walk(content, doc)
         else:
             raise RuntimeError(
@@ -294,28 +302,25 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
     def handle_list(self, element: Tag, doc: DoclingDocument) -> None:
         """Handles list tags (ul, ol) and their list items."""
 
-        if element.name == "ul":
-            # create a list group
-            self.parents[self.level + 1] = doc.add_group(
-                parent=self.parents[self.level],
-                name="list",
-                label=GroupLabel.LIST,
-                content_layer=self.content_layer,
-            )
-        elif element.name == "ol":
-            start_attr = element.get("start")
-            start: int = (
-                int(start_attr)
-                if isinstance(start_attr, str) and start_attr.isnumeric()
-                else 1
-            )
-            # create a list group
-            self.parents[self.level + 1] = doc.add_group(
-                parent=self.parents[self.level],
-                name="ordered list" + (f" start {start}" if start != 1 else ""),
-                label=GroupLabel.ORDERED_LIST,
-                content_layer=self.content_layer,
-            )
+        start: Optional[int] = None
+        if is_ordered := element.name == "ol":
+            start_attr = element.get("start")
+            if isinstance(start_attr, str) and start_attr.isnumeric():
+                start = int(start_attr)
+            name = "ordered list" + (f" start {start}" if start is not None else "")
+        else:
+            name = "list"
+        # create a list group
+        list_group = doc.add_list_group(
+            name=name,
+            parent=self.parents[self.level],
+            content_layer=self.content_layer,
+        )
+        self.parents[self.level + 1] = list_group
+        self.ctx.list_ordered_flag_by_ref[list_group.self_ref] = is_ordered
+        if is_ordered and start is not None:
+            self.ctx.list_start_by_ref[list_group.self_ref] = start
 
         self.level += 1
 
         self.walk(element, doc)
@@ -331,16 +336,11 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
         if parent is None:
             _log.debug(f"list-item has no parent in DoclingDocument: {element}")
             return
-        parent_label: str = parent.label
-        index_in_list = len(parent.children) + 1
-        if (
-            parent_label == GroupLabel.ORDERED_LIST
-            and isinstance(parent, GroupItem)
-            and parent.name
-        ):
-            start_in_list: str = parent.name.split(" ")[-1]
-            start: int = int(start_in_list) if start_in_list.isnumeric() else 1
-            index_in_list += start - 1
+        enumerated = self.ctx.list_ordered_flag_by_ref.get(parent.self_ref, False)
+        if enumerated and (start := self.ctx.list_start_by_ref.get(parent.self_ref)):
+            marker = f"{start + len(parent.children)}."
+        else:
+            marker = ""
 
         if nested_list:
             # Text in list item can be hidden within hierarchy, hence
@@ -350,12 +350,6 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
             text = text.replace("\n", "").replace("\r", "")
             text = " ".join(text.split()).strip()
 
-            marker = ""
-            enumerated = False
-            if parent_label == GroupLabel.ORDERED_LIST:
-                marker = str(index_in_list)
-                enumerated = True
-
             if len(text) > 0:
                 # create a list-item
                 self.parents[self.level + 1] = doc.add_list_item(
@@ -375,11 +369,6 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
         elif element.text.strip():
             text = element.text.strip()
 
-            marker = ""
-            enumerated = False
-            if parent_label == GroupLabel.ORDERED_LIST:
-                marker = f"{index_in_list!s}."
-                enumerated = True
-
             doc.add_list_item(
                 text=text,
                 enumerated=enumerated,
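For orientation, a minimal sketch of what the new `_Context` bookkeeping changes for an ordered list with a `start` attribute. The HTML snippet, the `InputDocument` construction, and the expected markers are illustrative assumptions, not taken from the diff:

```python
from io import BytesIO

from docling.backend.html_backend import HTMLDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import InputDocument

html = b'<html><body><ol start="5"><li>alpha</li><li>beta</li></ol></body></html>'
# Assumed constructor usage; filename is required here because a stream is passed.
in_doc = InputDocument(
    path_or_stream=BytesIO(html),
    format=InputFormat.HTML,
    backend=HTMLDocumentBackend,
    filename="list.html",
)
doc = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(html)).convert()
# With list_start_by_ref recorded on the group, the two items should carry the
# markers "5." and "6." instead of restarting at 1.
print(doc.export_to_markdown())
```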
docling/backend/md_backend.py

@@ -2,9 +2,10 @@ import logging
 import re
 import warnings
 from copy import deepcopy
+from enum import Enum
 from io import BytesIO
 from pathlib import Path
-from typing import List, Optional, Set, Union
+from typing import List, Literal, Optional, Set, Union
 
 import marko
 import marko.element
@@ -13,15 +14,15 @@ from docling_core.types.doc import (
     DocItemLabel,
     DoclingDocument,
     DocumentOrigin,
-    GroupLabel,
     NodeItem,
     TableCell,
     TableData,
     TextItem,
 )
-from docling_core.types.doc.document import Formatting, OrderedList, UnorderedList
+from docling_core.types.doc.document import Formatting
 from marko import Markdown
-from pydantic import AnyUrl, TypeAdapter
+from pydantic import AnyUrl, BaseModel, Field, TypeAdapter
+from typing_extensions import Annotated
 
 from docling.backend.abstract_backend import DeclarativeDocumentBackend
 from docling.backend.html_backend import HTMLDocumentBackend
@@ -35,6 +36,32 @@ _START_MARKER = f"#_#_{_MARKER_BODY}_START_#_#"
 _STOP_MARKER = f"#_#_{_MARKER_BODY}_STOP_#_#"
 
 
+class _PendingCreationType(str, Enum):
+    """CoordOrigin."""
+
+    HEADING = "heading"
+    LIST_ITEM = "list_item"
+
+
+class _HeadingCreationPayload(BaseModel):
+    kind: Literal["heading"] = "heading"
+    level: int
+
+
+class _ListItemCreationPayload(BaseModel):
+    kind: Literal["list_item"] = "list_item"
+    enumerated: bool
+
+
+_CreationPayload = Annotated[
+    Union[
+        _HeadingCreationPayload,
+        _ListItemCreationPayload,
+    ],
+    Field(discriminator="kind"),
+]
+
+
 class MarkdownDocumentBackend(DeclarativeDocumentBackend):
     def _shorten_underscore_sequences(self, markdown_text: str, max_length: int = 10):
         # This regex will match any sequence of underscores
@@ -155,6 +182,50 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
             doc.add_table(data=table_data)
         return
 
+    def _create_list_item(
+        self,
+        doc: DoclingDocument,
+        parent_item: Optional[NodeItem],
+        text: str,
+        enumerated: bool,
+        formatting: Optional[Formatting] = None,
+        hyperlink: Optional[Union[AnyUrl, Path]] = None,
+    ):
+        item = doc.add_list_item(
+            text=text,
+            enumerated=enumerated,
+            parent=parent_item,
+            formatting=formatting,
+            hyperlink=hyperlink,
+        )
+        return item
+
+    def _create_heading_item(
+        self,
+        doc: DoclingDocument,
+        parent_item: Optional[NodeItem],
+        text: str,
+        level: int,
+        formatting: Optional[Formatting] = None,
+        hyperlink: Optional[Union[AnyUrl, Path]] = None,
+    ):
+        if level == 1:
+            item = doc.add_title(
+                text=text,
+                parent=parent_item,
+                formatting=formatting,
+                hyperlink=hyperlink,
+            )
+        else:
+            item = doc.add_heading(
+                text=text,
+                level=level - 1,
+                parent=parent_item,
+                formatting=formatting,
+                hyperlink=hyperlink,
+            )
+        return item
+
     def _iterate_elements(  # noqa: C901
         self,
         *,
@@ -162,6 +233,10 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
         depth: int,
         doc: DoclingDocument,
         visited: Set[marko.element.Element],
+        creation_stack: list[
+            _CreationPayload
+        ],  # stack for lazy item creation triggered deep in marko's AST (on RawText)
+        list_ordered_flag_by_ref: dict[str, bool],
         parent_item: Optional[NodeItem] = None,
         formatting: Optional[Formatting] = None,
         hyperlink: Optional[Union[AnyUrl, Path]] = None,
@@ -177,28 +252,17 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
                 f" - Heading level {element.level}, content: {element.children[0].children}"  # type: ignore
             )
 
-            if len(element.children) == 1:
-                child = element.children[0]
-                snippet_text = str(child.children)  # type: ignore
-                visited.add(child)
-            else:
-                snippet_text = ""  # inline group will be created
-
-            if element.level == 1:
-                parent_item = doc.add_title(
-                    text=snippet_text,
-                    parent=parent_item,
-                    formatting=formatting,
-                    hyperlink=hyperlink,
-                )
-            else:
-                parent_item = doc.add_heading(
-                    text=snippet_text,
-                    level=element.level - 1,
-                    parent=parent_item,
-                    formatting=formatting,
-                    hyperlink=hyperlink,
-                )
+            if len(element.children) > 1:  # inline group will be created further down
+                parent_item = self._create_heading_item(
+                    doc=doc,
+                    parent_item=parent_item,
+                    text="",
+                    level=element.level,
+                    formatting=formatting,
+                    hyperlink=hyperlink,
+                )
+            else:
+                creation_stack.append(_HeadingCreationPayload(level=element.level))
 
         elif isinstance(element, marko.block.List):
             has_non_empty_list_items = False
@@ -210,10 +274,8 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
             self._close_table(doc)
             _log.debug(f" - List {'ordered' if element.ordered else 'unordered'}")
             if has_non_empty_list_items:
-                label = GroupLabel.ORDERED_LIST if element.ordered else GroupLabel.LIST
-                parent_item = doc.add_group(
-                    label=label, name="list", parent=parent_item
-                )
+                parent_item = doc.add_list_group(name="list", parent=parent_item)
+                list_ordered_flag_by_ref[parent_item.self_ref] = element.ordered
 
         elif (
            isinstance(element, marko.block.ListItem)
@@ -224,22 +286,22 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
             self._close_table(doc)
             _log.debug(" - List item")
 
-            if len(child.children) == 1:
-                snippet_text = str(child.children[0].children)  # type: ignore
-                visited.add(child)
-            else:
-                snippet_text = ""  # inline group will be created
-            is_numbered = isinstance(parent_item, OrderedList)
-            if not isinstance(parent_item, (OrderedList, UnorderedList)):
-                _log.warning("ListItem would have not had a list parent, adding one.")
-                parent_item = doc.add_unordered_list(parent=parent_item)
-            parent_item = doc.add_list_item(
-                enumerated=is_numbered,
-                parent=parent_item,
-                text=snippet_text,
-                formatting=formatting,
-                hyperlink=hyperlink,
-            )
+            enumerated = (
+                list_ordered_flag_by_ref.get(parent_item.self_ref, False)
+                if parent_item
+                else False
+            )
+            if len(child.children) > 1:  # inline group will be created further down
+                parent_item = self._create_list_item(
+                    doc=doc,
+                    parent_item=parent_item,
+                    text="",
+                    enumerated=enumerated,
+                    formatting=formatting,
+                    hyperlink=hyperlink,
+                )
+            else:
+                creation_stack.append(_ListItemCreationPayload(enumerated=enumerated))
 
         elif isinstance(element, marko.inline.Image):
             self._close_table(doc)
@@ -276,7 +338,7 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
             _log.debug(f" - Paragraph (raw text): {element.children}")
             snippet_text = element.children.strip()
             # Detect start of the table:
-            if "|" in snippet_text:
+            if "|" in snippet_text or self.in_table:
                 # most likely part of the markdown table
                 self.in_table = True
                 if len(self.md_table_buffer) > 0:
@@ -285,13 +347,46 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
                 self.md_table_buffer.append(snippet_text)
             elif snippet_text:
                 self._close_table(doc)
-                doc.add_text(
-                    label=DocItemLabel.TEXT,
-                    parent=parent_item,
-                    text=snippet_text,
-                    formatting=formatting,
-                    hyperlink=hyperlink,
-                )
+
+                if creation_stack:
+                    while len(creation_stack) > 0:
+                        to_create = creation_stack.pop()
+                        if isinstance(to_create, _ListItemCreationPayload):
+                            enumerated = (
+                                list_ordered_flag_by_ref.get(
+                                    parent_item.self_ref, False
+                                )
+                                if parent_item
+                                else False
+                            )
+                            parent_item = self._create_list_item(
+                                doc=doc,
+                                parent_item=parent_item,
+                                text=snippet_text,
+                                enumerated=enumerated,
+                                formatting=formatting,
+                                hyperlink=hyperlink,
+                            )
+                        elif isinstance(to_create, _HeadingCreationPayload):
+                            # not keeping as parent_item as logic for correctly tracking
+                            # that not implemented yet (section components not captured
+                            # as heading children in marko)
+                            self._create_heading_item(
+                                doc=doc,
+                                parent_item=parent_item,
+                                text=snippet_text,
+                                level=to_create.level,
+                                formatting=formatting,
+                                hyperlink=hyperlink,
+                            )
+                else:
+                    doc.add_text(
+                        label=DocItemLabel.TEXT,
+                        parent=parent_item,
+                        text=snippet_text,
+                        formatting=formatting,
+                        hyperlink=hyperlink,
+                    )
 
         elif isinstance(element, marko.inline.CodeSpan):
             self._close_table(doc)
@@ -353,7 +448,6 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
                 parent_item = doc.add_inline_group(parent=parent_item)
 
             processed_block_types = (
-                # marko.block.Heading,
                 marko.block.CodeBlock,
                 marko.block.FencedCode,
                 marko.inline.RawText,
@@ -369,6 +463,8 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
                     depth=depth + 1,
                     doc=doc,
                     visited=visited,
+                    creation_stack=creation_stack,
+                    list_ordered_flag_by_ref=list_ordered_flag_by_ref,
                     parent_item=parent_item,
                    formatting=formatting,
                    hyperlink=hyperlink,
@@ -412,6 +508,8 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
             doc=doc,
             parent_item=None,
             visited=set(),
+            creation_stack=[],
+            list_ordered_flag_by_ref={},
         )
         self._close_table(doc=doc)  # handle any last hanging table
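A small sketch of the behavior the lazy creation stack targets (the markdown snippet and the `InputDocument` construction are assumptions for illustration, not part of the diff):

```python
from io import BytesIO

from docling.backend.md_backend import MarkdownDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import InputDocument

md = b"# *Fully emphasized* title\n\n1. first item\n2. second item\n"
# Assumed constructor usage; filename is required here because a stream is passed.
in_doc = InputDocument(
    path_or_stream=BytesIO(md),
    format=InputFormat.MD,
    backend=MarkdownDocumentBackend,
    filename="sample.md",
)
doc = MarkdownDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(md)).convert()
# Single-formatted headings and list items are now created lazily when their RawText
# is reached, so the title keeps its text and the items stay enumerated.
print(doc.export_to_markdown())
```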
docling/backend/mspowerpoint_backend.py

@@ -121,7 +121,9 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
 
         return prov
 
-    def handle_text_elements(self, shape, parent_slide, slide_ind, doc, slide_size):
+    def handle_text_elements(
+        self, shape, parent_slide, slide_ind, doc: DoclingDocument, slide_size
+    ):
         is_list_group_created = False
         enum_list_item_value = 0
         new_list = None
@@ -165,10 +167,7 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
             enumerated = bullet_type == "Numbered"
 
             if not is_list_group_created:
-                new_list = doc.add_group(
-                    label=GroupLabel.ORDERED_LIST
-                    if enumerated
-                    else GroupLabel.LIST,
+                new_list = doc.add_list_group(
                     name="list",
                     parent=parent_slide,
                 )
docling/backend/msword_backend.py

@@ -10,11 +10,12 @@ from docling_core.types.doc import (
     DocumentOrigin,
     GroupLabel,
     ImageRef,
+    ListGroup,
     NodeItem,
     TableCell,
     TableData,
 )
-from docling_core.types.doc.document import Formatting, OrderedList, UnorderedList
+from docling_core.types.doc.document import Formatting
 from docx import Document
 from docx.document import Document as DocxDocument
 from docx.oxml.table import CT_Tc
@@ -397,7 +398,11 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
             if isinstance(c, Hyperlink):
                 text = c.text
                 hyperlink = Path(c.address)
-                format = self._get_format_from_run(c.runs[0])
+                format = (
+                    self._get_format_from_run(c.runs[0])
+                    if c.runs and len(c.runs) > 0
+                    else None
+                )
             elif isinstance(c, Run):
                 text = c.text
                 hyperlink = None
@@ -684,7 +689,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
         paragraph_elements: list,
     ) -> Optional[NodeItem]:
         return (
-            doc.add_group(label=GroupLabel.INLINE, parent=prev_parent)
+            doc.add_inline_group(parent=prev_parent)
             if len(paragraph_elements) > 1
             else prev_parent
         )
@@ -777,9 +782,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
         else:
             # Inline equation
             level = self._get_level()
-            inline_equation = doc.add_group(
-                label=GroupLabel.INLINE, parent=self.parents[level - 1]
-            )
+            inline_equation = doc.add_inline_group(parent=self.parents[level - 1])
             text_tmp = text
             for eq in equations:
                 if len(text_tmp) == 0:
@@ -927,18 +930,22 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
         level: int,
     ) -> None:
         # This should not happen by construction
-        if not isinstance(self.parents[level], (OrderedList, UnorderedList)):
+        if not isinstance(self.parents[level], ListGroup):
             return
+        if not elements:
+            return
 
         if len(elements) == 1:
             text, format, hyperlink = elements[0]
-            doc.add_list_item(
-                marker=marker,
-                enumerated=enumerated,
-                parent=self.parents[level],
-                text=text,
-                formatting=format,
-                hyperlink=hyperlink,
-            )
+            if text:
+                doc.add_list_item(
+                    marker=marker,
+                    enumerated=enumerated,
+                    parent=self.parents[level],
+                    text=text,
+                    formatting=format,
+                    hyperlink=hyperlink,
+                )
         else:
             new_item = doc.add_list_item(
                 marker=marker,
@@ -946,15 +953,16 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
                 parent=self.parents[level],
                 text="",
             )
-            new_parent = doc.add_group(label=GroupLabel.INLINE, parent=new_item)
+            new_parent = doc.add_inline_group(parent=new_item)
             for text, format, hyperlink in elements:
-                doc.add_text(
-                    label=DocItemLabel.TEXT,
-                    parent=new_parent,
-                    text=text,
-                    formatting=format,
-                    hyperlink=hyperlink,
-                )
+                if text:
+                    doc.add_text(
+                        label=DocItemLabel.TEXT,
+                        parent=new_parent,
+                        text=text,
+                        formatting=format,
+                        hyperlink=hyperlink,
+                    )
 
     def _add_list_item(
         self,
@@ -975,8 +983,8 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
         if self._prev_numid() is None:  # Open new list
             self.level_at_new_list = level
 
-            self.parents[level] = doc.add_group(
-                label=GroupLabel.LIST, name="list", parent=self.parents[level - 1]
+            self.parents[level] = doc.add_list_group(
+                name="list", parent=self.parents[level - 1]
             )
 
         # Set marker and enumerated arguments if this is an enumeration element.
@@ -997,19 +1005,10 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
                 self.level_at_new_list + prev_indent + 1,
                 self.level_at_new_list + ilevel + 1,
             ):
-                # Determine if this is an unordered list or an ordered list.
-                # Set GroupLabel.ORDERED_LIST when it fits.
                 self.listIter = 0
-                if is_numbered:
-                    self.parents[i] = doc.add_group(
-                        label=GroupLabel.ORDERED_LIST,
-                        name="list",
-                        parent=self.parents[i - 1],
-                    )
-                else:
-                    self.parents[i] = doc.add_group(
-                        label=GroupLabel.LIST, name="list", parent=self.parents[i - 1]
-                    )
+                self.parents[i] = doc.add_list_group(
+                    name="list", parent=self.parents[i - 1]
+                )
 
             # TODO: Set marker and enumerated arguments if this is an enumeration element.
             self.listIter += 1
docling/backend/noop_backend.py — new file, 51 lines

@@ -0,0 +1,51 @@
+import logging
+from io import BytesIO
+from pathlib import Path
+from typing import Set, Union
+
+from docling.backend.abstract_backend import AbstractDocumentBackend
+from docling.datamodel.base_models import InputFormat
+from docling.datamodel.document import InputDocument
+
+_log = logging.getLogger(__name__)
+
+
+class NoOpBackend(AbstractDocumentBackend):
+    """
+    A no-op backend that only validates input existence.
+    Used e.g. for audio files where actual processing is handled by the ASR pipeline.
+    """
+
+    def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
+        super().__init__(in_doc, path_or_stream)
+
+        _log.debug(f"NoOpBackend initialized for: {path_or_stream}")
+
+        # Validate input
+        try:
+            if isinstance(self.path_or_stream, BytesIO):
+                # Check if stream has content
+                self.valid = len(self.path_or_stream.getvalue()) > 0
+                _log.debug(
+                    f"BytesIO stream length: {len(self.path_or_stream.getvalue())}"
+                )
+            elif isinstance(self.path_or_stream, Path):
+                # Check if file exists
+                self.valid = self.path_or_stream.exists()
+                _log.debug(f"File exists: {self.valid}")
+            else:
+                self.valid = False
+        except Exception as e:
+            _log.error(f"NoOpBackend validation failed: {e}")
+            self.valid = False
+
+    def is_valid(self) -> bool:
+        return self.valid
+
+    @classmethod
+    def supports_pagination(cls) -> bool:
+        return False
+
+    @classmethod
+    def supported_formats(cls) -> Set[InputFormat]:
+        return set(InputFormat)
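A quick sketch of the validation behavior of the new backend; the stream contents and the `InputDocument` construction are assumptions for illustration:

```python
from io import BytesIO

from docling.backend.noop_backend import NoOpBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import InputDocument

stream = BytesIO(b"RIFF....WAVEfmt ")  # any non-empty stream passes validation
in_doc = InputDocument(
    path_or_stream=stream,
    format=InputFormat.AUDIO,
    backend=NoOpBackend,
    filename="clip.wav",  # hypothetical name for the stream
)
backend = NoOpBackend(in_doc=in_doc, path_or_stream=stream)
assert backend.is_valid()                    # non-empty BytesIO counts as valid
assert not NoOpBackend.supports_pagination() # audio inputs are not paginated
```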
docling/cli/main.py

@@ -29,6 +29,15 @@ from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBacke
 from docling.backend.pdf_backend import PdfDocumentBackend
 from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
 from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
+from docling.datamodel.asr_model_specs import (
+    WHISPER_BASE,
+    WHISPER_LARGE,
+    WHISPER_MEDIUM,
+    WHISPER_SMALL,
+    WHISPER_TINY,
+    WHISPER_TURBO,
+    AsrModelType,
+)
 from docling.datamodel.base_models import (
     ConversionStatus,
     FormatToExtensions,
@@ -37,12 +46,14 @@ from docling.datamodel.base_models import (
 )
 from docling.datamodel.document import ConversionResult
 from docling.datamodel.pipeline_options import (
+    AsrPipelineOptions,
     EasyOcrOptions,
     OcrOptions,
     PaginatedPipelineOptions,
     PdfBackend,
-    PdfPipeline,
     PdfPipelineOptions,
+    PipelineOptions,
+    ProcessingPipeline,
     TableFormerMode,
     VlmPipelineOptions,
 )
@@ -54,8 +65,14 @@ from docling.datamodel.vlm_model_specs import (
     SMOLDOCLING_TRANSFORMERS,
     VlmModelType,
 )
-from docling.document_converter import DocumentConverter, FormatOption, PdfFormatOption
+from docling.document_converter import (
+    AudioFormatOption,
+    DocumentConverter,
+    FormatOption,
+    PdfFormatOption,
+)
 from docling.models.factories import get_ocr_factory
+from docling.pipeline.asr_pipeline import AsrPipeline
 from docling.pipeline.vlm_pipeline import VlmPipeline
 
 warnings.filterwarnings(action="ignore", category=UserWarning, module="pydantic|torch")
@@ -296,13 +313,17 @@ def convert(  # noqa: C901
         ),
     ] = ImageRefMode.EMBEDDED,
     pipeline: Annotated[
-        PdfPipeline,
+        ProcessingPipeline,
         typer.Option(..., help="Choose the pipeline to process PDF or image files."),
-    ] = PdfPipeline.STANDARD,
+    ] = ProcessingPipeline.STANDARD,
     vlm_model: Annotated[
         VlmModelType,
         typer.Option(..., help="Choose the VLM model to use with PDF or image files."),
     ] = VlmModelType.SMOLDOCLING,
+    asr_model: Annotated[
+        AsrModelType,
+        typer.Option(..., help="Choose the ASR model to use with audio/video files."),
+    ] = AsrModelType.WHISPER_TINY,
     ocr: Annotated[
         bool,
         typer.Option(
@@ -450,12 +471,14 @@ def convert(  # noqa: C901
         ),
     ] = None,
 ):
+    log_format = "%(asctime)s\t%(levelname)s\t%(name)s: %(message)s"
+
     if verbose == 0:
-        logging.basicConfig(level=logging.WARNING)
+        logging.basicConfig(level=logging.WARNING, format=log_format)
     elif verbose == 1:
-        logging.basicConfig(level=logging.INFO)
+        logging.basicConfig(level=logging.INFO, format=log_format)
     else:
-        logging.basicConfig(level=logging.DEBUG)
+        logging.basicConfig(level=logging.DEBUG, format=log_format)
 
     settings.debug.visualize_cells = debug_visualize_cells
     settings.debug.visualize_layout = debug_visualize_layout
@@ -530,9 +553,12 @@ def convert(  # noqa: C901
         ocr_options.lang = ocr_lang_list
 
     accelerator_options = AcceleratorOptions(num_threads=num_threads, device=device)
-    pipeline_options: PaginatedPipelineOptions
+    # pipeline_options: PaginatedPipelineOptions
+    pipeline_options: PipelineOptions
 
-    if pipeline == PdfPipeline.STANDARD:
+    format_options: Dict[InputFormat, FormatOption] = {}
+
+    if pipeline == ProcessingPipeline.STANDARD:
         pipeline_options = PdfPipelineOptions(
             allow_external_plugins=allow_external_plugins,
             enable_remote_services=enable_remote_services,
@@ -574,7 +600,13 @@ def convert(  # noqa: C901
             pipeline_options=pipeline_options,
             backend=backend,  # pdf_backend
         )
-    elif pipeline == PdfPipeline.VLM:
+
+        format_options = {
+            InputFormat.PDF: pdf_format_option,
+            InputFormat.IMAGE: pdf_format_option,
+        }
+
+    elif pipeline == ProcessingPipeline.VLM:
         pipeline_options = VlmPipelineOptions(
             enable_remote_services=enable_remote_services,
         )
@@ -600,13 +632,48 @@ def convert(  # noqa: C901
             pipeline_cls=VlmPipeline, pipeline_options=pipeline_options
         )
 
+        format_options = {
+            InputFormat.PDF: pdf_format_option,
+            InputFormat.IMAGE: pdf_format_option,
+        }
+
+    elif pipeline == ProcessingPipeline.ASR:
+        pipeline_options = AsrPipelineOptions(
+            # enable_remote_services=enable_remote_services,
+            # artifacts_path = artifacts_path
+        )
+
+        if asr_model == AsrModelType.WHISPER_TINY:
+            pipeline_options.asr_options = WHISPER_TINY
+        elif asr_model == AsrModelType.WHISPER_SMALL:
+            pipeline_options.asr_options = WHISPER_SMALL
+        elif asr_model == AsrModelType.WHISPER_MEDIUM:
+            pipeline_options.asr_options = WHISPER_MEDIUM
+        elif asr_model == AsrModelType.WHISPER_BASE:
+            pipeline_options.asr_options = WHISPER_BASE
+        elif asr_model == AsrModelType.WHISPER_LARGE:
+            pipeline_options.asr_options = WHISPER_LARGE
+        elif asr_model == AsrModelType.WHISPER_TURBO:
+            pipeline_options.asr_options = WHISPER_TURBO
+        else:
+            _log.error(f"{asr_model} is not known")
+            raise ValueError(f"{asr_model} is not known")
+
+        _log.info(f"pipeline_options: {pipeline_options}")
+
+        audio_format_option = AudioFormatOption(
+            pipeline_cls=AsrPipeline,
+            pipeline_options=pipeline_options,
+        )
+
+        format_options = {
+            InputFormat.AUDIO: audio_format_option,
+        }
+
     if artifacts_path is not None:
         pipeline_options.artifacts_path = artifacts_path
+        # audio_pipeline_options.artifacts_path = artifacts_path
 
-    format_options: Dict[InputFormat, FormatOption] = {
-        InputFormat.PDF: pdf_format_option,
-        InputFormat.IMAGE: pdf_format_option,
-    }
     doc_converter = DocumentConverter(
         allowed_formats=from_formats,
         format_options=format_options,
@@ -614,6 +681,7 @@ def convert(  # noqa: C901
 
     start_time = time.time()
 
+    _log.info(f"paths: {input_doc_paths}")
     conv_results = doc_converter.convert_all(
         input_doc_paths, headers=parsed_headers, raises_on_error=abort_on_error
    )
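With these changes the `convert` command gains an ASR path; assuming typer's usual flag derivation from the parameter names above, an invocation along the lines of `docling --pipeline asr --asr-model whisper_turbo recording.mp3` would select the turbo preset (the flag spellings and the file name are assumptions, not taken from the diff).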
docling/datamodel/asr_model_specs.py — new file, 92 lines

@@ -0,0 +1,92 @@
+import logging
+from enum import Enum
+
+from pydantic import (
+    AnyUrl,
+)
+
+from docling.datamodel.accelerator_options import AcceleratorDevice
+from docling.datamodel.pipeline_options_asr_model import (
+    # AsrResponseFormat,
+    # ApiAsrOptions,
+    InferenceAsrFramework,
+    InlineAsrNativeWhisperOptions,
+    TransformersModelType,
+)
+
+_log = logging.getLogger(__name__)
+
+WHISPER_TINY = InlineAsrNativeWhisperOptions(
+    repo_id="tiny",
+    inference_framework=InferenceAsrFramework.WHISPER,
+    verbose=True,
+    timestamps=True,
+    word_timestamps=True,
+    temperatue=0.0,
+    max_new_tokens=256,
+    max_time_chunk=30.0,
+)
+
+WHISPER_SMALL = InlineAsrNativeWhisperOptions(
+    repo_id="small",
+    inference_framework=InferenceAsrFramework.WHISPER,
+    verbose=True,
+    timestamps=True,
+    word_timestamps=True,
+    temperatue=0.0,
+    max_new_tokens=256,
+    max_time_chunk=30.0,
+)
+
+WHISPER_MEDIUM = InlineAsrNativeWhisperOptions(
+    repo_id="medium",
+    inference_framework=InferenceAsrFramework.WHISPER,
+    verbose=True,
+    timestamps=True,
+    word_timestamps=True,
+    temperatue=0.0,
+    max_new_tokens=256,
+    max_time_chunk=30.0,
+)
+
+WHISPER_BASE = InlineAsrNativeWhisperOptions(
+    repo_id="base",
+    inference_framework=InferenceAsrFramework.WHISPER,
+    verbose=True,
+    timestamps=True,
+    word_timestamps=True,
+    temperatue=0.0,
+    max_new_tokens=256,
+    max_time_chunk=30.0,
+)
+
+WHISPER_LARGE = InlineAsrNativeWhisperOptions(
+    repo_id="large",
+    inference_framework=InferenceAsrFramework.WHISPER,
+    verbose=True,
+    timestamps=True,
+    word_timestamps=True,
+    temperatue=0.0,
+    max_new_tokens=256,
+    max_time_chunk=30.0,
+)
+
+WHISPER_TURBO = InlineAsrNativeWhisperOptions(
+    repo_id="turbo",
+    inference_framework=InferenceAsrFramework.WHISPER,
+    verbose=True,
+    timestamps=True,
+    word_timestamps=True,
+    temperatue=0.0,
+    max_new_tokens=256,
+    max_time_chunk=30.0,
+)
+
+
+class AsrModelType(str, Enum):
+    WHISPER_TINY = "whisper_tiny"
+    WHISPER_SMALL = "whisper_small"
+    WHISPER_MEDIUM = "whisper_medium"
+    WHISPER_BASE = "whisper_base"
+    WHISPER_LARGE = "whisper_large"
+    WHISPER_TURBO = "whisper_turbo"
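A hedged sketch of how these presets plug into a converter, mirroring the CLI wiring earlier in this diff; only the input file name is an assumption:

```python
from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline

# Pick one of the presets defined above and attach it to the ASR pipeline.
pipeline_options = AsrPipelineOptions(asr_options=asr_model_specs.WHISPER_TURBO)
converter = DocumentConverter(
    format_options={
        InputFormat.AUDIO: AudioFormatOption(
            pipeline_cls=AsrPipeline, pipeline_options=pipeline_options
        )
    }
)
result = converter.convert("interview.wav")  # hypothetical input
print(result.document.export_to_markdown())
```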
docling/datamodel/base_models.py

@@ -49,6 +49,7 @@ class InputFormat(str, Enum):
     XML_USPTO = "xml_uspto"
     XML_JATS = "xml_jats"
     JSON_DOCLING = "json_docling"
+    AUDIO = "audio"
 
 
 class OutputFormat(str, Enum):
@@ -73,6 +74,7 @@ FormatToExtensions: Dict[InputFormat, List[str]] = {
     InputFormat.XLSX: ["xlsx", "xlsm"],
     InputFormat.XML_USPTO: ["xml", "txt"],
     InputFormat.JSON_DOCLING: ["json"],
+    InputFormat.AUDIO: ["wav", "mp3"],
 }
 
 FormatToMimeType: Dict[InputFormat, List[str]] = {
@@ -104,6 +106,7 @@ FormatToMimeType: Dict[InputFormat, List[str]] = {
     ],
     InputFormat.XML_USPTO: ["application/xml", "text/plain"],
     InputFormat.JSON_DOCLING: ["application/json"],
+    InputFormat.AUDIO: ["audio/x-wav", "audio/mpeg", "audio/wav", "audio/mp3"],
 }
 
 MimeTypeToFormat: dict[str, list[InputFormat]] = {
@@ -298,7 +301,7 @@ class OpenAiChatMessage(BaseModel):
 class OpenAiResponseChoice(BaseModel):
     index: int
     message: OpenAiChatMessage
-    finish_reason: str
+    finish_reason: Optional[str]
 
 
 class OpenAiResponseUsage(BaseModel):
docling/datamodel/document.py

@@ -249,7 +249,7 @@ class _DocumentConversionInput(BaseModel):
         backend: Type[AbstractDocumentBackend]
         if format not in format_options.keys():
             _log.error(
-                f"Input document {obj.name} does not match any allowed format."
+                f"Input document {obj.name} with format {format} does not match any allowed format: ({format_options.keys()})"
             )
             backend = _DummyBackend
         else:
@@ -318,6 +318,8 @@ class _DocumentConversionInput(BaseModel):
         mime = mime or _DocumentConversionInput._detect_csv(content)
         mime = mime or "text/plain"
         formats = MimeTypeToFormat.get(mime, [])
+        _log.info(f"detected formats: {formats}")
+
         if formats:
             if len(formats) == 1 and mime not in ("text/plain"):
                 return formats[0]
docling/datamodel/pipeline_options.py

@@ -11,8 +11,13 @@ from pydantic import (
 )
 from typing_extensions import deprecated
 
+from docling.datamodel import asr_model_specs
+
 # Import the following for backwards compatibility
 from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
+from docling.datamodel.pipeline_options_asr_model import (
+    InlineAsrOptions,
+)
 from docling.datamodel.pipeline_options_vlm_model import (
     ApiVlmOptions,
     InferenceFramework,
@@ -202,7 +207,7 @@ smolvlm_picture_description = PictureDescriptionVlmOptions(
 
 # GraniteVision
 granite_picture_description = PictureDescriptionVlmOptions(
-    repo_id="ibm-granite/granite-vision-3.1-2b-preview",
+    repo_id="ibm-granite/granite-vision-3.2-2b-preview",
     prompt="What is shown in this image?",
 )
 
@@ -260,6 +265,11 @@ class VlmPipelineOptions(PaginatedPipelineOptions):
     )
 
 
+class AsrPipelineOptions(PipelineOptions):
+    asr_options: Union[InlineAsrOptions] = asr_model_specs.WHISPER_TINY
+    artifacts_path: Optional[Union[Path, str]] = None
+
+
 class PdfPipelineOptions(PaginatedPipelineOptions):
     """Options for the PDF pipeline."""
 
@@ -297,6 +307,7 @@ class PdfPipelineOptions(PaginatedPipelineOptions):
     )
 
 
-class PdfPipeline(str, Enum):
+class ProcessingPipeline(str, Enum):
     STANDARD = "standard"
     VLM = "vlm"
+    ASR = "asr"
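A tiny sanity-check sketch of the new options class and the renamed pipeline enum (not taken from the diff):

```python
from docling.datamodel.pipeline_options import AsrPipelineOptions, ProcessingPipeline

opts = AsrPipelineOptions()
print(opts.asr_options.repo_id)  # "tiny" — defaults to asr_model_specs.WHISPER_TINY
print(list(ProcessingPipeline))  # standard, vlm, and the new asr member
```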
docling/datamodel/pipeline_options_asr_model.py (new file, 57 lines)

from enum import Enum
from typing import Any, Dict, List, Literal, Optional, Union

from pydantic import AnyUrl, BaseModel
from typing_extensions import deprecated

from docling.datamodel.accelerator_options import AcceleratorDevice
from docling.datamodel.pipeline_options_vlm_model import (
    # InferenceFramework,
    TransformersModelType,
)


class BaseAsrOptions(BaseModel):
    kind: str
    # prompt: str


class InferenceAsrFramework(str, Enum):
    # MLX = "mlx"  # disabled for now
    # TRANSFORMERS = "transformers"  # disabled for now
    WHISPER = "whisper"


class InlineAsrOptions(BaseAsrOptions):
    kind: Literal["inline_model_options"] = "inline_model_options"

    repo_id: str

    verbose: bool = False
    timestamps: bool = True

    temperature: float = 0.0
    max_new_tokens: int = 256
    max_time_chunk: float = 30.0

    torch_dtype: Optional[str] = None
    supported_devices: List[AcceleratorDevice] = [
        AcceleratorDevice.CPU,
        AcceleratorDevice.CUDA,
        AcceleratorDevice.MPS,
    ]

    @property
    def repo_cache_folder(self) -> str:
        return self.repo_id.replace("/", "--")


class InlineAsrNativeWhisperOptions(InlineAsrOptions):
    inference_framework: InferenceAsrFramework = InferenceAsrFramework.WHISPER

    language: str = "en"
    supported_devices: List[AcceleratorDevice] = [
        AcceleratorDevice.CPU,
        AcceleratorDevice.CUDA,
    ]
    word_timestamps: bool = True
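As an illustration of the schema introduced here, a custom native-Whisper spec could look as follows; the constant name WHISPER_SMALL and the model name "small" are assumptions for the example, not part of the commit:

# --- sketch, not part of the commit ---
from docling.datamodel.accelerator_options import AcceleratorDevice
from docling.datamodel.pipeline_options_asr_model import (
    InferenceAsrFramework,
    InlineAsrNativeWhisperOptions,
)

WHISPER_SMALL = InlineAsrNativeWhisperOptions(
    repo_id="small",  # native whisper model name, passed to whisper.load_model()
    inference_framework=InferenceAsrFramework.WHISPER,
    language="en",
    timestamps=True,
    word_timestamps=True,
    supported_devices=[AcceleratorDevice.CPU, AcceleratorDevice.CUDA],
)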
docling/document_converter.py
@@ -19,6 +19,7 @@ from docling.backend.md_backend import MarkdownDocumentBackend
 from docling.backend.msexcel_backend import MsExcelDocumentBackend
 from docling.backend.mspowerpoint_backend import MsPowerpointDocumentBackend
 from docling.backend.msword_backend import MsWordDocumentBackend
+from docling.backend.noop_backend import NoOpBackend
 from docling.backend.xml.jats_backend import JatsDocumentBackend
 from docling.backend.xml.uspto_backend import PatentUsptoDocumentBackend
 from docling.datamodel.base_models import (
@@ -41,6 +42,7 @@ from docling.datamodel.settings import (
     settings,
 )
 from docling.exceptions import ConversionError
+from docling.pipeline.asr_pipeline import AsrPipeline
 from docling.pipeline.base_pipeline import BasePipeline
 from docling.pipeline.simple_pipeline import SimplePipeline
 from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
@@ -118,6 +120,11 @@ class PdfFormatOption(FormatOption):
     backend: Type[AbstractDocumentBackend] = DoclingParseV4DocumentBackend


+class AudioFormatOption(FormatOption):
+    pipeline_cls: Type = AsrPipeline
+    backend: Type[AbstractDocumentBackend] = NoOpBackend
+
+
 def _get_default_option(format: InputFormat) -> FormatOption:
     format_to_default_options = {
         InputFormat.CSV: FormatOption(
@@ -156,6 +163,7 @@ def _get_default_option(format: InputFormat) -> FormatOption:
         InputFormat.JSON_DOCLING: FormatOption(
             pipeline_cls=SimplePipeline, backend=DoclingJSONBackend
         ),
+        InputFormat.AUDIO: FormatOption(pipeline_cls=AsrPipeline, backend=NoOpBackend),
     }
     if (options := format_to_default_options.get(format)) is not None:
         return options
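Because of the default mapping added above, audio inputs can be routed to AsrPipeline/NoOpBackend without an explicit format option; a rough sketch (the sample path is the test file added later in this commit, and is only illustrative here):

# --- sketch, not part of the commit ---
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter

converter = DocumentConverter(allowed_formats=[InputFormat.AUDIO])
result = converter.convert("tests/data/audio/sample_10s.mp3")
print(result.document.export_to_markdown())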
docling/pipeline/asr_pipeline.py (new file, 253 lines)

import logging
import os
import re
from io import BytesIO
from pathlib import Path
from typing import List, Optional, Union, cast

from docling_core.types.doc import DoclingDocument, DocumentOrigin

# import whisper  # type: ignore
# import librosa
# import numpy as np
# import soundfile as sf  # type: ignore
from docling_core.types.doc.labels import DocItemLabel
from pydantic import BaseModel, Field, validator

from docling.backend.abstract_backend import AbstractDocumentBackend
from docling.backend.noop_backend import NoOpBackend

# from pydub import AudioSegment  # type: ignore
# from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline
from docling.datamodel.accelerator_options import (
    AcceleratorOptions,
)
from docling.datamodel.base_models import (
    ConversionStatus,
    FormatToMimeType,
)
from docling.datamodel.document import ConversionResult, InputDocument
from docling.datamodel.pipeline_options import (
    AsrPipelineOptions,
)
from docling.datamodel.pipeline_options_asr_model import (
    InlineAsrNativeWhisperOptions,
    # AsrResponseFormat,
    InlineAsrOptions,
)
from docling.datamodel.pipeline_options_vlm_model import (
    InferenceFramework,
)
from docling.datamodel.settings import settings
from docling.pipeline.base_pipeline import BasePipeline
from docling.utils.accelerator_utils import decide_device
from docling.utils.profiling import ProfilingScope, TimeRecorder

_log = logging.getLogger(__name__)


class _ConversationWord(BaseModel):
    text: str
    start_time: Optional[float] = Field(
        None, description="Start time in seconds from video start"
    )
    end_time: Optional[float] = Field(
        None, ge=0, description="End time in seconds from video start"
    )


class _ConversationItem(BaseModel):
    text: str
    start_time: Optional[float] = Field(
        None, description="Start time in seconds from video start"
    )
    end_time: Optional[float] = Field(
        None, ge=0, description="End time in seconds from video start"
    )
    speaker_id: Optional[int] = Field(None, description="Numeric speaker identifier")
    speaker: Optional[str] = Field(
        None, description="Speaker name, defaults to speaker-{speaker_id}"
    )
    words: Optional[list[_ConversationWord]] = Field(
        None, description="Individual words with time-stamps"
    )

    def __lt__(self, other):
        if not isinstance(other, _ConversationItem):
            return NotImplemented
        return self.start_time < other.start_time

    def __eq__(self, other):
        if not isinstance(other, _ConversationItem):
            return NotImplemented
        return self.start_time == other.start_time

    def to_string(self) -> str:
        """Format the conversation entry as a string"""
        result = ""
        if (self.start_time is not None) and (self.end_time is not None):
            result += f"[time: {self.start_time}-{self.end_time}] "

        if self.speaker is not None:
            result += f"[speaker:{self.speaker}] "

        result += self.text
        return result


class _NativeWhisperModel:
    def __init__(
        self,
        enabled: bool,
        artifacts_path: Optional[Path],
        accelerator_options: AcceleratorOptions,
        asr_options: InlineAsrNativeWhisperOptions,
    ):
        """
        Transcriber using native Whisper.
        """
        self.enabled = enabled

        _log.info(f"artifacts-path: {artifacts_path}")
        _log.info(f"accelerator_options: {accelerator_options}")

        if self.enabled:
            try:
                import whisper  # type: ignore
            except ImportError:
                raise ImportError(
                    "whisper is not installed. Please install it via `pip install openai-whisper` or do `uv sync --extra asr`."
                )
            self.asr_options = asr_options
            self.max_tokens = asr_options.max_new_tokens
            self.temperature = asr_options.temperature

            self.device = decide_device(
                accelerator_options.device,
                supported_devices=asr_options.supported_devices,
            )
            _log.info(f"Available device for Whisper: {self.device}")

            self.model_name = asr_options.repo_id
            _log.info(f"loading _NativeWhisperModel({self.model_name})")
            if artifacts_path is not None:
                _log.info(f"loading {self.model_name} from {artifacts_path}")
                self.model = whisper.load_model(
                    name=self.model_name,
                    device=self.device,
                    download_root=str(artifacts_path),
                )
            else:
                self.model = whisper.load_model(
                    name=self.model_name, device=self.device
                )

            self.verbose = asr_options.verbose
            self.timestamps = asr_options.timestamps
            self.word_timestamps = asr_options.word_timestamps

    def run(self, conv_res: ConversionResult) -> ConversionResult:
        audio_path: Path = Path(conv_res.input.file).resolve()

        try:
            conversation = self.transcribe(audio_path)

            # Ensure we have a proper DoclingDocument
            origin = DocumentOrigin(
                filename=conv_res.input.file.name or "audio.wav",
                mimetype="audio/x-wav",
                binary_hash=conv_res.input.document_hash,
            )
            conv_res.document = DoclingDocument(
                name=conv_res.input.file.stem or "audio.wav", origin=origin
            )

            for citem in conversation:
                conv_res.document.add_text(
                    label=DocItemLabel.TEXT, text=citem.to_string()
                )

            conv_res.status = ConversionStatus.SUCCESS
            return conv_res

        except Exception as exc:
            _log.error(f"Audio transcription has an error: {exc}")

            conv_res.status = ConversionStatus.FAILURE
            return conv_res

    def transcribe(self, fpath: Path) -> list[_ConversationItem]:
        result = self.model.transcribe(
            str(fpath), verbose=self.verbose, word_timestamps=self.word_timestamps
        )

        convo: list[_ConversationItem] = []
        for _ in result["segments"]:
            item = _ConversationItem(
                start_time=_["start"], end_time=_["end"], text=_["text"], words=[]
            )
            if "words" in _ and self.word_timestamps:
                item.words = []
                for __ in _["words"]:
                    item.words.append(
                        _ConversationWord(
                            start_time=__["start"],
                            end_time=__["end"],
                            text=__["word"],
                        )
                    )
            convo.append(item)

        return convo


class AsrPipeline(BasePipeline):
    def __init__(self, pipeline_options: AsrPipelineOptions):
        super().__init__(pipeline_options)
        self.keep_backend = True

        self.pipeline_options: AsrPipelineOptions = pipeline_options

        artifacts_path: Optional[Path] = None
        if pipeline_options.artifacts_path is not None:
            artifacts_path = Path(pipeline_options.artifacts_path).expanduser()
        elif settings.artifacts_path is not None:
            artifacts_path = Path(settings.artifacts_path).expanduser()

        if artifacts_path is not None and not artifacts_path.is_dir():
            raise RuntimeError(
                f"The value of {artifacts_path=} is not valid. "
                "When defined, it must point to a folder containing all models required by the pipeline."
            )

        if isinstance(self.pipeline_options.asr_options, InlineAsrNativeWhisperOptions):
            asr_options: InlineAsrNativeWhisperOptions = (
                self.pipeline_options.asr_options
            )
            self._model = _NativeWhisperModel(
                enabled=True,  # must be always enabled for this pipeline to make sense.
                artifacts_path=artifacts_path,
                accelerator_options=pipeline_options.accelerator_options,
                asr_options=asr_options,
            )
        else:
            _log.error(f"No model support for {self.pipeline_options.asr_options}")

    def _determine_status(self, conv_res: ConversionResult) -> ConversionStatus:
        status = ConversionStatus.SUCCESS
        return status

    @classmethod
    def get_default_options(cls) -> AsrPipelineOptions:
        return AsrPipelineOptions()

    def _build_document(self, conv_res: ConversionResult) -> ConversionResult:
        _log.info(f"start _build_document in AsrPipeline: {conv_res.input.file}")
        with TimeRecorder(conv_res, "doc_build", scope=ProfilingScope.DOCUMENT):
            self._model.run(conv_res=conv_res)

        return conv_res

    @classmethod
    def is_backend_supported(cls, backend: AbstractDocumentBackend):
        return isinstance(backend, NoOpBackend)
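For reference, a sketch of the text layout emitted by _ConversationItem.to_string(), an internal helper of this module, shown only to illustrate the lines added to the DoclingDocument; not part of the commit:

# --- sketch, not part of the commit ---
from docling.pipeline.asr_pipeline import _ConversationItem  # internal, illustrative only

item = _ConversationItem(
    text="Shakespeare on Scenery by Oscar Wilde", start_time=0.0, end_time=4.0
)
print(item.to_string())
# [time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde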
docs/examples/minimal_asr_pipeline.py (new file, 56 lines)

from pathlib import Path

from docling_core.types.doc import DoclingDocument

from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import ConversionStatus, InputFormat
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline


def get_asr_converter():
    """Create a DocumentConverter configured for ASR with whisper_turbo model."""
    pipeline_options = AsrPipelineOptions()
    pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

    converter = DocumentConverter(
        format_options={
            InputFormat.AUDIO: AudioFormatOption(
                pipeline_cls=AsrPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )
    return converter


def asr_pipeline_conversion(audio_path: Path) -> DoclingDocument:
    """ASR pipeline conversion using whisper_turbo"""
    # Check if the test audio file exists
    assert audio_path.exists(), f"Test audio file not found: {audio_path}"

    converter = get_asr_converter()

    # Convert the audio file
    result: ConversionResult = converter.convert(audio_path)

    # Verify conversion was successful
    assert result.status == ConversionStatus.SUCCESS, (
        f"Conversion failed with status: {result.status}"
    )
    return result.document


if __name__ == "__main__":
    audio_path = Path("tests/data/audio/sample_10s.mp3")

    doc = asr_pipeline_conversion(audio_path=audio_path)
    print(doc.export_to_markdown())

    # Expected output:
    #
    # [time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde
    #
    # [time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain.
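A small variation on the example above, selecting the pipeline default WHISPER_TINY spec instead of WHISPER_TURBO; a sketch, not part of the commit:

# --- sketch, not part of the commit ---
from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline

pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = asr_model_specs.WHISPER_TINY  # pipeline default, smallest model
converter = DocumentConverter(
    format_options={
        InputFormat.AUDIO: AudioFormatOption(
            pipeline_cls=AsrPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
doc = converter.convert("tests/data/audio/sample_10s.mp3").document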
docs/index.md
@@ -20,14 +20,15 @@ Docling simplifies document processing, parsing diverse formats — including ad

 ## Features

-* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
+* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, images (PNG, TIFF, JPEG, ...), and more
 * 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
 * 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
-* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
+* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, [DocTags](https://arxiv.org/abs/2503.11576) and lossless JSON
 * 🔒 Local execution capabilities for sensitive data and air-gapped environments
 * 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
 * 🔍 Extensive OCR support for scanned PDFs and images
-* 🥚 Support of several Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview)) 🔥
+* 👓 Support of several Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview))
+* 🎙️ Support for Audio with Automatic Speech Recognition (ASR) models
 * 💻 Simple and convenient CLI

 ### Coming soon
@@ -80,6 +80,7 @@ nav:
 - "VLM pipeline with SmolDocling": examples/minimal_vlm_pipeline.py
 - "VLM pipeline with remote model": examples/vlm_pipeline_api_model.py
 - "VLM comparison": examples/compare_vlm_models.py
+- "ASR pipeline with Whisper": examples/minimal_asr_pipeline.py
 - "Figure export": examples/export_figures.py
 - "Table export": examples/export_tables.py
 - "Multimodal export": examples/export_multimodal.py
pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "docling"
-version = "2.37.0" # DO NOT EDIT, updated automatically
+version = "2.39.0" # DO NOT EDIT, updated automatically
 description = "SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications."
 license = "MIT"
 keywords = [
@@ -44,7 +44,8 @@ authors = [
 requires-python = '>=3.9,<4.0'
 dependencies = [
 'pydantic (>=2.0.0,<3.0.0)',
-'docling-core[chunking] (>=2.29.0,<3.0.0)',
+'docling-core[chunking] (>=2.39.0,<3.0.0)',
+'docling-ibm-models (>=3.4.4,<4.0.0)',
 'docling-parse (>=4.0.0,<5.0.0)',
 'docling-ibm-models (>=3.6.0,<4)',
 'filetype (>=1.2.0,<2.0.0)',
@@ -99,6 +100,9 @@ rapidocr = [
 # 'onnxruntime (>=1.7.0,<2.0.0) ; python_version >= "3.10"',
 # 'onnxruntime (>=1.7.0,<1.20.0) ; python_version < "3.10"',
 ]
+asr = [
+"openai-whisper>=20240930",
+]

 [dependency-groups]
 dev = [
@@ -145,6 +149,9 @@ constraints = [
 package = true
 default-groups = "all"
+
+[tool.uv.sources]
+openai-whisper = { git = "https://github.com/openai/whisper.git", rev = "dd985ac4b90cafeef8712f2998d62c59c3e62d22" }

 [tool.setuptools.packages.find]
 include = ["docling*"]
tests/data/audio/sample_10s.mp3 (new binary file, not shown)
@@ -160,8 +160,8 @@
 <row_6><col_0><row_header>TableFormer</col_0><col_1><body>95.4</col_1><col_2><body>90.1</col_2><col_3><body>93.6</col_3></row_6>
 </table>
 <caption><location><page_7><loc_50><loc_13><loc_89><loc_17></location>Table 4: Results of structure with content retrieved using cell detection on PubTabNet. In all cases the input is PDF documents with cropped tables.</caption>
-<paragraph><location><page_8><loc_9><loc_89><loc_10><loc_90></location>- a.</paragraph>
+<paragraph><location><page_8><loc_9><loc_89><loc_10><loc_90></location>a.</paragraph>
-<paragraph><location><page_8><loc_11><loc_89><loc_82><loc_90></location>- Red - PDF cells, Green - predicted bounding boxes, Blue - post-processed predictions matched to PDF cells</paragraph>
+<paragraph><location><page_8><loc_11><loc_89><loc_82><loc_90></location>Red - PDF cells, Green - predicted bounding boxes, Blue - post-processed predictions matched to PDF cells</paragraph>
 <subtitle-level-1><location><page_8><loc_9><loc_87><loc_46><loc_88></location>Japanese language (previously unseen by TableFormer):</subtitle-level-1>
 <subtitle-level-1><location><page_8><loc_50><loc_87><loc_70><loc_88></location>Example table from FinTabNet:</subtitle-level-1>
 <figure>
@@ -216,7 +216,7 @@
 <paragraph><location><page_8><loc_50><loc_18><loc_89><loc_35></location>In this paper, we presented TableFormer an end-to-end transformer based approach to predict table structures and bounding boxes of cells from an image. This approach enables us to recreate the table structure, and extract the cell content from PDF or OCR by using bounding boxes. Additionally, it provides the versatility required in real-world scenarios when dealing with various types of PDF documents, and languages. Furthermore, our method outperforms all state-of-the-arts with a wide margin. Finally, we introduce "SynthTabNet" a challenging synthetically generated dataset that reinforces missing characteristics from other datasets.</paragraph>
 <subtitle-level-1><location><page_8><loc_50><loc_14><loc_60><loc_15></location>References</subtitle-level-1>
 <paragraph><location><page_8><loc_51><loc_10><loc_89><loc_12></location>[1] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-</paragraph>
-<paragraph><location><page_9><loc_11><loc_85><loc_47><loc_90></location>- end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 , pages 213-229, Cham, 2020. Springer International Publishing. 5</paragraph>
+<paragraph><location><page_9><loc_11><loc_85><loc_47><loc_90></location>end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 , pages 213-229, Cham, 2020. Springer International Publishing. 5</paragraph>
 <paragraph><location><page_9><loc_9><loc_81><loc_47><loc_85></location>[2] Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling Mao. Complicated table structure recognition. arXiv preprint arXiv:1908.04729 , 2019. 3</paragraph>
 <paragraph><location><page_9><loc_9><loc_77><loc_47><loc_81></location>[3] Bertrand Couasnon and Aurelie Lemaitre. Recognition of Tables and Forms , pages 647-677. Springer London, London, 2014. 2</paragraph>
 <paragraph><location><page_9><loc_9><loc_71><loc_47><loc_76></location>[4] Herv´e D´ejean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2</paragraph>
@@ -254,7 +254,7 @@
 <paragraph><location><page_10><loc_8><loc_20><loc_47><loc_25></location>[35] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 4651-4659, 2016. 4</paragraph>
 <paragraph><location><page_10><loc_8><loc_13><loc_47><loc_19></location>[36] Xinyi Zheng, Doug Burdick, Lucian Popa, Peter Zhong, and Nancy Xin Ru Wang. Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context. Winter Conference for Applications in Computer Vision (WACV) , 2021. 2, 3</paragraph>
 <paragraph><location><page_10><loc_8><loc_10><loc_47><loc_12></location>[37] Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: Data, model,</paragraph>
-<paragraph><location><page_10><loc_54><loc_85><loc_89><loc_90></location>- and evaluation. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision ECCV 2020 , pages 564-580, Cham, 2020. Springer International Publishing. 2, 3, 7</paragraph>
+<paragraph><location><page_10><loc_54><loc_85><loc_89><loc_90></location>and evaluation. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision ECCV 2020 , pages 564-580, Cham, 2020. Springer International Publishing. 2, 3, 7</paragraph>
 <paragraph><location><page_10><loc_50><loc_80><loc_89><loc_85></location>[38] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. Publaynet: Largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 1015-1022, 2019. 1</paragraph>
 <subtitle-level-1><location><page_11><loc_22><loc_83><loc_76><loc_86></location>TableFormer: Table Structure Understanding with Transformers Supplementary Material</subtitle-level-1>
 <subtitle-level-1><location><page_11><loc_8><loc_78><loc_29><loc_80></location>1. Details on the datasets</subtitle-level-1>
@@ -285,7 +285,7 @@
 <paragraph><location><page_12><loc_8><loc_42><loc_47><loc_47></location>1. Get the minimal grid dimensions - number of rows and columns for the predicted table structure. This represents the most granular grid for the underlying table structure.</paragraph>
 <paragraph><location><page_12><loc_8><loc_36><loc_47><loc_42></location>2. Generate pair-wise matches between the bounding boxes of the PDF cells and the predicted cells. The Intersection Over Union (IOU) metric is used to evaluate the quality of the matches.</paragraph>
 <paragraph><location><page_12><loc_8><loc_33><loc_47><loc_36></location>3. Use a carefully selected IOU threshold to designate the matches as "good" ones and "bad" ones.</paragraph>
-<paragraph><location><page_12><loc_8><loc_29><loc_47><loc_33></location>- 3.a. If all IOU scores in a column are below the threshold, discard all predictions (structure and bounding boxes) for that column.</paragraph>
+<paragraph><location><page_12><loc_8><loc_29><loc_47><loc_33></location>3.a. If all IOU scores in a column are below the threshold, discard all predictions (structure and bounding boxes) for that column.</paragraph>
 <paragraph><location><page_12><loc_8><loc_24><loc_47><loc_28></location>4. Find the best-fitting content alignment for the predicted cells with good IOU per each column. The alignment of the column can be identified by the following formula:</paragraph>
 <paragraph><location><page_12><loc_8><loc_13><loc_47><loc_16></location>where c is one of { left, centroid, right } and x$_{c}$ is the xcoordinate for the corresponding point.</paragraph>
 <paragraph><location><page_12><loc_8><loc_10><loc_47><loc_13></location>5. Use the alignment computed in step 4, to compute the median x -coordinate for all table columns and the me-</paragraph>
@@ -294,10 +294,10 @@
 <paragraph><location><page_12><loc_50><loc_42><loc_89><loc_51></location>8. In some rare occasions, we have noticed that TableFormer can confuse a single column as two. When the postprocessing steps are applied, this results with two predicted columns pointing to the same PDF column. In such case we must de-duplicate the columns according to highest total column intersection score.</paragraph>
 <paragraph><location><page_12><loc_50><loc_28><loc_89><loc_41></location>9. Pick up the remaining orphan cells. There could be cases, when after applying all the previous post-processing steps, some PDF cells could still remain without any match to predicted cells. However, it is still possible to deduce the correct matching for an orphan PDF cell by mapping its bounding box on the geometry of the grid. This mapping decides if the content of the orphan cell will be appended to an already matched table cell, or a new table cell should be created to match with the orphan.</paragraph>
 <paragraph><location><page_12><loc_50><loc_24><loc_89><loc_28></location>9a. Compute the top and bottom boundary of the horizontal band for each grid row (min/max y coordinates per row).</paragraph>
-<paragraph><location><page_12><loc_50><loc_21><loc_89><loc_23></location>- 9b. Intersect the orphan's bounding box with the row bands, and map the cell to the closest grid row.</paragraph>
+<paragraph><location><page_12><loc_50><loc_21><loc_89><loc_23></location>9b. Intersect the orphan's bounding box with the row bands, and map the cell to the closest grid row.</paragraph>
-<paragraph><location><page_12><loc_50><loc_16><loc_89><loc_20></location>- 9c. Compute the left and right boundary of the vertical band for each grid column (min/max x coordinates per column).</paragraph>
+<paragraph><location><page_12><loc_50><loc_16><loc_89><loc_20></location>9c. Compute the left and right boundary of the vertical band for each grid column (min/max x coordinates per column).</paragraph>
-<paragraph><location><page_12><loc_50><loc_13><loc_89><loc_16></location>- 9d. Intersect the orphan's bounding box with the column bands, and map the cell to the closest grid column.</paragraph>
+<paragraph><location><page_12><loc_50><loc_13><loc_89><loc_16></location>9d. Intersect the orphan's bounding box with the column bands, and map the cell to the closest grid column.</paragraph>
-<paragraph><location><page_12><loc_50><loc_10><loc_89><loc_13></location>- 9e. If the table cell under the identified row and column is not empty, extend its content with the content of the or-</paragraph>
+<paragraph><location><page_12><loc_50><loc_10><loc_89><loc_13></location>9e. If the table cell under the identified row and column is not empty, extend its content with the content of the or-</paragraph>
 <paragraph><location><page_13><loc_8><loc_89><loc_15><loc_91></location>phan cell.</paragraph>
 <paragraph><location><page_13><loc_8><loc_86><loc_47><loc_89></location>9f. Otherwise create a new structural cell and match it wit the orphan cell.</paragraph>
 <paragraph><location><page_13><loc_8><loc_83><loc_47><loc_86></location>Aditional images with examples of TableFormer predictions and post-processing can be found below.</paragraph>
@@ -2221,7 +2221,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- a.",
+"text": "a.",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -2244,7 +2244,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- Red - PDF cells, Green - predicted bounding boxes, Blue - post-processed predictions matched to PDF cells",
+"text": "Red - PDF cells, Green - predicted bounding boxes, Blue - post-processed predictions matched to PDF cells",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -2578,7 +2578,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 , pages 213-229, Cham, 2020. Springer International Publishing. 5",
+"text": "end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 , pages 213-229, Cham, 2020. Springer International Publishing. 5",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -3452,7 +3452,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- and evaluation. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision ECCV 2020 , pages 564-580, Cham, 2020. Springer International Publishing. 2, 3, 7",
+"text": "and evaluation. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision ECCV 2020 , pages 564-580, Cham, 2020. Springer International Publishing. 2, 3, 7",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -4092,7 +4092,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- 3.a. If all IOU scores in a column are below the threshold, discard all predictions (structure and bounding boxes) for that column.",
+"text": "3.a. If all IOU scores in a column are below the threshold, discard all predictions (structure and bounding boxes) for that column.",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -4322,7 +4322,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- 9b. Intersect the orphan's bounding box with the row bands, and map the cell to the closest grid row.",
+"text": "9b. Intersect the orphan's bounding box with the row bands, and map the cell to the closest grid row.",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -4345,7 +4345,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- 9c. Compute the left and right boundary of the vertical band for each grid column (min/max x coordinates per column).",
+"text": "9c. Compute the left and right boundary of the vertical band for each grid column (min/max x coordinates per column).",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -4368,7 +4368,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- 9d. Intersect the orphan's bounding box with the column bands, and map the cell to the closest grid column.",
+"text": "9d. Intersect the orphan's bounding box with the column bands, and map the cell to the closest grid column.",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -4391,7 +4391,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- 9e. If the table cell under the identified row and column is not empty, extend its content with the content of the or-",
+"text": "9e. If the table cell under the identified row and column is not empty, extend its content with the content of the or-",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -216,9 +216,9 @@ Table 4: Results of structure with content retrieved using cell detection on Pub
 | EDD | 91.2 | 85.4 | 88.3 |
 | TableFormer | 95.4 | 90.1 | 93.6 |

-- a.
+a.

-- Red - PDF cells, Green - predicted bounding boxes, Blue - post-processed predictions matched to PDF cells
+Red - PDF cells, Green - predicted bounding boxes, Blue - post-processed predictions matched to PDF cells

 ## Japanese language (previously unseen by TableFormer):

@@ -272,7 +272,7 @@ In this paper, we presented TableFormer an end-to-end transformer based approach

 [1] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-

-- end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 , pages 213-229, Cham, 2020. Springer International Publishing. 5
+end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 , pages 213-229, Cham, 2020. Springer International Publishing. 5

 [2] Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling Mao. Complicated table structure recognition. arXiv preprint arXiv:1908.04729 , 2019. 3

@@ -348,7 +348,7 @@ Computer Vision and Pattern Recognition , pages 658-666, 2019. 6

 [37] Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: Data, model,

-- and evaluation. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision ECCV 2020 , pages 564-580, Cham, 2020. Springer International Publishing. 2, 3, 7
+and evaluation. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision ECCV 2020 , pages 564-580, Cham, 2020. Springer International Publishing. 2, 3, 7

 [38] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. Publaynet: Largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 1015-1022, 2019. 1

@@ -403,7 +403,7 @@ Here is a step-by-step description of the prediction postprocessing:

 3. Use a carefully selected IOU threshold to designate the matches as "good" ones and "bad" ones.

-- 3.a. If all IOU scores in a column are below the threshold, discard all predictions (structure and bounding boxes) for that column.
+3.a. If all IOU scores in a column are below the threshold, discard all predictions (structure and bounding boxes) for that column.

 4. Find the best-fitting content alignment for the predicted cells with good IOU per each column. The alignment of the column can be identified by the following formula:

@@ -421,13 +421,13 @@ where c is one of { left, centroid, right } and x$_{c}$ is the xcoordinate for t

 9a. Compute the top and bottom boundary of the horizontal band for each grid row (min/max y coordinates per row).

-- 9b. Intersect the orphan's bounding box with the row bands, and map the cell to the closest grid row.
+9b. Intersect the orphan's bounding box with the row bands, and map the cell to the closest grid row.

-- 9c. Compute the left and right boundary of the vertical band for each grid column (min/max x coordinates per column).
+9c. Compute the left and right boundary of the vertical band for each grid column (min/max x coordinates per column).

-- 9d. Intersect the orphan's bounding box with the column bands, and map the cell to the closest grid column.
+9d. Intersect the orphan's bounding box with the column bands, and map the cell to the closest grid column.

-- 9e. If the table cell under the identified row and column is not empty, extend its content with the content of the or-
+9e. If the table cell under the identified row and column is not empty, extend its content with the content of the or-

 phan cell.

@@ -42,11 +42,11 @@
 <subtitle-level-1><location><page_6><loc_22><loc_40><loc_43><loc_41></location>4.1 Language Definition</subtitle-level-1>
 <paragraph><location><page_6><loc_22><loc_34><loc_79><loc_38></location>In Figure 3, we illustrate how the OTSL is defined. In essence, the OTSL defines only 5 tokens that directly describe a tabular structure based on an atomic 2D grid.</paragraph>
 <paragraph><location><page_6><loc_24><loc_33><loc_67><loc_34></location>The OTSL vocabulary is comprised of the following tokens:</paragraph>
-<paragraph><location><page_6><loc_23><loc_30><loc_75><loc_31></location>- -"C" cell a new table cell that either has or does not have cell content</paragraph>
+<paragraph><location><page_6><loc_23><loc_30><loc_75><loc_31></location>-"C" cell a new table cell that either has or does not have cell content</paragraph>
-<paragraph><location><page_6><loc_23><loc_27><loc_79><loc_29></location>- -"L" cell left-looking cell , merging with the left neighbor cell to create a span</paragraph>
+<paragraph><location><page_6><loc_23><loc_27><loc_79><loc_29></location>-"L" cell left-looking cell , merging with the left neighbor cell to create a span</paragraph>
-<paragraph><location><page_6><loc_23><loc_24><loc_79><loc_26></location>- -"U" cell up-looking cell , merging with the upper neighbor cell to create a span</paragraph>
+<paragraph><location><page_6><loc_23><loc_24><loc_79><loc_26></location>-"U" cell up-looking cell , merging with the upper neighbor cell to create a span</paragraph>
-<paragraph><location><page_6><loc_23><loc_22><loc_74><loc_23></location>- -"X" cell cross cell , to merge with both left and upper neighbor cells</paragraph>
+<paragraph><location><page_6><loc_23><loc_22><loc_74><loc_23></location>-"X" cell cross cell , to merge with both left and upper neighbor cells</paragraph>
-<paragraph><location><page_6><loc_23><loc_20><loc_54><loc_21></location>- -"NL" new-line , switch to the next row.</paragraph>
+<paragraph><location><page_6><loc_23><loc_20><loc_54><loc_21></location>-"NL" new-line , switch to the next row.</paragraph>
 <paragraph><location><page_6><loc_22><loc_16><loc_79><loc_19></location>A notable attribute of OTSL is that it has the capability of achieving lossless conversion to HTML.</paragraph>
 <figure>
 <location><page_7><loc_27><loc_65><loc_73><loc_79></location>
@@ -58,7 +58,7 @@
 <paragraph><location><page_7><loc_23><loc_54><loc_79><loc_56></location>1. Left-looking cell rule : The left neighbour of an "L" cell must be either another "L" cell or a "C" cell.</paragraph>
 <paragraph><location><page_7><loc_23><loc_51><loc_79><loc_53></location>2. Up-looking cell rule : The upper neighbour of a "U" cell must be either another "U" cell or a "C" cell.</paragraph>
 <subtitle-level-1><location><page_7><loc_23><loc_49><loc_37><loc_50></location>3. Cross cell rule :</subtitle-level-1>
-<paragraph><location><page_7><loc_25><loc_44><loc_79><loc_49></location>- The left neighbour of an "X" cell must be either another "X" cell or a "U" cell, and the upper neighbour of an "X" cell must be either another "X" cell or an "L" cell.</paragraph>
+<paragraph><location><page_7><loc_25><loc_44><loc_79><loc_49></location>The left neighbour of an "X" cell must be either another "X" cell or a "U" cell, and the upper neighbour of an "X" cell must be either another "X" cell or an "L" cell.</paragraph>
 <paragraph><location><page_7><loc_23><loc_43><loc_78><loc_44></location>4. First row rule : Only "L" cells and "C" cells are allowed in the first row.</paragraph>
 <paragraph><location><page_7><loc_23><loc_40><loc_79><loc_43></location>5. First column rule : Only "U" cells and "C" cells are allowed in the first column.</paragraph>
 <paragraph><location><page_7><loc_23><loc_37><loc_79><loc_40></location>6. Rectangular rule : The table representation is always rectangular - all rows must have an equal number of tokens, terminated with "NL" token.</paragraph>
@@ -937,7 +937,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- -\"C\" cell a new table cell that either has or does not have cell content",
+"text": "-\"C\" cell a new table cell that either has or does not have cell content",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -960,7 +960,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- -\"L\" cell left-looking cell , merging with the left neighbor cell to create a span",
+"text": "-\"L\" cell left-looking cell , merging with the left neighbor cell to create a span",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -983,7 +983,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- -\"U\" cell up-looking cell , merging with the upper neighbor cell to create a span",
+"text": "-\"U\" cell up-looking cell , merging with the upper neighbor cell to create a span",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -1006,7 +1006,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- -\"X\" cell cross cell , to merge with both left and upper neighbor cells",
+"text": "-\"X\" cell cross cell , to merge with both left and upper neighbor cells",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -1029,7 +1029,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- -\"NL\" new-line , switch to the next row.",
+"text": "-\"NL\" new-line , switch to the next row.",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -1218,7 +1218,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- The left neighbour of an \"X\" cell must be either another \"X\" cell or a \"U\" cell, and the upper neighbour of an \"X\" cell must be either another \"X\" cell or an \"L\" cell.",
+"text": "The left neighbour of an \"X\" cell must be either another \"X\" cell or a \"U\" cell, and the upper neighbour of an \"X\" cell must be either another \"X\" cell or an \"L\" cell.",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -70,15 +70,15 @@ In Figure 3, we illustrate how the OTSL is defined. In essence, the OTSL defines
 
 The OTSL vocabulary is comprised of the following tokens:
 
-- -"C" cell a new table cell that either has or does not have cell content
+-"C" cell a new table cell that either has or does not have cell content
 
-- -"L" cell left-looking cell , merging with the left neighbor cell to create a span
+-"L" cell left-looking cell , merging with the left neighbor cell to create a span
 
-- -"U" cell up-looking cell , merging with the upper neighbor cell to create a span
+-"U" cell up-looking cell , merging with the upper neighbor cell to create a span
 
-- -"X" cell cross cell , to merge with both left and upper neighbor cells
+-"X" cell cross cell , to merge with both left and upper neighbor cells
 
-- -"NL" new-line , switch to the next row.
+-"NL" new-line , switch to the next row.
 
 A notable attribute of OTSL is that it has the capability of achieving lossless conversion to HTML.
 
@@ -95,7 +95,7 @@ The OTSL representation follows these syntax rules:
 
 ## 3. Cross cell rule :
 
-- The left neighbour of an "X" cell must be either another "X" cell or a "U" cell, and the upper neighbour of an "X" cell must be either another "X" cell or an "L" cell.
+The left neighbour of an "X" cell must be either another "X" cell or a "U" cell, and the upper neighbour of an "X" cell must be either another "X" cell or an "L" cell.
 
 4. First row rule : Only "L" cells and "C" cells are allowed in the first row.
 
@@ -17,10 +17,10 @@
 <location><page_3><loc_23><loc_64><loc_29><loc_66></location>
 </figure>
 <subtitle-level-1><location><page_3><loc_24><loc_57><loc_31><loc_59></location>Highlights</subtitle-level-1>
-<paragraph><location><page_3><loc_24><loc_55><loc_40><loc_56></location>- GLYPH<g115>GLYPH<g3> GLYPH<g40>GLYPH<g81>GLYPH<g75>GLYPH<g68>GLYPH<g81>GLYPH<g70>GLYPH<g72>GLYPH<g3> GLYPH<g87>GLYPH<g75>GLYPH<g72>GLYPH<g3> GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g73>GLYPH<g82>GLYPH<g85>GLYPH<g80>GLYPH<g68>GLYPH<g81>GLYPH<g70>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g92>GLYPH<g82>GLYPH<g88>GLYPH<g85> GLYPH<g3> GLYPH<g71>GLYPH<g68>GLYPH<g87>GLYPH<g68>GLYPH<g69>GLYPH<g68>GLYPH<g86>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g68>GLYPH<g87>GLYPH<g76>GLYPH<g82>GLYPH<g81>GLYPH<g86></paragraph>
+<paragraph><location><page_3><loc_24><loc_55><loc_40><loc_56></location>GLYPH<g115>GLYPH<g3> GLYPH<g40>GLYPH<g81>GLYPH<g75>GLYPH<g68>GLYPH<g81>GLYPH<g70>GLYPH<g72>GLYPH<g3> GLYPH<g87>GLYPH<g75>GLYPH<g72>GLYPH<g3> GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g73>GLYPH<g82>GLYPH<g85>GLYPH<g80>GLYPH<g68>GLYPH<g81>GLYPH<g70>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g92>GLYPH<g82>GLYPH<g88>GLYPH<g85> GLYPH<g3> GLYPH<g71>GLYPH<g68>GLYPH<g87>GLYPH<g68>GLYPH<g69>GLYPH<g68>GLYPH<g86>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g68>GLYPH<g87>GLYPH<g76>GLYPH<g82>GLYPH<g81>GLYPH<g86></paragraph>
-<paragraph><location><page_3><loc_24><loc_51><loc_42><loc_54></location>- GLYPH<g115>GLYPH<g3> GLYPH<g40>GLYPH<g68>GLYPH<g85> GLYPH<g81>GLYPH<g3> GLYPH<g74>GLYPH<g85>GLYPH<g72>GLYPH<g68>GLYPH<g87>GLYPH<g72>GLYPH<g85>GLYPH<g3> GLYPH<g85>GLYPH<g72>GLYPH<g87>GLYPH<g88>GLYPH<g85> GLYPH<g81>GLYPH<g3> GLYPH<g82>GLYPH<g81>GLYPH<g3> GLYPH<g44>GLYPH<g55>GLYPH<g3> GLYPH<g83>GLYPH<g85>GLYPH<g82>GLYPH<g77>GLYPH<g72>GLYPH<g70>GLYPH<g87>GLYPH<g86> GLYPH<g3> GLYPH<g87>GLYPH<g75>GLYPH<g85>GLYPH<g82>GLYPH<g88>GLYPH<g74>GLYPH<g75>GLYPH<g3> GLYPH<g80>GLYPH<g82>GLYPH<g71>GLYPH<g72>GLYPH<g85> GLYPH<g81>GLYPH<g76>GLYPH<g93>GLYPH<g68>GLYPH<g87>GLYPH<g76>GLYPH<g82>GLYPH<g81>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g71>GLYPH<g68>GLYPH<g87>GLYPH<g68>GLYPH<g69>GLYPH<g68>GLYPH<g86>GLYPH<g72>GLYPH<g3> GLYPH<g68>GLYPH<g81>GLYPH<g71> GLYPH<g3> GLYPH<g68>GLYPH<g83>GLYPH<g83>GLYPH<g79>GLYPH<g76>GLYPH<g70>GLYPH<g68>GLYPH<g87>GLYPH<g76>GLYPH<g82>GLYPH<g81>GLYPH<g86></paragraph>
+<paragraph><location><page_3><loc_24><loc_51><loc_42><loc_54></location>GLYPH<g115>GLYPH<g3> GLYPH<g40>GLYPH<g68>GLYPH<g85> GLYPH<g81>GLYPH<g3> GLYPH<g74>GLYPH<g85>GLYPH<g72>GLYPH<g68>GLYPH<g87>GLYPH<g72>GLYPH<g85>GLYPH<g3> GLYPH<g85>GLYPH<g72>GLYPH<g87>GLYPH<g88>GLYPH<g85> GLYPH<g81>GLYPH<g3> GLYPH<g82>GLYPH<g81>GLYPH<g3> GLYPH<g44>GLYPH<g55>GLYPH<g3> GLYPH<g83>GLYPH<g85>GLYPH<g82>GLYPH<g77>GLYPH<g72>GLYPH<g70>GLYPH<g87>GLYPH<g86> GLYPH<g3> GLYPH<g87>GLYPH<g75>GLYPH<g85>GLYPH<g82>GLYPH<g88>GLYPH<g74>GLYPH<g75>GLYPH<g3> GLYPH<g80>GLYPH<g82>GLYPH<g71>GLYPH<g72>GLYPH<g85> GLYPH<g81>GLYPH<g76>GLYPH<g93>GLYPH<g68>GLYPH<g87>GLYPH<g76>GLYPH<g82>GLYPH<g81>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g71>GLYPH<g68>GLYPH<g87>GLYPH<g68>GLYPH<g69>GLYPH<g68>GLYPH<g86>GLYPH<g72>GLYPH<g3> GLYPH<g68>GLYPH<g81>GLYPH<g71> GLYPH<g3> GLYPH<g68>GLYPH<g83>GLYPH<g83>GLYPH<g79>GLYPH<g76>GLYPH<g70>GLYPH<g68>GLYPH<g87>GLYPH<g76>GLYPH<g82>GLYPH<g81>GLYPH<g86></paragraph>
-<paragraph><location><page_3><loc_24><loc_48><loc_41><loc_50></location>- GLYPH<g115>GLYPH<g3> GLYPH<g53>GLYPH<g72>GLYPH<g79>GLYPH<g92>GLYPH<g3> GLYPH<g82>GLYPH<g81>GLYPH<g3> GLYPH<g44>GLYPH<g37>GLYPH<g48>GLYPH<g3> GLYPH<g72>GLYPH<g91>GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g87>GLYPH<g3> GLYPH<g70>GLYPH<g82>GLYPH<g81>GLYPH<g86>GLYPH<g88>GLYPH<g79>GLYPH<g87>GLYPH<g76>GLYPH<g81>GLYPH<g74>GLYPH<g15>GLYPH<g3> GLYPH<g86>GLYPH<g78>GLYPH<g76>GLYPH<g79>GLYPH<g79>GLYPH<g86> GLYPH<g3> GLYPH<g86>GLYPH<g75>GLYPH<g68>GLYPH<g85>GLYPH<g76>GLYPH<g81>GLYPH<g74>GLYPH<g3> GLYPH<g68>GLYPH<g81>GLYPH<g71>GLYPH<g3> GLYPH<g85>GLYPH<g72>GLYPH<g81>GLYPH<g82>GLYPH<g90>GLYPH<g81>GLYPH<g3> GLYPH<g86>GLYPH<g72>GLYPH<g85>GLYPH<g89>GLYPH<g76>GLYPH<g70>GLYPH<g72>GLYPH<g86></paragraph>
+<paragraph><location><page_3><loc_24><loc_48><loc_41><loc_50></location>GLYPH<g115>GLYPH<g3> GLYPH<g53>GLYPH<g72>GLYPH<g79>GLYPH<g92>GLYPH<g3> GLYPH<g82>GLYPH<g81>GLYPH<g3> GLYPH<g44>GLYPH<g37>GLYPH<g48>GLYPH<g3> GLYPH<g72>GLYPH<g91>GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g87>GLYPH<g3> GLYPH<g70>GLYPH<g82>GLYPH<g81>GLYPH<g86>GLYPH<g88>GLYPH<g79>GLYPH<g87>GLYPH<g76>GLYPH<g81>GLYPH<g74>GLYPH<g15>GLYPH<g3> GLYPH<g86>GLYPH<g78>GLYPH<g76>GLYPH<g79>GLYPH<g79>GLYPH<g86> GLYPH<g3> GLYPH<g86>GLYPH<g75>GLYPH<g68>GLYPH<g85>GLYPH<g76>GLYPH<g81>GLYPH<g74>GLYPH<g3> GLYPH<g68>GLYPH<g81>GLYPH<g71>GLYPH<g3> GLYPH<g85>GLYPH<g72>GLYPH<g81>GLYPH<g82>GLYPH<g90>GLYPH<g81>GLYPH<g3> GLYPH<g86>GLYPH<g72>GLYPH<g85>GLYPH<g89>GLYPH<g76>GLYPH<g70>GLYPH<g72>GLYPH<g86></paragraph>
-<paragraph><location><page_3><loc_24><loc_45><loc_38><loc_47></location>- GLYPH<g115>GLYPH<g3> GLYPH<g55> GLYPH<g68>GLYPH<g78>GLYPH<g72>GLYPH<g3> GLYPH<g68>GLYPH<g71>GLYPH<g89>GLYPH<g68>GLYPH<g81>GLYPH<g87>GLYPH<g68>GLYPH<g74>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g68>GLYPH<g70>GLYPH<g70>GLYPH<g72>GLYPH<g86>GLYPH<g86>GLYPH<g3> GLYPH<g87>GLYPH<g82>GLYPH<g3> GLYPH<g68> GLYPH<g3> GLYPH<g90>GLYPH<g82>GLYPH<g85>GLYPH<g79>GLYPH<g71>GLYPH<g90>GLYPH<g76>GLYPH<g71>GLYPH<g72>GLYPH<g3> GLYPH<g86>GLYPH<g82>GLYPH<g88>GLYPH<g85>GLYPH<g70>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g72>GLYPH<g91>GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g87>GLYPH<g76>GLYPH<g86>GLYPH<g72></paragraph>
+<paragraph><location><page_3><loc_24><loc_45><loc_38><loc_47></location>GLYPH<g115>GLYPH<g3> GLYPH<g55> GLYPH<g68>GLYPH<g78>GLYPH<g72>GLYPH<g3> GLYPH<g68>GLYPH<g71>GLYPH<g89>GLYPH<g68>GLYPH<g81>GLYPH<g87>GLYPH<g68>GLYPH<g74>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g68>GLYPH<g70>GLYPH<g70>GLYPH<g72>GLYPH<g86>GLYPH<g86>GLYPH<g3> GLYPH<g87>GLYPH<g82>GLYPH<g3> GLYPH<g68> GLYPH<g3> GLYPH<g90>GLYPH<g82>GLYPH<g85>GLYPH<g79>GLYPH<g71>GLYPH<g90>GLYPH<g76>GLYPH<g71>GLYPH<g72>GLYPH<g3> GLYPH<g86>GLYPH<g82>GLYPH<g88>GLYPH<g85>GLYPH<g70>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g72>GLYPH<g91>GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g87>GLYPH<g76>GLYPH<g86>GLYPH<g72></paragraph>
 <figure>
 <location><page_3><loc_10><loc_13><loc_42><loc_24></location>
 </figure>
@@ -33,15 +33,15 @@
 <paragraph><location><page_3><loc_46><loc_46><loc_82><loc_52></location>With combined experiences and direct access to development groups, we're the experts in IBM DB2® for i. The DB2 for i Center of Excellence (CoE) can help you achieve-perhaps reexamine and exceed-your business requirements and gain more confidence and satisfaction in IBM product data management products and solutions.</paragraph>
 <subtitle-level-1><location><page_3><loc_46><loc_44><loc_71><loc_45></location>Who we are, some of what we do</subtitle-level-1>
 <paragraph><location><page_3><loc_46><loc_42><loc_71><loc_43></location>Global CoE engagements cover topics including:</paragraph>
-<paragraph><location><page_3><loc_46><loc_40><loc_66><loc_41></location>- r Database performance and scalability</paragraph>
+<paragraph><location><page_3><loc_46><loc_40><loc_66><loc_41></location>r Database performance and scalability</paragraph>
-<paragraph><location><page_3><loc_46><loc_39><loc_69><loc_39></location>- r Advanced SQL knowledge and skills transfer</paragraph>
+<paragraph><location><page_3><loc_46><loc_39><loc_69><loc_39></location>r Advanced SQL knowledge and skills transfer</paragraph>
-<paragraph><location><page_3><loc_46><loc_37><loc_64><loc_38></location>- r Business intelligence and analytics</paragraph>
+<paragraph><location><page_3><loc_46><loc_37><loc_64><loc_38></location>r Business intelligence and analytics</paragraph>
-<paragraph><location><page_3><loc_46><loc_36><loc_56><loc_37></location>- r DB2 Web Query</paragraph>
+<paragraph><location><page_3><loc_46><loc_36><loc_56><loc_37></location>r DB2 Web Query</paragraph>
-<paragraph><location><page_3><loc_46><loc_35><loc_82><loc_36></location>- r Query/400 modernization for better reporting and analysis capabilities</paragraph>
+<paragraph><location><page_3><loc_46><loc_35><loc_82><loc_36></location>r Query/400 modernization for better reporting and analysis capabilities</paragraph>
-<paragraph><location><page_3><loc_46><loc_33><loc_69><loc_34></location>- r Database modernization and re-engineering</paragraph>
+<paragraph><location><page_3><loc_46><loc_33><loc_69><loc_34></location>r Database modernization and re-engineering</paragraph>
-<paragraph><location><page_3><loc_46><loc_32><loc_65><loc_33></location>- r Data-centric architecture and design</paragraph>
+<paragraph><location><page_3><loc_46><loc_32><loc_65><loc_33></location>r Data-centric architecture and design</paragraph>
-<paragraph><location><page_3><loc_46><loc_31><loc_76><loc_32></location>- r Extremely large database and overcoming limits to growth</paragraph>
+<paragraph><location><page_3><loc_46><loc_31><loc_76><loc_32></location>r Extremely large database and overcoming limits to growth</paragraph>
-<paragraph><location><page_3><loc_46><loc_30><loc_62><loc_30></location>- r ISV education and enablement</paragraph>
+<paragraph><location><page_3><loc_46><loc_30><loc_62><loc_30></location>r ISV education and enablement</paragraph>
 <subtitle-level-1><location><page_4><loc_11><loc_88><loc_25><loc_91></location>Preface</subtitle-level-1>
 <paragraph><location><page_4><loc_22><loc_75><loc_89><loc_83></location>This IBMfi Redpaper™ publication provides information about the IBM i 7.2 feature of IBM DB2fi for i Row and Column Access Control (RCAC). It offers a broad description of the function and advantages of controlling access to data in a comprehensive and transparent way. This publication helps you understand the capabilities of RCAC and provides examples of defining, creating, and implementing the row permissions and column masks in a relational database environment.</paragraph>
 <paragraph><location><page_4><loc_22><loc_67><loc_89><loc_73></location>This paper is intended for database engineers, data-centric application developers, and security officers who want to design and implement RCAC as a part of their data control and governance policy. A solid background in IBM i object level security, DB2 for i relational database concepts, and SQL is assumed.</paragraph>
@@ -64,15 +64,15 @@
 <paragraph><location><page_5><loc_22><loc_46><loc_89><loc_56></location>Recent news headlines are filled with reports of data breaches and cyber-attacks impacting global businesses of all sizes. The Identity Theft Resource Center$^{1}$ reports that almost 5000 data breaches have occurred since 2005, exposing over 600 million records of data. The financial cost of these data breaches is skyrocketing. Studies from the Ponemon Institute$^{2}$ revealed that the average cost of a data breach increased in 2013 by 15% globally and resulted in a brand equity loss of $9.4 million per attack. The average cost that is incurred for each lost record containing sensitive information increased more than 9% to $145 per record.</paragraph>
 <paragraph><location><page_5><loc_22><loc_38><loc_86><loc_44></location>Businesses must make a serious effort to secure their data and recognize that securing information assets is a cost of doing business. In many parts of the world and in many industries, securing the data is required by law and subject to audits. Data security is no longer an option; it is a requirement.</paragraph>
 <paragraph><location><page_5><loc_22><loc_34><loc_89><loc_37></location>This chapter describes how you can secure and protect data in DB2 for i. The following topics are covered in this chapter:</paragraph>
-<paragraph><location><page_5><loc_22><loc_32><loc_41><loc_33></location>- GLYPH<SM590000> Security fundamentals</paragraph>
+<paragraph><location><page_5><loc_22><loc_32><loc_41><loc_33></location>GLYPH<SM590000> Security fundamentals</paragraph>
-<paragraph><location><page_5><loc_22><loc_30><loc_46><loc_32></location>- GLYPH<SM590000> Current state of IBM i security</paragraph>
+<paragraph><location><page_5><loc_22><loc_30><loc_46><loc_32></location>GLYPH<SM590000> Current state of IBM i security</paragraph>
-<paragraph><location><page_5><loc_22><loc_29><loc_43><loc_30></location>- GLYPH<SM590000> DB2 for i security controls</paragraph>
+<paragraph><location><page_5><loc_22><loc_29><loc_43><loc_30></location>GLYPH<SM590000> DB2 for i security controls</paragraph>
 <subtitle-level-1><location><page_6><loc_11><loc_89><loc_44><loc_91></location>1.1 Security fundamentals</subtitle-level-1>
 <paragraph><location><page_6><loc_22><loc_84><loc_89><loc_87></location>Before reviewing database security techniques, there are two fundamental steps in securing information assets that must be described:</paragraph>
-<paragraph><location><page_6><loc_22><loc_77><loc_89><loc_83></location>- GLYPH<SM590000> First, and most important, is the definition of a company's security policy . Without a security policy, there is no definition of what are acceptable practices for using, accessing, and storing information by who, what, when, where, and how. A security policy should minimally address three things: confidentiality, integrity, and availability.</paragraph>
+<paragraph><location><page_6><loc_22><loc_77><loc_89><loc_83></location>GLYPH<SM590000> First, and most important, is the definition of a company's security policy . Without a security policy, there is no definition of what are acceptable practices for using, accessing, and storing information by who, what, when, where, and how. A security policy should minimally address three things: confidentiality, integrity, and availability.</paragraph>
-<paragraph><location><page_6><loc_25><loc_66><loc_89><loc_76></location>- The monitoring and assessment of adherence to the security policy determines whether your security strategy is working. Often, IBM security consultants are asked to perform security assessments for companies without regard to the security policy. Although these assessments can be useful for observing how the system is defined and how data is being accessed, they cannot determine the level of security without a security policy. Without a security policy, it really is not an assessment as much as it is a baseline for monitoring the changes in the security settings that are captured.</paragraph>
+<paragraph><location><page_6><loc_25><loc_66><loc_89><loc_76></location>The monitoring and assessment of adherence to the security policy determines whether your security strategy is working. Often, IBM security consultants are asked to perform security assessments for companies without regard to the security policy. Although these assessments can be useful for observing how the system is defined and how data is being accessed, they cannot determine the level of security without a security policy. Without a security policy, it really is not an assessment as much as it is a baseline for monitoring the changes in the security settings that are captured.</paragraph>
 <paragraph><location><page_6><loc_25><loc_64><loc_89><loc_65></location>A security policy is what defines whether the system and its settings are secure (or not).</paragraph>
-<paragraph><location><page_6><loc_22><loc_53><loc_89><loc_63></location>- GLYPH<SM590000> The second fundamental in securing data assets is the use of resource security . If implemented properly, resource security prevents data breaches from both internal and external intrusions. Resource security controls are closely tied to the part of the security policy that defines who should have access to what information resources. A hacker might be good enough to get through your company firewalls and sift his way through to your system, but if they do not have explicit access to your database, the hacker cannot compromise your information assets.</paragraph>
+<paragraph><location><page_6><loc_22><loc_53><loc_89><loc_63></location>GLYPH<SM590000> The second fundamental in securing data assets is the use of resource security . If implemented properly, resource security prevents data breaches from both internal and external intrusions. Resource security controls are closely tied to the part of the security policy that defines who should have access to what information resources. A hacker might be good enough to get through your company firewalls and sift his way through to your system, but if they do not have explicit access to your database, the hacker cannot compromise your information assets.</paragraph>
 <paragraph><location><page_6><loc_22><loc_48><loc_87><loc_51></location>With your eyes now open to the importance of securing information assets, the rest of this chapter reviews the methods that are available for securing database resources on IBM i.</paragraph>
 <subtitle-level-1><location><page_6><loc_11><loc_43><loc_53><loc_45></location>1.2 Current state of IBM i security</subtitle-level-1>
 <paragraph><location><page_6><loc_22><loc_35><loc_89><loc_41></location>Because of the inherently secure nature of IBM i, many clients rely on the default system settings to protect their business data that is stored in DB2 for i. In most cases, this means no data protection because the default setting for the Create default public authority (QCRTAUT) system value is *CHANGE.</paragraph>
@@ -90,9 +90,9 @@
 <caption><location><page_7><loc_22><loc_12><loc_52><loc_13></location>Figure 1-2 Existing row and column controls</caption>
 <subtitle-level-1><location><page_8><loc_11><loc_89><loc_55><loc_91></location>2.1.6 Change Function Usage CL command</subtitle-level-1>
 <paragraph><location><page_8><loc_22><loc_87><loc_89><loc_88></location>The following CL commands can be used to work with, display, or change function usage IDs:</paragraph>
-<paragraph><location><page_8><loc_22><loc_84><loc_49><loc_86></location>- GLYPH<SM590000> Work Function Usage ( WRKFCNUSG )</paragraph>
+<paragraph><location><page_8><loc_22><loc_84><loc_49><loc_86></location>GLYPH<SM590000> Work Function Usage ( WRKFCNUSG )</paragraph>
-<paragraph><location><page_8><loc_22><loc_83><loc_51><loc_84></location>- GLYPH<SM590000> Change Function Usage ( CHGFCNUSG )</paragraph>
+<paragraph><location><page_8><loc_22><loc_83><loc_51><loc_84></location>GLYPH<SM590000> Change Function Usage ( CHGFCNUSG )</paragraph>
-<paragraph><location><page_8><loc_22><loc_81><loc_51><loc_83></location>- GLYPH<SM590000> Display Function Usage ( DSPFCNUSG )</paragraph>
+<paragraph><location><page_8><loc_22><loc_81><loc_51><loc_83></location>GLYPH<SM590000> Display Function Usage ( DSPFCNUSG )</paragraph>
 <paragraph><location><page_8><loc_22><loc_77><loc_84><loc_80></location>For example, the following CHGFCNUSG command shows granting authorization to user HBEDOYA to administer and manage RCAC rules:</paragraph>
 <paragraph><location><page_8><loc_22><loc_75><loc_72><loc_76></location>CHGFCNUSG FCNID(QIBM_DB_SECADM) USER(HBEDOYA) USAGE(*ALLOWED)</paragraph>
 <subtitle-level-1><location><page_8><loc_11><loc_71><loc_89><loc_72></location>2.1.7 Verifying function usage IDs for RCAC with the FUNCTION_USAGE view</subtitle-level-1>
@@ -165,11 +165,11 @@
 </table>
 <caption><location><page_11><loc_22><loc_87><loc_61><loc_88></location>Table 3-1 Special registers and their corresponding values</caption>
 <paragraph><location><page_11><loc_22><loc_70><loc_88><loc_73></location>Figure 3-5 shows the difference in the special register values when an adopted authority is used:</paragraph>
-<paragraph><location><page_11><loc_22><loc_68><loc_67><loc_69></location>- GLYPH<SM590000> A user connects to the server using the user profile ALICE.</paragraph>
+<paragraph><location><page_11><loc_22><loc_68><loc_67><loc_69></location>GLYPH<SM590000> A user connects to the server using the user profile ALICE.</paragraph>
-<paragraph><location><page_11><loc_22><loc_66><loc_74><loc_67></location>- GLYPH<SM590000> USER and CURRENT USER initially have the same value of ALICE.</paragraph>
+<paragraph><location><page_11><loc_22><loc_66><loc_74><loc_67></location>GLYPH<SM590000> USER and CURRENT USER initially have the same value of ALICE.</paragraph>
-<paragraph><location><page_11><loc_22><loc_62><loc_88><loc_65></location>- GLYPH<SM590000> ALICE calls an SQL procedure that is named proc1, which is owned by user profile JOE and was created to adopt JOE's authority when it is called.</paragraph>
+<paragraph><location><page_11><loc_22><loc_62><loc_88><loc_65></location>GLYPH<SM590000> ALICE calls an SQL procedure that is named proc1, which is owned by user profile JOE and was created to adopt JOE's authority when it is called.</paragraph>
-<paragraph><location><page_11><loc_22><loc_57><loc_89><loc_61></location>- GLYPH<SM590000> While the procedure is running, the special register USER still contains the value of ALICE because it excludes any adopted authority. The special register CURRENT USER contains the value of JOE because it includes any adopted authority.</paragraph>
+<paragraph><location><page_11><loc_22><loc_57><loc_89><loc_61></location>GLYPH<SM590000> While the procedure is running, the special register USER still contains the value of ALICE because it excludes any adopted authority. The special register CURRENT USER contains the value of JOE because it includes any adopted authority.</paragraph>
-<paragraph><location><page_11><loc_22><loc_53><loc_89><loc_56></location>- GLYPH<SM590000> When proc1 ends, the session reverts to its original state with both USER and CURRENT USER having the value of ALICE.</paragraph>
+<paragraph><location><page_11><loc_22><loc_53><loc_89><loc_56></location>GLYPH<SM590000> When proc1 ends, the session reverts to its original state with both USER and CURRENT USER having the value of ALICE.</paragraph>
 <figure>
 <location><page_11><loc_22><loc_25><loc_49><loc_51></location>
 <caption>Figure 3-5 Special registers and adopted authority</caption>
@@ -206,11 +206,11 @@
 <paragraph><location><page_13><loc_22><loc_88><loc_26><loc_89></location>CASE</paragraph>
 <paragraph><location><page_13><loc_22><loc_67><loc_85><loc_88></location>WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR', 'EMP' ) = 1 THEN EMPLOYEES . DATE_OF_BIRTH WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER = EMPLOYEES . USER_ID THEN EMPLOYEES . DATE_OF_BIRTH WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER <> EMPLOYEES . USER_ID THEN ( 9999 || '-' || MONTH ( EMPLOYEES . DATE_OF_BIRTH ) || '-' || DAY (EMPLOYEES.DATE_OF_BIRTH )) ELSE NULL END ENABLE ;</paragraph>
 <paragraph><location><page_13><loc_22><loc_63><loc_89><loc_65></location>2. The other column to mask in this example is the TAX_ID information. In this example, the rules to enforce include the following ones:</paragraph>
-<paragraph><location><page_13><loc_25><loc_60><loc_77><loc_62></location>- -Human Resources can see the unmasked TAX_ID of the employees.</paragraph>
+<paragraph><location><page_13><loc_25><loc_60><loc_77><loc_62></location>-Human Resources can see the unmasked TAX_ID of the employees.</paragraph>
-<paragraph><location><page_13><loc_25><loc_58><loc_66><loc_59></location>- -Employees can see only their own unmasked TAX_ID.</paragraph>
+<paragraph><location><page_13><loc_25><loc_58><loc_66><loc_59></location>-Employees can see only their own unmasked TAX_ID.</paragraph>
-<paragraph><location><page_13><loc_25><loc_55><loc_89><loc_57></location>- -Managers see a masked version of TAX_ID with the first five characters replaced with the X character (for example, XXX-XX-1234).</paragraph>
+<paragraph><location><page_13><loc_25><loc_55><loc_89><loc_57></location>-Managers see a masked version of TAX_ID with the first five characters replaced with the X character (for example, XXX-XX-1234).</paragraph>
-<paragraph><location><page_13><loc_25><loc_52><loc_87><loc_54></location>- -Any other person sees the entire TAX_ID as masked, for example, XXX-XX-XXXX.</paragraph>
+<paragraph><location><page_13><loc_25><loc_52><loc_87><loc_54></location>-Any other person sees the entire TAX_ID as masked, for example, XXX-XX-XXXX.</paragraph>
-<paragraph><location><page_13><loc_25><loc_50><loc_87><loc_51></location>- To implement this column mask, run the SQL statement that is shown in Example 3-9.</paragraph>
+<paragraph><location><page_13><loc_25><loc_50><loc_87><loc_51></location>To implement this column mask, run the SQL statement that is shown in Example 3-9.</paragraph>
 <paragraph><location><page_13><loc_22><loc_14><loc_86><loc_47></location>CREATE MASK HR_SCHEMA.MASK_TAX_ID_ON_EMPLOYEES ON HR_SCHEMA.EMPLOYEES AS EMPLOYEES FOR COLUMN TAX_ID RETURN CASE WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR' ) = 1 THEN EMPLOYEES . TAX_ID WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER = EMPLOYEES . USER_ID THEN EMPLOYEES . TAX_ID WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER <> EMPLOYEES . USER_ID THEN ( 'XXX-XX-' CONCAT QSYS2 . SUBSTR ( EMPLOYEES . TAX_ID , 8 , 4 ) ) WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'EMP' ) = 1 THEN EMPLOYEES . TAX_ID ELSE 'XXX-XX-XXXX' END ENABLE ;</paragraph>
 <caption><location><page_13><loc_22><loc_48><loc_58><loc_49></location>Example 3-9 Creating a mask on the TAX_ID column</caption>
 <paragraph><location><page_14><loc_22><loc_90><loc_74><loc_91></location>3. Figure 3-10 shows the masks that are created in the HR_SCHEMA.</paragraph>
@@ -223,8 +223,8 @@
 <paragraph><location><page_14><loc_22><loc_67><loc_89><loc_71></location>Now that you have created the row permission and the two column masks, RCAC must be activated. The row permission and the two column masks are enabled (last clause in the scripts), but now you must activate RCAC on the table. To do so, complete the following steps:</paragraph>
 <paragraph><location><page_14><loc_22><loc_65><loc_67><loc_66></location>1. Run the SQL statements that are shown in Example 3-10.</paragraph>
 <subtitle-level-1><location><page_14><loc_22><loc_62><loc_61><loc_63></location>Example 3-10 Activating RCAC on the EMPLOYEES table</subtitle-level-1>
-<paragraph><location><page_14><loc_22><loc_60><loc_62><loc_61></location>- /* Active Row Access Control (permissions) */</paragraph>
+<paragraph><location><page_14><loc_22><loc_60><loc_62><loc_61></location>/* Active Row Access Control (permissions) */</paragraph>
-<paragraph><location><page_14><loc_22><loc_58><loc_58><loc_60></location>- /* Active Column Access Control (masks)</paragraph>
+<paragraph><location><page_14><loc_22><loc_58><loc_58><loc_60></location>/* Active Column Access Control (masks)</paragraph>
 <paragraph><location><page_14><loc_60><loc_58><loc_62><loc_60></location>*/</paragraph>
 <paragraph><location><page_14><loc_22><loc_57><loc_48><loc_58></location>ALTER TABLE HR_SCHEMA.EMPLOYEES</paragraph>
 <paragraph><location><page_14><loc_22><loc_55><loc_44><loc_56></location>ACTIVATE ROW ACCESS CONTROL</paragraph>
@@ -305,7 +305,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- GLYPH<g115>GLYPH<g3> GLYPH<g40>GLYPH<g81>GLYPH<g75>GLYPH<g68>GLYPH<g81>GLYPH<g70>GLYPH<g72>GLYPH<g3> GLYPH<g87>GLYPH<g75>GLYPH<g72>GLYPH<g3> GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g73>GLYPH<g82>GLYPH<g85>GLYPH<g80>GLYPH<g68>GLYPH<g81>GLYPH<g70>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g92>GLYPH<g82>GLYPH<g88>GLYPH<g85> GLYPH<g3> GLYPH<g71>GLYPH<g68>GLYPH<g87>GLYPH<g68>GLYPH<g69>GLYPH<g68>GLYPH<g86>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g68>GLYPH<g87>GLYPH<g76>GLYPH<g82>GLYPH<g81>GLYPH<g86>",
+"text": "GLYPH<g115>GLYPH<g3> GLYPH<g40>GLYPH<g81>GLYPH<g75>GLYPH<g68>GLYPH<g81>GLYPH<g70>GLYPH<g72>GLYPH<g3> GLYPH<g87>GLYPH<g75>GLYPH<g72>GLYPH<g3> GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g73>GLYPH<g82>GLYPH<g85>GLYPH<g80>GLYPH<g68>GLYPH<g81>GLYPH<g70>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g92>GLYPH<g82>GLYPH<g88>GLYPH<g85> GLYPH<g3> GLYPH<g71>GLYPH<g68>GLYPH<g87>GLYPH<g68>GLYPH<g69>GLYPH<g68>GLYPH<g86>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g68>GLYPH<g87>GLYPH<g76>GLYPH<g82>GLYPH<g81>GLYPH<g86>",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -328,7 +328,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- GLYPH<g115>GLYPH<g3> GLYPH<g40>GLYPH<g68>GLYPH<g85> GLYPH<g81>GLYPH<g3> GLYPH<g74>GLYPH<g85>GLYPH<g72>GLYPH<g68>GLYPH<g87>GLYPH<g72>GLYPH<g85>GLYPH<g3> GLYPH<g85>GLYPH<g72>GLYPH<g87>GLYPH<g88>GLYPH<g85> GLYPH<g81>GLYPH<g3> GLYPH<g82>GLYPH<g81>GLYPH<g3> GLYPH<g44>GLYPH<g55>GLYPH<g3> GLYPH<g83>GLYPH<g85>GLYPH<g82>GLYPH<g77>GLYPH<g72>GLYPH<g70>GLYPH<g87>GLYPH<g86> GLYPH<g3> GLYPH<g87>GLYPH<g75>GLYPH<g85>GLYPH<g82>GLYPH<g88>GLYPH<g74>GLYPH<g75>GLYPH<g3> GLYPH<g80>GLYPH<g82>GLYPH<g71>GLYPH<g72>GLYPH<g85> GLYPH<g81>GLYPH<g76>GLYPH<g93>GLYPH<g68>GLYPH<g87>GLYPH<g76>GLYPH<g82>GLYPH<g81>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g71>GLYPH<g68>GLYPH<g87>GLYPH<g68>GLYPH<g69>GLYPH<g68>GLYPH<g86>GLYPH<g72>GLYPH<g3> GLYPH<g68>GLYPH<g81>GLYPH<g71> GLYPH<g3> GLYPH<g68>GLYPH<g83>GLYPH<g83>GLYPH<g79>GLYPH<g76>GLYPH<g70>GLYPH<g68>GLYPH<g87>GLYPH<g76>GLYPH<g82>GLYPH<g81>GLYPH<g86>",
+"text": "GLYPH<g115>GLYPH<g3> GLYPH<g40>GLYPH<g68>GLYPH<g85> GLYPH<g81>GLYPH<g3> GLYPH<g74>GLYPH<g85>GLYPH<g72>GLYPH<g68>GLYPH<g87>GLYPH<g72>GLYPH<g85>GLYPH<g3> GLYPH<g85>GLYPH<g72>GLYPH<g87>GLYPH<g88>GLYPH<g85> GLYPH<g81>GLYPH<g3> GLYPH<g82>GLYPH<g81>GLYPH<g3> GLYPH<g44>GLYPH<g55>GLYPH<g3> GLYPH<g83>GLYPH<g85>GLYPH<g82>GLYPH<g77>GLYPH<g72>GLYPH<g70>GLYPH<g87>GLYPH<g86> GLYPH<g3> GLYPH<g87>GLYPH<g75>GLYPH<g85>GLYPH<g82>GLYPH<g88>GLYPH<g74>GLYPH<g75>GLYPH<g3> GLYPH<g80>GLYPH<g82>GLYPH<g71>GLYPH<g72>GLYPH<g85> GLYPH<g81>GLYPH<g76>GLYPH<g93>GLYPH<g68>GLYPH<g87>GLYPH<g76>GLYPH<g82>GLYPH<g81>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g71>GLYPH<g68>GLYPH<g87>GLYPH<g68>GLYPH<g69>GLYPH<g68>GLYPH<g86>GLYPH<g72>GLYPH<g3> GLYPH<g68>GLYPH<g81>GLYPH<g71> GLYPH<g3> GLYPH<g68>GLYPH<g83>GLYPH<g83>GLYPH<g79>GLYPH<g76>GLYPH<g70>GLYPH<g68>GLYPH<g87>GLYPH<g76>GLYPH<g82>GLYPH<g81>GLYPH<g86>",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -351,7 +351,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- GLYPH<g115>GLYPH<g3> GLYPH<g53>GLYPH<g72>GLYPH<g79>GLYPH<g92>GLYPH<g3> GLYPH<g82>GLYPH<g81>GLYPH<g3> GLYPH<g44>GLYPH<g37>GLYPH<g48>GLYPH<g3> GLYPH<g72>GLYPH<g91>GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g87>GLYPH<g3> GLYPH<g70>GLYPH<g82>GLYPH<g81>GLYPH<g86>GLYPH<g88>GLYPH<g79>GLYPH<g87>GLYPH<g76>GLYPH<g81>GLYPH<g74>GLYPH<g15>GLYPH<g3> GLYPH<g86>GLYPH<g78>GLYPH<g76>GLYPH<g79>GLYPH<g79>GLYPH<g86> GLYPH<g3> GLYPH<g86>GLYPH<g75>GLYPH<g68>GLYPH<g85>GLYPH<g76>GLYPH<g81>GLYPH<g74>GLYPH<g3> GLYPH<g68>GLYPH<g81>GLYPH<g71>GLYPH<g3> GLYPH<g85>GLYPH<g72>GLYPH<g81>GLYPH<g82>GLYPH<g90>GLYPH<g81>GLYPH<g3> GLYPH<g86>GLYPH<g72>GLYPH<g85>GLYPH<g89>GLYPH<g76>GLYPH<g70>GLYPH<g72>GLYPH<g86>",
+"text": "GLYPH<g115>GLYPH<g3> GLYPH<g53>GLYPH<g72>GLYPH<g79>GLYPH<g92>GLYPH<g3> GLYPH<g82>GLYPH<g81>GLYPH<g3> GLYPH<g44>GLYPH<g37>GLYPH<g48>GLYPH<g3> GLYPH<g72>GLYPH<g91>GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g87>GLYPH<g3> GLYPH<g70>GLYPH<g82>GLYPH<g81>GLYPH<g86>GLYPH<g88>GLYPH<g79>GLYPH<g87>GLYPH<g76>GLYPH<g81>GLYPH<g74>GLYPH<g15>GLYPH<g3> GLYPH<g86>GLYPH<g78>GLYPH<g76>GLYPH<g79>GLYPH<g79>GLYPH<g86> GLYPH<g3> GLYPH<g86>GLYPH<g75>GLYPH<g68>GLYPH<g85>GLYPH<g76>GLYPH<g81>GLYPH<g74>GLYPH<g3> GLYPH<g68>GLYPH<g81>GLYPH<g71>GLYPH<g3> GLYPH<g85>GLYPH<g72>GLYPH<g81>GLYPH<g82>GLYPH<g90>GLYPH<g81>GLYPH<g3> GLYPH<g86>GLYPH<g72>GLYPH<g85>GLYPH<g89>GLYPH<g76>GLYPH<g70>GLYPH<g72>GLYPH<g86>",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -374,7 +374,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- GLYPH<g115>GLYPH<g3> GLYPH<g55> GLYPH<g68>GLYPH<g78>GLYPH<g72>GLYPH<g3> GLYPH<g68>GLYPH<g71>GLYPH<g89>GLYPH<g68>GLYPH<g81>GLYPH<g87>GLYPH<g68>GLYPH<g74>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g68>GLYPH<g70>GLYPH<g70>GLYPH<g72>GLYPH<g86>GLYPH<g86>GLYPH<g3> GLYPH<g87>GLYPH<g82>GLYPH<g3> GLYPH<g68> GLYPH<g3> GLYPH<g90>GLYPH<g82>GLYPH<g85>GLYPH<g79>GLYPH<g71>GLYPH<g90>GLYPH<g76>GLYPH<g71>GLYPH<g72>GLYPH<g3> GLYPH<g86>GLYPH<g82>GLYPH<g88>GLYPH<g85>GLYPH<g70>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g72>GLYPH<g91>GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g87>GLYPH<g76>GLYPH<g86>GLYPH<g72>",
+"text": "GLYPH<g115>GLYPH<g3> GLYPH<g55> GLYPH<g68>GLYPH<g78>GLYPH<g72>GLYPH<g3> GLYPH<g68>GLYPH<g71>GLYPH<g89>GLYPH<g68>GLYPH<g81>GLYPH<g87>GLYPH<g68>GLYPH<g74>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g68>GLYPH<g70>GLYPH<g70>GLYPH<g72>GLYPH<g86>GLYPH<g86>GLYPH<g3> GLYPH<g87>GLYPH<g82>GLYPH<g3> GLYPH<g68> GLYPH<g3> GLYPH<g90>GLYPH<g82>GLYPH<g85>GLYPH<g79>GLYPH<g71>GLYPH<g90>GLYPH<g76>GLYPH<g71>GLYPH<g72>GLYPH<g3> GLYPH<g86>GLYPH<g82>GLYPH<g88>GLYPH<g85>GLYPH<g70>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g72>GLYPH<g91>GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g87>GLYPH<g76>GLYPH<g86>GLYPH<g72>",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -609,7 +609,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- r Database performance and scalability",
+"text": "r Database performance and scalability",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -632,7 +632,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- r Advanced SQL knowledge and skills transfer",
+"text": "r Advanced SQL knowledge and skills transfer",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -655,7 +655,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- r Business intelligence and analytics",
+"text": "r Business intelligence and analytics",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -678,7 +678,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- r DB2 Web Query",
+"text": "r DB2 Web Query",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -701,7 +701,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- r Query/400 modernization for better reporting and analysis capabilities",
+"text": "r Query/400 modernization for better reporting and analysis capabilities",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -724,7 +724,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- r Database modernization and re-engineering",
+"text": "r Database modernization and re-engineering",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -747,7 +747,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- r Data-centric architecture and design",
+"text": "r Data-centric architecture and design",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -770,7 +770,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- r Extremely large database and overcoming limits to growth",
+"text": "r Extremely large database and overcoming limits to growth",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -793,7 +793,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- r ISV education and enablement",
+"text": "r ISV education and enablement",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -1130,7 +1130,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- GLYPH<SM590000> Security fundamentals",
+"text": "GLYPH<SM590000> Security fundamentals",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -1153,7 +1153,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- GLYPH<SM590000> Current state of IBM i security",
+"text": "GLYPH<SM590000> Current state of IBM i security",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -1176,7 +1176,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- GLYPH<SM590000> DB2 for i security controls",
+"text": "GLYPH<SM590000> DB2 for i security controls",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -1291,7 +1291,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- GLYPH<SM590000> First, and most important, is the definition of a company's security policy . Without a security policy, there is no definition of what are acceptable practices for using, accessing, and storing information by who, what, when, where, and how. A security policy should minimally address three things: confidentiality, integrity, and availability.",
+"text": "GLYPH<SM590000> First, and most important, is the definition of a company's security policy . Without a security policy, there is no definition of what are acceptable practices for using, accessing, and storing information by who, what, when, where, and how. A security policy should minimally address three things: confidentiality, integrity, and availability.",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -1314,7 +1314,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- The monitoring and assessment of adherence to the security policy determines whether your security strategy is working. Often, IBM security consultants are asked to perform security assessments for companies without regard to the security policy. Although these assessments can be useful for observing how the system is defined and how data is being accessed, they cannot determine the level of security without a security policy. Without a security policy, it really is not an assessment as much as it is a baseline for monitoring the changes in the security settings that are captured.",
+"text": "The monitoring and assessment of adherence to the security policy determines whether your security strategy is working. Often, IBM security consultants are asked to perform security assessments for companies without regard to the security policy. Although these assessments can be useful for observing how the system is defined and how data is being accessed, they cannot determine the level of security without a security policy. Without a security policy, it really is not an assessment as much as it is a baseline for monitoring the changes in the security settings that are captured.",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -1360,7 +1360,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- GLYPH<SM590000> The second fundamental in securing data assets is the use of resource security . If implemented properly, resource security prevents data breaches from both internal and external intrusions. Resource security controls are closely tied to the part of the security policy that defines who should have access to what information resources. A hacker might be good enough to get through your company firewalls and sift his way through to your system, but if they do not have explicit access to your database, the hacker cannot compromise your information assets.",
+"text": "GLYPH<SM590000> The second fundamental in securing data assets is the use of resource security . If implemented properly, resource security prevents data breaches from both internal and external intrusions. Resource security controls are closely tied to the part of the security policy that defines who should have access to what information resources. A hacker might be good enough to get through your company firewalls and sift his way through to your system, but if they do not have explicit access to your database, the hacker cannot compromise your information assets.",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -1687,7 +1687,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- GLYPH<SM590000> Work Function Usage ( WRKFCNUSG )",
+"text": "GLYPH<SM590000> Work Function Usage ( WRKFCNUSG )",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -1710,7 +1710,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- GLYPH<SM590000> Change Function Usage ( CHGFCNUSG )",
+"text": "GLYPH<SM590000> Change Function Usage ( CHGFCNUSG )",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -1733,7 +1733,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- GLYPH<SM590000> Display Function Usage ( DSPFCNUSG )",
+"text": "GLYPH<SM590000> Display Function Usage ( DSPFCNUSG )",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -2558,7 +2558,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- GLYPH<SM590000> A user connects to the server using the user profile ALICE.",
+"text": "GLYPH<SM590000> A user connects to the server using the user profile ALICE.",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -2581,7 +2581,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- GLYPH<SM590000> USER and CURRENT USER initially have the same value of ALICE.",
+"text": "GLYPH<SM590000> USER and CURRENT USER initially have the same value of ALICE.",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -2604,7 +2604,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- GLYPH<SM590000> ALICE calls an SQL procedure that is named proc1, which is owned by user profile JOE and was created to adopt JOE's authority when it is called.",
+"text": "GLYPH<SM590000> ALICE calls an SQL procedure that is named proc1, which is owned by user profile JOE and was created to adopt JOE's authority when it is called.",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -2627,7 +2627,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- GLYPH<SM590000> While the procedure is running, the special register USER still contains the value of ALICE because it excludes any adopted authority. The special register CURRENT USER contains the value of JOE because it includes any adopted authority.",
+"text": "GLYPH<SM590000> While the procedure is running, the special register USER still contains the value of ALICE because it excludes any adopted authority. The special register CURRENT USER contains the value of JOE because it includes any adopted authority.",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@@ -2650,7 +2650,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- GLYPH<SM590000> When proc1 ends, the session reverts to its original state with both USER and CURRENT USER having the value of ALICE.",
+"text": "GLYPH<SM590000> When proc1 ends, the session reverts to its original state with both USER and CURRENT USER having the value of ALICE.",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@ -3097,7 +3097,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- -Human Resources can see the unmasked TAX_ID of the employees.",
+"text": "-Human Resources can see the unmasked TAX_ID of the employees.",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@ -3120,7 +3120,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- -Employees can see only their own unmasked TAX_ID.",
+"text": "-Employees can see only their own unmasked TAX_ID.",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@ -3143,7 +3143,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- -Managers see a masked version of TAX_ID with the first five characters replaced with the X character (for example, XXX-XX-1234).",
+"text": "-Managers see a masked version of TAX_ID with the first five characters replaced with the X character (for example, XXX-XX-1234).",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@ -3166,7 +3166,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- -Any other person sees the entire TAX_ID as masked, for example, XXX-XX-XXXX.",
+"text": "-Any other person sees the entire TAX_ID as masked, for example, XXX-XX-XXXX.",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@ -3189,7 +3189,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- To implement this column mask, run the SQL statement that is shown in Example 3-9.",
+"text": "To implement this column mask, run the SQL statement that is shown in Example 3-9.",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@ -3401,7 +3401,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- /* Active Row Access Control (permissions) */",
+"text": "/* Active Row Access Control (permissions) */",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
@ -3424,7 +3424,7 @@
 "__ref_s3_data": null
 }
 ],
-"text": "- /* Active Column Access Control (masks)",
+"text": "/* Active Column Access Control (masks)",
 "type": "paragraph",
 "payload": null,
 "name": "List-item",
|
@ -18,13 +18,13 @@ Solution Brief IBM Systems Lab Services and Training
|
|||||||
|
|
||||||
## Highlights
|
## Highlights
|
||||||
|
|
||||||
- GLYPH<g115>GLYPH<g3> GLYPH<g40>GLYPH<g81>GLYPH<g75>GLYPH<g68>GLYPH<g81>GLYPH<g70>GLYPH<g72>GLYPH<g3> GLYPH<g87>GLYPH<g75>GLYPH<g72>GLYPH<g3> GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g73>GLYPH<g82>GLYPH<g85>GLYPH<g80>GLYPH<g68>GLYPH<g81>GLYPH<g70>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g92>GLYPH<g82>GLYPH<g88>GLYPH<g85> GLYPH<g3> GLYPH<g71>GLYPH<g68>GLYPH<g87>GLYPH<g68>GLYPH<g69>GLYPH<g68>GLYPH<g86>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g68>GLYPH<g87>GLYPH<g76>GLYPH<g82>GLYPH<g81>GLYPH<g86>
|
GLYPH<g115>GLYPH<g3> GLYPH<g40>GLYPH<g81>GLYPH<g75>GLYPH<g68>GLYPH<g81>GLYPH<g70>GLYPH<g72>GLYPH<g3> GLYPH<g87>GLYPH<g75>GLYPH<g72>GLYPH<g3> GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g73>GLYPH<g82>GLYPH<g85>GLYPH<g80>GLYPH<g68>GLYPH<g81>GLYPH<g70>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g92>GLYPH<g82>GLYPH<g88>GLYPH<g85> GLYPH<g3> GLYPH<g71>GLYPH<g68>GLYPH<g87>GLYPH<g68>GLYPH<g69>GLYPH<g68>GLYPH<g86>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g68>GLYPH<g87>GLYPH<g76>GLYPH<g82>GLYPH<g81>GLYPH<g86>
|
||||||
|
|
||||||
- GLYPH<g115>GLYPH<g3> GLYPH<g40>GLYPH<g68>GLYPH<g85> GLYPH<g81>GLYPH<g3> GLYPH<g74>GLYPH<g85>GLYPH<g72>GLYPH<g68>GLYPH<g87>GLYPH<g72>GLYPH<g85>GLYPH<g3> GLYPH<g85>GLYPH<g72>GLYPH<g87>GLYPH<g88>GLYPH<g85> GLYPH<g81>GLYPH<g3> GLYPH<g82>GLYPH<g81>GLYPH<g3> GLYPH<g44>GLYPH<g55>GLYPH<g3> GLYPH<g83>GLYPH<g85>GLYPH<g82>GLYPH<g77>GLYPH<g72>GLYPH<g70>GLYPH<g87>GLYPH<g86> GLYPH<g3> GLYPH<g87>GLYPH<g75>GLYPH<g85>GLYPH<g82>GLYPH<g88>GLYPH<g74>GLYPH<g75>GLYPH<g3> GLYPH<g80>GLYPH<g82>GLYPH<g71>GLYPH<g72>GLYPH<g85> GLYPH<g81>GLYPH<g76>GLYPH<g93>GLYPH<g68>GLYPH<g87>GLYPH<g76>GLYPH<g82>GLYPH<g81>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g71>GLYPH<g68>GLYPH<g87>GLYPH<g68>GLYPH<g69>GLYPH<g68>GLYPH<g86>GLYPH<g72>GLYPH<g3> GLYPH<g68>GLYPH<g81>GLYPH<g71> GLYPH<g3> GLYPH<g68>GLYPH<g83>GLYPH<g83>GLYPH<g79>GLYPH<g76>GLYPH<g70>GLYPH<g68>GLYPH<g87>GLYPH<g76>GLYPH<g82>GLYPH<g81>GLYPH<g86>
|
GLYPH<g115>GLYPH<g3> GLYPH<g40>GLYPH<g68>GLYPH<g85> GLYPH<g81>GLYPH<g3> GLYPH<g74>GLYPH<g85>GLYPH<g72>GLYPH<g68>GLYPH<g87>GLYPH<g72>GLYPH<g85>GLYPH<g3> GLYPH<g85>GLYPH<g72>GLYPH<g87>GLYPH<g88>GLYPH<g85> GLYPH<g81>GLYPH<g3> GLYPH<g82>GLYPH<g81>GLYPH<g3> GLYPH<g44>GLYPH<g55>GLYPH<g3> GLYPH<g83>GLYPH<g85>GLYPH<g82>GLYPH<g77>GLYPH<g72>GLYPH<g70>GLYPH<g87>GLYPH<g86> GLYPH<g3> GLYPH<g87>GLYPH<g75>GLYPH<g85>GLYPH<g82>GLYPH<g88>GLYPH<g74>GLYPH<g75>GLYPH<g3> GLYPH<g80>GLYPH<g82>GLYPH<g71>GLYPH<g72>GLYPH<g85> GLYPH<g81>GLYPH<g76>GLYPH<g93>GLYPH<g68>GLYPH<g87>GLYPH<g76>GLYPH<g82>GLYPH<g81>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g71>GLYPH<g68>GLYPH<g87>GLYPH<g68>GLYPH<g69>GLYPH<g68>GLYPH<g86>GLYPH<g72>GLYPH<g3> GLYPH<g68>GLYPH<g81>GLYPH<g71> GLYPH<g3> GLYPH<g68>GLYPH<g83>GLYPH<g83>GLYPH<g79>GLYPH<g76>GLYPH<g70>GLYPH<g68>GLYPH<g87>GLYPH<g76>GLYPH<g82>GLYPH<g81>GLYPH<g86>
|
||||||
|
|
||||||
- GLYPH<g115>GLYPH<g3> GLYPH<g53>GLYPH<g72>GLYPH<g79>GLYPH<g92>GLYPH<g3> GLYPH<g82>GLYPH<g81>GLYPH<g3> GLYPH<g44>GLYPH<g37>GLYPH<g48>GLYPH<g3> GLYPH<g72>GLYPH<g91>GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g87>GLYPH<g3> GLYPH<g70>GLYPH<g82>GLYPH<g81>GLYPH<g86>GLYPH<g88>GLYPH<g79>GLYPH<g87>GLYPH<g76>GLYPH<g81>GLYPH<g74>GLYPH<g15>GLYPH<g3> GLYPH<g86>GLYPH<g78>GLYPH<g76>GLYPH<g79>GLYPH<g79>GLYPH<g86> GLYPH<g3> GLYPH<g86>GLYPH<g75>GLYPH<g68>GLYPH<g85>GLYPH<g76>GLYPH<g81>GLYPH<g74>GLYPH<g3> GLYPH<g68>GLYPH<g81>GLYPH<g71>GLYPH<g3> GLYPH<g85>GLYPH<g72>GLYPH<g81>GLYPH<g82>GLYPH<g90>GLYPH<g81>GLYPH<g3> GLYPH<g86>GLYPH<g72>GLYPH<g85>GLYPH<g89>GLYPH<g76>GLYPH<g70>GLYPH<g72>GLYPH<g86>
|
GLYPH<g115>GLYPH<g3> GLYPH<g53>GLYPH<g72>GLYPH<g79>GLYPH<g92>GLYPH<g3> GLYPH<g82>GLYPH<g81>GLYPH<g3> GLYPH<g44>GLYPH<g37>GLYPH<g48>GLYPH<g3> GLYPH<g72>GLYPH<g91>GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g87>GLYPH<g3> GLYPH<g70>GLYPH<g82>GLYPH<g81>GLYPH<g86>GLYPH<g88>GLYPH<g79>GLYPH<g87>GLYPH<g76>GLYPH<g81>GLYPH<g74>GLYPH<g15>GLYPH<g3> GLYPH<g86>GLYPH<g78>GLYPH<g76>GLYPH<g79>GLYPH<g79>GLYPH<g86> GLYPH<g3> GLYPH<g86>GLYPH<g75>GLYPH<g68>GLYPH<g85>GLYPH<g76>GLYPH<g81>GLYPH<g74>GLYPH<g3> GLYPH<g68>GLYPH<g81>GLYPH<g71>GLYPH<g3> GLYPH<g85>GLYPH<g72>GLYPH<g81>GLYPH<g82>GLYPH<g90>GLYPH<g81>GLYPH<g3> GLYPH<g86>GLYPH<g72>GLYPH<g85>GLYPH<g89>GLYPH<g76>GLYPH<g70>GLYPH<g72>GLYPH<g86>
|
||||||
|
|
||||||
- GLYPH<g115>GLYPH<g3> GLYPH<g55> GLYPH<g68>GLYPH<g78>GLYPH<g72>GLYPH<g3> GLYPH<g68>GLYPH<g71>GLYPH<g89>GLYPH<g68>GLYPH<g81>GLYPH<g87>GLYPH<g68>GLYPH<g74>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g68>GLYPH<g70>GLYPH<g70>GLYPH<g72>GLYPH<g86>GLYPH<g86>GLYPH<g3> GLYPH<g87>GLYPH<g82>GLYPH<g3> GLYPH<g68> GLYPH<g3> GLYPH<g90>GLYPH<g82>GLYPH<g85>GLYPH<g79>GLYPH<g71>GLYPH<g90>GLYPH<g76>GLYPH<g71>GLYPH<g72>GLYPH<g3> GLYPH<g86>GLYPH<g82>GLYPH<g88>GLYPH<g85>GLYPH<g70>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g72>GLYPH<g91>GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g87>GLYPH<g76>GLYPH<g86>GLYPH<g72>
|
GLYPH<g115>GLYPH<g3> GLYPH<g55> GLYPH<g68>GLYPH<g78>GLYPH<g72>GLYPH<g3> GLYPH<g68>GLYPH<g71>GLYPH<g89>GLYPH<g68>GLYPH<g81>GLYPH<g87>GLYPH<g68>GLYPH<g74>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g68>GLYPH<g70>GLYPH<g70>GLYPH<g72>GLYPH<g86>GLYPH<g86>GLYPH<g3> GLYPH<g87>GLYPH<g82>GLYPH<g3> GLYPH<g68> GLYPH<g3> GLYPH<g90>GLYPH<g82>GLYPH<g85>GLYPH<g79>GLYPH<g71>GLYPH<g90>GLYPH<g76>GLYPH<g71>GLYPH<g72>GLYPH<g3> GLYPH<g86>GLYPH<g82>GLYPH<g88>GLYPH<g85>GLYPH<g70>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g72>GLYPH<g91>GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g87>GLYPH<g76>GLYPH<g86>GLYPH<g72>
|
||||||
|
|
||||||
<!-- image -->
|
<!-- image -->
|
||||||
|
|
||||||
@ -46,23 +46,23 @@ With combined experiences and direct access to development groups, we're the exp

 Global CoE engagements cover topics including:

-- r Database performance and scalability
+r Database performance and scalability

-- r Advanced SQL knowledge and skills transfer
+r Advanced SQL knowledge and skills transfer

-- r Business intelligence and analytics
+r Business intelligence and analytics

-- r DB2 Web Query
+r DB2 Web Query

-- r Query/400 modernization for better reporting and analysis capabilities
+r Query/400 modernization for better reporting and analysis capabilities

-- r Database modernization and re-engineering
+r Database modernization and re-engineering

-- r Data-centric architecture and design
+r Data-centric architecture and design

-- r Extremely large database and overcoming limits to growth
+r Extremely large database and overcoming limits to growth

-- r ISV education and enablement
+r ISV education and enablement

 ## Preface

@ -96,23 +96,23 @@ Businesses must make a serious effort to secure their data and recognize that se

 This chapter describes how you can secure and protect data in DB2 for i. The following topics are covered in this chapter:

-- GLYPH<SM590000> Security fundamentals
+GLYPH<SM590000> Security fundamentals

-- GLYPH<SM590000> Current state of IBM i security
+GLYPH<SM590000> Current state of IBM i security

-- GLYPH<SM590000> DB2 for i security controls
+GLYPH<SM590000> DB2 for i security controls

 ## 1.1 Security fundamentals

 Before reviewing database security techniques, there are two fundamental steps in securing information assets that must be described:

-- GLYPH<SM590000> First, and most important, is the definition of a company's security policy . Without a security policy, there is no definition of what are acceptable practices for using, accessing, and storing information by who, what, when, where, and how. A security policy should minimally address three things: confidentiality, integrity, and availability.
+GLYPH<SM590000> First, and most important, is the definition of a company's security policy . Without a security policy, there is no definition of what are acceptable practices for using, accessing, and storing information by who, what, when, where, and how. A security policy should minimally address three things: confidentiality, integrity, and availability.

-- The monitoring and assessment of adherence to the security policy determines whether your security strategy is working. Often, IBM security consultants are asked to perform security assessments for companies without regard to the security policy. Although these assessments can be useful for observing how the system is defined and how data is being accessed, they cannot determine the level of security without a security policy. Without a security policy, it really is not an assessment as much as it is a baseline for monitoring the changes in the security settings that are captured.
+The monitoring and assessment of adherence to the security policy determines whether your security strategy is working. Often, IBM security consultants are asked to perform security assessments for companies without regard to the security policy. Although these assessments can be useful for observing how the system is defined and how data is being accessed, they cannot determine the level of security without a security policy. Without a security policy, it really is not an assessment as much as it is a baseline for monitoring the changes in the security settings that are captured.

 A security policy is what defines whether the system and its settings are secure (or not).

-- GLYPH<SM590000> The second fundamental in securing data assets is the use of resource security . If implemented properly, resource security prevents data breaches from both internal and external intrusions. Resource security controls are closely tied to the part of the security policy that defines who should have access to what information resources. A hacker might be good enough to get through your company firewalls and sift his way through to your system, but if they do not have explicit access to your database, the hacker cannot compromise your information assets.
+GLYPH<SM590000> The second fundamental in securing data assets is the use of resource security . If implemented properly, resource security prevents data breaches from both internal and external intrusions. Resource security controls are closely tied to the part of the security policy that defines who should have access to what information resources. A hacker might be good enough to get through your company firewalls and sift his way through to your system, but if they do not have explicit access to your database, the hacker cannot compromise your information assets.

 With your eyes now open to the importance of securing information assets, the rest of this chapter reviews the methods that are available for securing database resources on IBM i.

@ -141,11 +141,11 @@ Figure 1-2 Existing row and column controls

 The following CL commands can be used to work with, display, or change function usage IDs:

-- GLYPH<SM590000> Work Function Usage ( WRKFCNUSG )
+GLYPH<SM590000> Work Function Usage ( WRKFCNUSG )

-- GLYPH<SM590000> Change Function Usage ( CHGFCNUSG )
+GLYPH<SM590000> Change Function Usage ( CHGFCNUSG )

-- GLYPH<SM590000> Display Function Usage ( DSPFCNUSG )
+GLYPH<SM590000> Display Function Usage ( DSPFCNUSG )

 For example, the following CHGFCNUSG command shows granting authorization to user HBEDOYA to administer and manage RCAC rules:

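For reference alongside the ground-truth text above: a minimal, hypothetical sketch of the grant that the HBEDOYA sentence refers to, assuming QIBM_DB_SECADM is the function usage ID that governs RCAC administration and using QSYS2.QCMDEXC to run the CL command from SQL.

```sql
-- Hypothetical sketch: allow user HBEDOYA to administer and manage RCAC rules.
-- QIBM_DB_SECADM is assumed to be the relevant function usage ID;
-- QSYS2.QCMDEXC executes a CL command string from SQL.
CALL QSYS2.QCMDEXC('CHGFCNUSG FCNID(QIBM_DB_SECADM) USER(HBEDOYA) USAGE(*ALLOWED)');
```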
@ -244,15 +244,15 @@ Table 3-1 Special registers and their corresponding values

 Figure 3-5 shows the difference in the special register values when an adopted authority is used:

-- GLYPH<SM590000> A user connects to the server using the user profile ALICE.
+GLYPH<SM590000> A user connects to the server using the user profile ALICE.

-- GLYPH<SM590000> USER and CURRENT USER initially have the same value of ALICE.
+GLYPH<SM590000> USER and CURRENT USER initially have the same value of ALICE.

-- GLYPH<SM590000> ALICE calls an SQL procedure that is named proc1, which is owned by user profile JOE and was created to adopt JOE's authority when it is called.
+GLYPH<SM590000> ALICE calls an SQL procedure that is named proc1, which is owned by user profile JOE and was created to adopt JOE's authority when it is called.

-- GLYPH<SM590000> While the procedure is running, the special register USER still contains the value of ALICE because it excludes any adopted authority. The special register CURRENT USER contains the value of JOE because it includes any adopted authority.
+GLYPH<SM590000> While the procedure is running, the special register USER still contains the value of ALICE because it excludes any adopted authority. The special register CURRENT USER contains the value of JOE because it includes any adopted authority.

-- GLYPH<SM590000> When proc1 ends, the session reverts to its original state with both USER and CURRENT USER having the value of ALICE.
+GLYPH<SM590000> When proc1 ends, the session reverts to its original state with both USER and CURRENT USER having the value of ALICE.

 Figure 3-5 Special registers and adopted authority
 <!-- image -->
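A quick way to observe the two special registers discussed in this hunk (a minimal sketch; SYSIBM.SYSDUMMY1 is the usual one-row dummy table). Running it once directly and once from inside an adopted-authority procedure such as proc1 should show CURRENT USER change while USER stays the same.

```sql
-- Minimal sketch: show both special registers for the current connection.
SELECT USER, CURRENT USER
  FROM SYSIBM.SYSDUMMY1;
```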
@ -303,15 +303,15 @@ WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR', 'EMP' ) = 1 THEN EMPLOYEES . D

 2. The other column to mask in this example is the TAX_ID information. In this example, the rules to enforce include the following ones:

-- -Human Resources can see the unmasked TAX_ID of the employees.
+-Human Resources can see the unmasked TAX_ID of the employees.

-- -Employees can see only their own unmasked TAX_ID.
+-Employees can see only their own unmasked TAX_ID.

-- -Managers see a masked version of TAX_ID with the first five characters replaced with the X character (for example, XXX-XX-1234).
+-Managers see a masked version of TAX_ID with the first five characters replaced with the X character (for example, XXX-XX-1234).

-- -Any other person sees the entire TAX_ID as masked, for example, XXX-XX-XXXX.
+-Any other person sees the entire TAX_ID as masked, for example, XXX-XX-XXXX.

-- To implement this column mask, run the SQL statement that is shown in Example 3-9.
+To implement this column mask, run the SQL statement that is shown in Example 3-9.

 CREATE MASK HR_SCHEMA.MASK_TAX_ID_ON_EMPLOYEES ON HR_SCHEMA.EMPLOYEES AS EMPLOYEES FOR COLUMN TAX_ID RETURN CASE WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR' ) = 1 THEN EMPLOYEES . TAX_ID WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER = EMPLOYEES . USER_ID THEN EMPLOYEES . TAX_ID WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER <> EMPLOYEES . USER_ID THEN ( 'XXX-XX-' CONCAT QSYS2 . SUBSTR ( EMPLOYEES . TAX_ID , 8 , 4 ) ) WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'EMP' ) = 1 THEN EMPLOYEES . TAX_ID ELSE 'XXX-XX-XXXX' END ENABLE ;

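A masked column is queried like any other column; the CASE logic in the mask is applied by the database at query time. A hypothetical check, using only names from Example 3-9, that different users see different TAX_ID values:

```sql
-- Hypothetical check: the TAX_ID value returned depends on the group membership
-- of the user running the query, once the mask is enabled and RCAC is activated.
SELECT TAX_ID
  FROM HR_SCHEMA.EMPLOYEES;
```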
@ -330,9 +330,9 @@ Now that you have created the row permission and the two column masks, RCAC must

 ## Example 3-10 Activating RCAC on the EMPLOYEES table

-- /* Active Row Access Control (permissions) */
+/* Active Row Access Control (permissions) */

-- /* Active Column Access Control (masks)
+/* Active Column Access Control (masks)

 */

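The ground truth keeps only the two comment lines of Example 3-10, so for context here is a rough sketch of the activation statement that example refers to (schema and table names taken from the surrounding example; treat the exact clauses as an assumption).

```sql
-- Hedged sketch of the RCAC activation step referenced by Example 3-10.
ALTER TABLE HR_SCHEMA.EMPLOYEES
  ACTIVATE ROW ACCESS CONTROL
  ACTIVATE COLUMN ACCESS CONTROL;
```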
@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "2203.01017v2",
 "origin": {
 "mimetype": "application/pdf",
@ -9249,7 +9249,7 @@
 "formatting": null,
 "hyperlink": null,
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/284",
@ -9280,7 +9280,7 @@
 "formatting": null,
 "hyperlink": null,
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/285",
@ -11348,7 +11348,7 @@
 "formatting": null,
 "hyperlink": null,
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/356",
@ -12553,7 +12553,7 @@
 "formatting": null,
 "hyperlink": null,
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/395",
@ -15148,7 +15148,7 @@
 "formatting": null,
 "hyperlink": null,
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/483",
@ -15452,7 +15452,7 @@
 "formatting": null,
 "hyperlink": null,
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/493",
@ -15483,7 +15483,7 @@
 "formatting": null,
 "hyperlink": null,
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/494",
@ -15514,7 +15514,7 @@
 "formatting": null,
 "hyperlink": null,
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/495",
@ -15545,7 +15545,7 @@
 "formatting": null,
 "hyperlink": null,
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/496",
108 tests/data/groundtruth/docling_v2/2203.01017v2.md vendored
@ -16,8 +16,8 @@ The occurrence of tables in documents is ubiquitous. They often summarise quanti

 <!-- image -->

-- Red-annotation of bounding boxes, Blue-predictions by TableFormer
+- b. Red-annotation of bounding boxes, Blue-predictions by TableFormer
-- Structure predicted by TableFormer:
+- c. Structure predicted by TableFormer:

 <!-- image -->

@ -280,50 +280,50 @@ In this paper, we presented TableFormer an end-to-end transformer based approach
|
|||||||
|
|
||||||
## References
|
## References
|
||||||
|
|
||||||
- Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-
|
- [1] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-
|
||||||
|
|
||||||
- end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 , pages 213-229, Cham, 2020. Springer International Publishing. 5
|
- end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 , pages 213-229, Cham, 2020. Springer International Publishing. 5
|
||||||
- Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling Mao. Complicated table structure recognition. arXiv preprint arXiv:1908.04729 , 2019. 3
|
- [2] Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling Mao. Complicated table structure recognition. arXiv preprint arXiv:1908.04729 , 2019. 3
|
||||||
- Bertrand Couasnon and Aurelie Lemaitre. Recognition of Tables and Forms , pages 647-677. Springer London, London, 2014. 2
|
- [3] Bertrand Couasnon and Aurelie Lemaitre. Recognition of Tables and Forms , pages 647-677. Springer London, London, 2014. 2
|
||||||
- Herv´e D´ejean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2
|
- [4] Herv´e D´ejean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2
|
||||||
- Basilios Gatos, Dimitrios Danatsas, Ioannis Pratikakis, and Stavros J Perantonis. Automatic table detection in document images. In International Conference on Pattern Recognition and Image Analysis , pages 609-618. Springer, 2005. 2
|
- [5] Basilios Gatos, Dimitrios Danatsas, Ioannis Pratikakis, and Stavros J Perantonis. Automatic table detection in document images. In International Conference on Pattern Recognition and Image Analysis , pages 609-618. Springer, 2005. 2
|
||||||
- Max G¨obel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2
|
- [6] Max G¨obel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2
|
||||||
- EA Green and M Krishnamoorthy. Recognition of tables using table grammars. procs. In Symposium on Document Analysis and Recognition (SDAIR'95) , pages 261-277. 2
|
- [7] EA Green and M Krishnamoorthy. Recognition of tables using table grammars. procs. In Symposium on Document Analysis and Recognition (SDAIR'95) , pages 261-277. 2
|
||||||
- Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal. Castabdetectors: Cascade network for table detection in document images with recursive feature pyramid and switchable atrous convolution. Journal of Imaging , 7(10), 2021. 1
|
- [8] Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal. Castabdetectors: Cascade network for table detection in document images with recursive feature pyramid and switchable atrous convolution. Journal of Imaging , 7(10), 2021. 1
|
||||||
- Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) , Oct 2017. 1
|
- [9] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) , Oct 2017. 1
|
||||||
- Yelin He, X. Qi, Jiaquan Ye, Peng Gao, Yihao Chen, Bingcong Li, Xin Tang, and Rong Xiao. Pingan-vcgroup's solution for icdar 2021 competition on scientific table image recognition to latex. ArXiv , abs/2105.01846, 2021. 2
|
- [10] Yelin He, X. Qi, Jiaquan Ye, Peng Gao, Yihao Chen, Bingcong Li, Xin Tang, and Rong Xiao. Pingan-vcgroup's solution for icdar 2021 competition on scientific table image recognition to latex. ArXiv , abs/2105.01846, 2021. 2
|
||||||
- Jianying Hu, Ramanujan S Kashi, Daniel P Lopresti, and Gordon Wilfong. Medium-independent table detection. In Document Recognition and Retrieval VII , volume 3967, pages 291-302. International Society for Optics and Photonics, 1999. 2
|
- [11] Jianying Hu, Ramanujan S Kashi, Daniel P Lopresti, and Gordon Wilfong. Medium-independent table detection. In Document Recognition and Retrieval VII , volume 3967, pages 291-302. International Society for Optics and Photonics, 1999. 2
|
||||||
- Matthew Hurst. A constraint-based approach to table structure derivation. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2 , ICDAR '03, page 911, USA, 2003. IEEE Computer Society. 2
|
- [12] Matthew Hurst. A constraint-based approach to table structure derivation. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2 , ICDAR '03, page 911, USA, 2003. IEEE Computer Society. 2
|
||||||
- Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ement Chatelain, and Thierry Paquet. Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2
|
- [13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ement Chatelain, and Thierry Paquet. Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2
|
||||||
- Pratik Kayal, Mrinal Anand, Harsh Desai, and Mayank Singh. Icdar 2021 competition on scientific table image recognition to latex, 2021. 2
|
- [14] Pratik Kayal, Mrinal Anand, Harsh Desai, and Mayank Singh. Icdar 2021 competition on scientific table image recognition to latex, 2021. 2
|
||||||
- Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6
|
- [15] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6
|
||||||
- Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence , 35(12):2891-2903, 2013. 4
|
- [16] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence , 35(12):2891-2903, 2013. 4
|
||||||
- Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. Tablebank: A benchmark dataset for table detection and recognition, 2019. 2, 3
|
- [17] Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. Tablebank: A benchmark dataset for table detection and recognition, 2019. 2, 3
|
||||||
- Yiren Li, Zheng Huang, Junchi Yan, Yi Zhou, Fan Ye, and Xianhui Liu. Gfte: Graph-based financial table extraction. In Alberto Del Bimbo, Rita Cucchiara, Stan Sclaroff, Giovanni Maria Farinella, Tao Mei, Marco Bertini, Hugo Jair Escalante, and Roberto Vezzani, editors, Pattern Recognition. ICPR International Workshops and Challenges , pages 644-658, Cham, 2021. Springer International Publishing. 2, 3
|
- [18] Yiren Li, Zheng Huang, Junchi Yan, Yi Zhou, Fan Ye, and Xianhui Liu. Gfte: Graph-based financial table extraction. In Alberto Del Bimbo, Rita Cucchiara, Stan Sclaroff, Giovanni Maria Farinella, Tao Mei, Marco Bertini, Hugo Jair Escalante, and Roberto Vezzani, editors, Pattern Recognition. ICPR International Workshops and Challenges , pages 644-658, Cham, 2021. Springer International Publishing. 2, 3
|
||||||
- Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, and Peter Staar. Robust pdf document conversion using recurrent neural networks. Proceedings of the AAAI Conference on Artificial Intelligence , 35(17):15137-15145, May 2021. 1
|
- [19] Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, and Peter Staar. Robust pdf document conversion using recurrent neural networks. Proceedings of the AAAI Conference on Artificial Intelligence , 35(17):15137-15145, May 2021. 1
|
||||||
- Rujiao Long, Wen Wang, Nan Xue, Feiyu Gao, Zhibo Yang, Yongpan Wang, and Gui-Song Xia. Parsing table structures in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 944-952, 2021. 2
|
- [20] Rujiao Long, Wen Wang, Nan Xue, Feiyu Gao, Zhibo Yang, Yongpan Wang, and Gui-Song Xia. Parsing table structures in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 944-952, 2021. 2
|
||||||
- Shubham Singh Paliwal, D Vishwanath, Rohit Rahul, Monika Sharma, and Lovekesh Vig. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 128-133. IEEE, 2019. 1
|
- [21] Shubham Singh Paliwal, D Vishwanath, Rohit Rahul, Monika Sharma, and Lovekesh Vig. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 128-133. IEEE, 2019. 1
|
||||||
- Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6
|
- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6
|
||||||
- Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. Cascadetabnet: An approach for end to end table detection and structure recognition from image-based documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages 572-573, 2020. 1
|
- [23] Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. Cascadetabnet: An approach for end to end table detection and structure recognition from image-based documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages 572-573, 2020. 1
|
||||||
- Shah Rukh Qasim, Hassan Mahmood, and Faisal Shafait. Rethinking table recognition using graph neural networks. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 142-147. IEEE, 2019. 3
|
- [24] Shah Rukh Qasim, Hassan Mahmood, and Faisal Shafait. Rethinking table recognition using graph neural networks. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 142-147. IEEE, 2019. 3
|
||||||
- Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on
|
- [25] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on
|
||||||
|
|
||||||
Computer Vision and Pattern Recognition , pages 658-666, 2019. 6
|
Computer Vision and Pattern Recognition , pages 658-666, 2019. 6
|
||||||
|
|
||||||
- Sebastian Schreiber, Stefan Agne, Ivo Wolf, Andreas Dengel, and Sheraz Ahmed. Deepdesrt: Deep learning for detection and structure recognition of tables in document images. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) , volume 01, pages 11621167, 2017. 1
|
- [26] Sebastian Schreiber, Stefan Agne, Ivo Wolf, Andreas Dengel, and Sheraz Ahmed. Deepdesrt: Deep learning for detection and structure recognition of tables in document images. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) , volume 01, pages 11621167, 2017. 1
|
||||||
- Sebastian Schreiber, Stefan Agne, Ivo Wolf, Andreas Dengel, and Sheraz Ahmed. Deepdesrt: Deep learning for detection and structure recognition of tables in document images. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR) , volume 1, pages 1162-1167. IEEE, 2017. 3
|
- [27] Sebastian Schreiber, Stefan Agne, Ivo Wolf, Andreas Dengel, and Sheraz Ahmed. Deepdesrt: Deep learning for detection and structure recognition of tables in document images. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR) , volume 1, pages 1162-1167. IEEE, 2017. 3
|
||||||
- Faisal Shafait and Ray Smith. Table detection in heterogeneous documents. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems , pages 6572, 2010. 2
|
- [28] Faisal Shafait and Ray Smith. Table detection in heterogeneous documents. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems , pages 6572, 2010. 2
|
||||||
- Shoaib Ahmed Siddiqui, Imran Ali Fateh, Syed Tahseen Raza Rizvi, Andreas Dengel, and Sheraz Ahmed. Deeptabstr: Deep learning based table structure recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 1403-1409. IEEE, 2019. 3
|
- [29] Shoaib Ahmed Siddiqui, Imran Ali Fateh, Syed Tahseen Raza Rizvi, Andreas Dengel, and Sheraz Ahmed. Deeptabstr: Deep learning based table structure recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 1403-1409. IEEE, 2019. 3
|
||||||
- Peter W J Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD , KDD '18, pages 774-782, New York, NY, USA, 2018. ACM. 1
|
- [30] Peter W J Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD , KDD '18, pages 774-782, New York, NY, USA, 2018. ACM. 1
|
||||||
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30 , pages 5998-6008. Curran Associates, Inc., 2017. 5
|
- [31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30 , pages 5998-6008. Curran Associates, Inc., 2017. 5
|
||||||
- Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , June 2015. 2
|
- [32] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , June 2015. 2
|
||||||
- Wenyuan Xue, Qingyong Li, and Dacheng Tao. Res2tim: reconstruct syntactic structures from table images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 749-755. IEEE, 2019. 3
|
- [33] Wenyuan Xue, Qingyong Li, and Dacheng Tao. Res2tim: reconstruct syntactic structures from table images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 749-755. IEEE, 2019. 3
|
||||||
- Wenyuan Xue, Baosheng Yu, Wen Wang, Dacheng Tao, and Qingyong Li. Tgrnet: A table graph reconstruction network for table structure recognition. arXiv preprint arXiv:2106.10598 , 2021. 3
|
- [34] Wenyuan Xue, Baosheng Yu, Wen Wang, Dacheng Tao, and Qingyong Li. Tgrnet: A table graph reconstruction network for table structure recognition. arXiv preprint arXiv:2106.10598 , 2021. 3
|
||||||
- Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 4651-4659, 2016. 4
|
- [35] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 4651-4659, 2016. 4
|
||||||
- Xinyi Zheng, Doug Burdick, Lucian Popa, Peter Zhong, and Nancy Xin Ru Wang. Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context. Winter Conference for Applications in Computer Vision (WACV) , 2021. 2, 3
|
- [36] Xinyi Zheng, Doug Burdick, Lucian Popa, Peter Zhong, and Nancy Xin Ru Wang. Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context. Winter Conference for Applications in Computer Vision (WACV) , 2021. 2, 3
|
||||||
- Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: Data, model,
|
- [37] Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: Data, model,
|
||||||
- and evaluation. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision ECCV 2020 , pages 564-580, Cham, 2020. Springer International Publishing. 2, 3, 7
|
- and evaluation. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision ECCV 2020 , pages 564-580, Cham, 2020. Springer International Publishing. 2, 3, 7
|
||||||
- Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. Publaynet: Largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 1015-1022, 2019. 1
|
- [38] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. Publaynet: Largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 1015-1022, 2019. 1
|
||||||
|
|
||||||
## TableFormer: Table Structure Understanding with Transformers Supplementary Material
|
## TableFormer: Table Structure Understanding with Transformers Supplementary Material
|
||||||
|
|
||||||
@ -343,11 +343,11 @@ Aiming to train and evaluate our models in a broader spectrum of table data we h
|
|||||||
|
|
||||||
The process of generating a synthetic dataset can be decomposed into the following steps:
|
The process of generating a synthetic dataset can be decomposed into the following steps:
|
||||||
|
|
||||||
- Prepare styling and content templates: The styling templates have been manually designed and organized into groups of scope specific appearances (e.g. financial data, marketing data, etc.) Additionally, we have prepared curated collections of content templates by extracting the most frequently used terms out of non-synthetic datasets (e.g. PubTabNet, FinTabNet, etc.).
|
1. Prepare styling and content templates: The styling templates have been manually designed and organized into groups of scope specific appearances (e.g. financial data, marketing data, etc.) Additionally, we have prepared curated collections of content templates by extracting the most frequently used terms out of non-synthetic datasets (e.g. PubTabNet, FinTabNet, etc.).
|
||||||
- Generate table structures: The structure of each synthetic dataset assumes a horizontal table header which potentially spans over multiple rows and a table body that may contain a combination of row spans and column spans. However, spans are not allowed to cross the header - body boundary. The table structure is described by the parameters: Total number of table rows and columns, number of header rows, type of spans (header only spans, row only spans, column only spans, both row and column spans), maximum span size and the ratio of the table area covered by spans.
|
2. Generate table structures: The structure of each synthetic dataset assumes a horizontal table header which potentially spans over multiple rows and a table body that may contain a combination of row spans and column spans. However, spans are not allowed to cross the header - body boundary. The table structure is described by the parameters: Total number of table rows and columns, number of header rows, type of spans (header only spans, row only spans, column only spans, both row and column spans), maximum span size and the ratio of the table area covered by spans.
|
||||||
- Generate content: Based on the dataset theme , a set of suitable content templates is chosen first. Then, this content can be combined with purely random text to produce the synthetic content.
|
3. Generate content: Based on the dataset theme , a set of suitable content templates is chosen first. Then, this content can be combined with purely random text to produce the synthetic content.
|
||||||
- Apply styling templates: Depending on the domain of the synthetic dataset, a set of styling templates is first manually selected. Then, a style is randomly selected to format the appearance of the synthesized table.
|
4. Apply styling templates: Depending on the domain of the synthetic dataset, a set of styling templates is first manually selected. Then, a style is randomly selected to format the appearance of the synthesized table.
|
||||||
- Render the complete tables: The synthetic table is finally rendered by a web browser engine to generate the bounding boxes for each table cell. A batching technique is utilized to optimize the runtime overhead of the rendering process.
|
5. Render the complete tables: The synthetic table is finally rendered by a web browser engine to generate the bounding boxes for each table cell. A batching technique is utilized to optimize the runtime overhead of the rendering process.
|
||||||
|
|
||||||
## 2. Prediction post-processing for PDF documents
|
## 2. Prediction post-processing for PDF documents
|
||||||
|
|
||||||
@ -366,21 +366,21 @@ However, it is possible to mitigate those limitations by combining the TableForm
|
|||||||
|
|
||||||
Here is a step-by-step description of the prediction postprocessing:
|
Here is a step-by-step description of the prediction postprocessing:
|
||||||
|
|
||||||
- Get the minimal grid dimensions - number of rows and columns for the predicted table structure. This represents the most granular grid for the underlying table structure.
|
1. Get the minimal grid dimensions - number of rows and columns for the predicted table structure. This represents the most granular grid for the underlying table structure.
|
||||||
- Generate pair-wise matches between the bounding boxes of the PDF cells and the predicted cells. The Intersection Over Union (IOU) metric is used to evaluate the quality of the matches.
|
2. Generate pair-wise matches between the bounding boxes of the PDF cells and the predicted cells. The Intersection Over Union (IOU) metric is used to evaluate the quality of the matches.
|
||||||
- Use a carefully selected IOU threshold to designate the matches as "good" ones and "bad" ones.
|
3. Use a carefully selected IOU threshold to designate the matches as "good" ones and "bad" ones.
|
||||||
- 3.a. If all IOU scores in a column are below the threshold, discard all predictions (structure and bounding boxes) for that column.
|
- 3.a. If all IOU scores in a column are below the threshold, discard all predictions (structure and bounding boxes) for that column.
|
||||||
- Find the best-fitting content alignment for the predicted cells with good IOU per each column. The alignment of the column can be identified by the following formula:
|
4. Find the best-fitting content alignment for the predicted cells with good IOU per each column. The alignment of the column can be identified by the following formula:
|
||||||
|
|
||||||
<!-- formula-not-decoded -->
<!-- formula-not-decoded -->

where c is one of { left, centroid, right } and x$\_{c}$ is the x-coordinate for the corresponding point.

- Use the alignment computed in step 4, to compute the median x-coordinate for all table columns and the median cell size for all table cells.
5. Use the alignment computed in step 4, to compute the median x-coordinate for all table columns and the median cell size for all table cells.

- Snap all cells with bad IOU to their corresponding median x-coordinates and cell sizes.
6. Snap all cells with bad IOU to their corresponding median x-coordinates and cell sizes.

- Generate a new set of pair-wise matches between the corrected bounding boxes and PDF cells. This time use a modified version of the IOU metric, where the area of the intersection between the predicted and PDF cells is divided by the PDF cell area. In case there are multiple matches for the same PDF cell, the prediction with the higher score is preferred. This covers the cases where the PDF cells are smaller than the predicted or corrected prediction cells.
7. Generate a new set of pair-wise matches between the corrected bounding boxes and PDF cells. This time use a modified version of the IOU metric, where the area of the intersection between the predicted and PDF cells is divided by the PDF cell area. In case there are multiple matches for the same PDF cell, the prediction with the higher score is preferred. This covers the cases where the PDF cells are smaller than the predicted or corrected prediction cells.

- In some rare occasions, we have noticed that TableFormer can confuse a single column as two. When the post-processing steps are applied, this results in two predicted columns pointing to the same PDF column. In such cases we must de-duplicate the columns according to the highest total column intersection score.
8. In some rare occasions, we have noticed that TableFormer can confuse a single column as two. When the post-processing steps are applied, this results in two predicted columns pointing to the same PDF column. In such cases we must de-duplicate the columns according to the highest total column intersection score.

- Pick up the remaining orphan cells. There could be cases when, after applying all the previous post-processing steps, some PDF cells still remain without any match to predicted cells. However, it is still possible to deduce the correct matching for an orphan PDF cell by mapping its bounding box on the geometry of the grid. This mapping decides if the content of the orphan cell will be appended to an already matched table cell, or a new table cell should be created to match with the orphan.
9. Pick up the remaining orphan cells. There could be cases when, after applying all the previous post-processing steps, some PDF cells still remain without any match to predicted cells. However, it is still possible to deduce the correct matching for an orphan PDF cell by mapping its bounding box on the geometry of the grid. This mapping decides if the content of the orphan cell will be appended to an already matched table cell, or a new table cell should be created to match with the orphan.

9a. Compute the top and bottom boundary of the horizontal band for each grid row (min/max y coordinates per row).
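The modified IOU metric in step 7 above is simply the intersection area normalised by the PDF-cell area instead of by the union. A minimal sketch of that idea (illustrative only; the box layout and function names are assumptions, not the docling implementation):

```python
def intersection_area(box_a, box_b):
    """Intersection area of two (x0, y0, x1, y1) boxes; 0.0 if they do not overlap."""
    x0 = max(box_a[0], box_b[0])
    y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2])
    y1 = min(box_a[3], box_b[3])
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)

def modified_iou(pred_box, pdf_cell_box):
    """Intersection divided by the PDF cell area (step 7), so a small PDF cell
    fully covered by a larger predicted cell still scores 1.0."""
    pdf_area = (pdf_cell_box[2] - pdf_cell_box[0]) * (pdf_cell_box[3] - pdf_cell_box[1])
    if pdf_area <= 0:
        return 0.0
    return intersection_area(pred_box, pdf_cell_box) / pdf_area
```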
@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "2206.01062",
"origin": {
"mimetype": "application/pdf",

68 tests/data/groundtruth/docling_v2/2206.01062.md vendored
@ -48,16 +48,16 @@ A key problem in the process of document conversion is to understand the structure

In this paper, we present the DocLayNet dataset. It provides page-by-page layout annotation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique document pages, of which a fraction carry double- or triple-annotations. DocLayNet is similar in spirit to PubLayNet and DocBank and will likewise be made available to the public 1 in order to stimulate the document-layout analysis community. It distinguishes itself in the following aspects:

- Human Annotation : In contrast to PubLayNet and DocBank, we relied on human annotation instead of automation approaches to generate the data set.
- (1) Human Annotation : In contrast to PubLayNet and DocBank, we relied on human annotation instead of automation approaches to generate the data set.

- Large Layout Variability : We include diverse and complex layouts from a large variety of public sources.
- (2) Large Layout Variability : We include diverse and complex layouts from a large variety of public sources.

- Detailed Label Set : We define 11 class labels to distinguish layout features in high detail. PubLayNet provides 5 labels; DocBank provides 13, although not a superset of ours.
- (3) Detailed Label Set : We define 11 class labels to distinguish layout features in high detail. PubLayNet provides 5 labels; DocBank provides 13, although not a superset of ours.

- Redundant Annotations : A fraction of the pages in the DocLayNet data set carry more than one human annotation.
- (4) Redundant Annotations : A fraction of the pages in the DocLayNet data set carry more than one human annotation.

$^{1}$https://developer.ibm.com/exchanges/data/all/doclaynet

This enables experimentation with annotation uncertainty and quality control analysis.

- Pre-defined Train-, Test- & Validation-set : Like DocBank, we provide fixed train-, test- & validation-sets to ensure proportional representation of the class-labels. Further, we prevent leakage of unique layouts across sets, which has a large effect on model accuracy scores.
- (5) Pre-defined Train-, Test- & Validation-set : Like DocBank, we provide fixed train-, test- & validation-sets to ensure proportional representation of the class-labels. Further, we prevent leakage of unique layouts across sets, which has a large effect on model accuracy scores.

All aspects outlined above are detailed in Section 3. In Section 4, we will elaborate on how we designed and executed this large-scale human annotation campaign. We will also share key insights and lessons learned that might prove helpful for other parties planning to set up annotation campaigns.
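The leakage remark in point (5) above amounts to splitting at the level of whole documents (or unique layouts) rather than individual pages, so near-identical pages never land on both sides of the split. A hypothetical sketch of such a group-aware split (plain Python; the `doc_id` field and fractions are assumptions for illustration, not the DocLayNet tooling):

```python
import random
from collections import defaultdict

def split_by_group(pages, group_key="doc_id", test_fraction=0.2, seed=42):
    """Assign whole groups (e.g. all pages of one document/layout) to either
    train or test, so that no unique layout leaks across the split."""
    groups = defaultdict(list)
    for page in pages:
        groups[page[group_key]].append(page)

    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    n_test = max(1, int(len(group_ids) * test_fraction))
    test_ids = set(group_ids[:n_test])

    train = [p for gid, ps in groups.items() if gid not in test_ids for p in ps]
    test = [p for gid, ps in groups.items() if gid in test_ids for p in ps]
    return train, test

# pages = [{"doc_id": "report-07", "page_no": 3, ...}, ...]
# train_pages, test_pages = split_by_group(pages)
```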
@ -137,12 +137,12 @@ At first sight, the task of visual document-layout interpretation appears intuitive

Obviously, this inconsistency in annotations is not desirable for datasets which are intended to be used for model training. To minimise these inconsistencies, we created a detailed annotation guideline. While perfect consistency across 40 annotation staff members is clearly not possible to achieve, we saw a huge improvement in annotation consistency after the introduction of our annotation guideline. A few selected, non-trivial highlights of the guideline are:

- Every list-item is an individual object instance with class label List-item . This definition is different from PubLayNet and DocBank, where all list-items are grouped together into one List object.
- (1) Every list-item is an individual object instance with class label List-item . This definition is different from PubLayNet and DocBank, where all list-items are grouped together into one List object.

- A List-item is a paragraph with hanging indentation. Single-line elements can qualify as List-item if the neighbour elements expose hanging indentation. Bullet or enumeration symbols are not a requirement.
- (2) A List-item is a paragraph with hanging indentation. Single-line elements can qualify as List-item if the neighbour elements expose hanging indentation. Bullet or enumeration symbols are not a requirement.

- For every Caption , there must be exactly one corresponding Picture or Table .
- (3) For every Caption , there must be exactly one corresponding Picture or Table .

- Connected sub-pictures are grouped together in one Picture object.
- (4) Connected sub-pictures are grouped together in one Picture object.

- Formula numbers are included in a Formula object.
- (5) Formula numbers are included in a Formula object.

- Emphasised text (e.g. in italic or bold) at the beginning of a paragraph is not considered a Section-header , unless it appears exclusively on its own line.
- (6) Emphasised text (e.g. in italic or bold) at the beginning of a paragraph is not considered a Section-header , unless it appears exclusively on its own line.

The complete annotation guideline is over 100 pages long and a detailed description is obviously out of scope for this paper. Nevertheless, it will be made publicly available alongside with DocLayNet for future reference.
@ -284,19 +284,19 @@ To date, there is still a significant gap between human and ML accuracy on the l

## REFERENCES

- [1] Max Göbel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013.
- [2] Christian Clausner, Apostolos Antonacopoulos, and Stefan Pletschacher. Icdar2017 competition on recognition of documents with complex layouts rdcl2017. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) , volume 01, pages 1404-1410, 2017.
- [3] Hervé Déjean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), April 2019. http://sac.founderit.com/.
- [4] Antonio Jimeno Yepes, Peter Zhong, and Douglas Burdick. Competition on scientific literature parsing. In Proceedings of the International Conference on Document Analysis and Recognition , ICDAR, pages 605-617. LNCS 12824, SpringerVerlag, sep 2021.
- [5] Logan Markewich, Hao Zhang, Yubin Xing, Navid Lambert-Shirzad, Jiang Zhexin, Roy Lee, Zhi Li, and Seok-Bum Ko. Segmentation for document layout analysis: not dead yet. International Journal on Document Analysis and Recognition (IJDAR) , pages 1-11, 01 2022.
- [6] Xu Zhong, Jianbin Tang, and Antonio Jimeno-Yepes. Publaynet: Largest dataset ever for document layout analysis. In Proceedings of the International Conference on Document Analysis and Recognition , ICDAR, pages 1015-1022, sep 2019.
- [7] Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. Docbank: A benchmark dataset for document layout analysis. In Proceedings of the 28th International Conference on Computational Linguistics , COLING, pages 949-960. International Committee on Computational Linguistics, dec 2020.
- [8] Riaz Ahmad, Muhammad Tanvir Afzal, and M. Qadir. Information extraction from pdf sources based on rule-based system using integrated formats. In SemWebEval@ESWC , 2016.
- [9] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition , CVPR, pages 580-587. IEEE Computer Society, jun 2014.
- [10] Ross B. Girshick. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision , ICCV, pages 1440-1448. IEEE Computer Society, dec 2015.
- [11] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence , 39(6):1137-1149, 2017.
- [12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision , ICCV, pages 2980-2988. IEEE Computer Society, Oct 2017.
- [13] Glenn Jocher, Alex Stoken, Ayush Chaurasia, Jirka Borovec, NanoCode012, TaoXie, Yonghye Kwon, Kalen Michael, Liu Changyu, Jiacong Fang, Abhiram V, Laughing, tkianai, yxNONG, Piotr Skalski, Adam Hogan, Jebastin Nadar, imyhxy, Lorenzo Mammana, Alex Wang, Cristi Fati, Diego Montes, Jan Hajek, Laurentiu

Text Caption List-Item Formula Table Section-Header Picture Page-Header Page-Footer Title
@ -306,13 +306,13 @@ Figure 6: Example layout predictions on selected pages from the DocLayNet test-set

Diaconu, Mai Thanh Minh, Marc, albinxavi, fatih, oleg, and wanghao yang. ultralytics/yolov5: v6.0 - yolov5n nano models, roboflow integration, tensorflow export, opencv dnn support, October 2021.

- [20] Shoubin Li, Xuyan Ma, Shuaiqun Pan, Jun Hu, Lin Shi, and Qing Wang. Vtlayout: Fusion of visual and text features for document layout analysis, 2021.
- [14] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. CoRR , abs/2005.12872, 2020.
- [15] Mingxing Tan, Ruoming Pang, and Quoc V. Le. Efficientdet: Scalable and efficient object detection. CoRR , abs/1911.09070, 2019.
- [16] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context, 2014.
- [17] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2, 2019.
- [18] Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, and Peter W. J. Staar. Robust pdf document conversion using recurrent neural networks. In Proceedings of the 35th Conference on Artificial Intelligence , AAAI, pages 15137-15145, feb 2021.
- [19] Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. Layoutlm: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , KDD, pages 1192-1200, New York, USA, 2020. Association for Computing Machinery.
- [21] Peng Zhang, Can Li, Liang Qiao, Zhanzhan Cheng, Shiliang Pu, Yi Niu, and Fei Wu. Vsr: A unified framework for document layout analysis combining vision, semantics and relations, 2021.
- [22] Peter W J Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , KDD, pages 774-782. ACM, 2018.
- [23] Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data , 6(1):60, 2019.
@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "2305.03393v1-pg9",
"origin": {
"mimetype": "application/pdf",
@ -60,8 +60,6 @@
<page_header><loc_159><loc_59><loc_366><loc_64>Optimized Table Tokenization for Table Structure Recognition</page_header>
<page_header><loc_389><loc_59><loc_393><loc_64>7</page_header>
<picture><loc_135><loc_103><loc_367><loc_177><caption><loc_110><loc_79><loc_393><loc_98>Fig. 3. OTSL description of table structure: A - table example; B - graphical representation of table structure; C - mapping structure on a grid; D - OTSL structure encoding; E - explanation on cell encoding</caption></picture>
<unordered_list><list_item><loc_273><loc_172><loc_349><loc_176>4 - 2d merges: "C", "L", "U", "X"</list_item>
</unordered_list>
<section_header_level_1><loc_110><loc_193><loc_202><loc_198>4.2 Language Syntax</section_header_level_1>
<text><loc_110><loc_205><loc_297><loc_211>The OTSL representation follows these syntax rules:</text>
<unordered_list><list_item><loc_114><loc_219><loc_393><loc_232>Left-looking cell rule : The left neighbour of an "L" cell must be either another "L" cell or a "C" cell.</list_item>
1084 tests/data/groundtruth/docling_v2/2305.03393v1.json vendored
File diff suppressed because it is too large
@ -84,21 +84,19 @@ Fig. 3. OTSL description of table structure: A - table example; B - graphical re

<!-- image -->

- 4 - 2d merges: "C", "L", "U", "X"

## 4.2 Language Syntax

The OTSL representation follows these syntax rules:

- Left-looking cell rule : The left neighbour of an "L" cell must be either another "L" cell or a "C" cell.
1. Left-looking cell rule : The left neighbour of an "L" cell must be either another "L" cell or a "C" cell.

- Up-looking cell rule : The upper neighbour of a "U" cell must be either another "U" cell or a "C" cell.
2. Up-looking cell rule : The upper neighbour of a "U" cell must be either another "U" cell or a "C" cell.

## 3. Cross cell rule :

- The left neighbour of an "X" cell must be either another "X" cell or a "U" cell, and the upper neighbour of an "X" cell must be either another "X" cell or an "L" cell.

- First row rule : Only "L" cells and "C" cells are allowed in the first row.
4. First row rule : Only "L" cells and "C" cells are allowed in the first row.

- First column rule : Only "U" cells and "C" cells are allowed in the first column.
5. First column rule : Only "U" cells and "C" cells are allowed in the first column.

- Rectangular rule : The table representation is always rectangular - all rows must have an equal number of tokens, terminated with "NL" token.
6. Rectangular rule : The table representation is always rectangular - all rows must have an equal number of tokens, terminated with "NL" token.

The application of these rules gives OTSL a set of unique properties. First of all, the OTSL enforces a strictly rectangular structure representation, where every new-line token starts a new row. As a consequence, all rows and all columns have exactly the same number of tokens, irrespective of cell spans. Secondly, the OTSL representation is unambiguous: Every table structure is represented in one way. In this representation every table cell corresponds to a "C"-cell token, which in case of spans is always located in the top-left corner of the table cell definition. Third, OTSL syntax rules are only backward-looking. As a consequence, every predicted token can be validated straight during sequence generation by looking at the previously predicted sequence. As such, OTSL can guarantee that every predicted sequence is syntactically valid.
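Because the OTSL rules quoted above are purely backward-looking, a candidate token sequence can be checked incrementally while it is being generated. A rough sketch of such a validator (token names "C", "L", "U", "X", "NL" as in the excerpt; this is an illustration, not the reference implementation):

```python
def is_valid_otsl(tokens):
    """Check an OTSL token sequence against the syntax rules quoted above:
    left-looking, up-looking, cross, first-row/column and rectangular rules."""
    rows, current = [], []
    for tok in tokens:
        if tok == "NL":
            rows.append(current)
            current = []
            continue
        r, c = len(rows), len(current)
        left = current[c - 1] if c > 0 else None
        up = rows[r - 1][c] if r > 0 and c < len(rows[r - 1]) else None
        if tok == "L" and left not in ("L", "C"):
            return False  # left-looking cell rule
        if tok == "U" and up not in ("U", "C"):
            return False  # up-looking cell rule
        if tok == "X" and (left not in ("X", "U") or up not in ("X", "L")):
            return False  # cross cell rule
        if r == 0 and tok not in ("L", "C"):
            return False  # first row rule
        if c == 0 and tok not in ("U", "C"):
            return False  # first column rule
        current.append(tok)
    # rectangular rule: every row closed by "NL" and all rows of equal length
    return not current and len({len(row) for row in rows}) <= 1

# is_valid_otsl(["C", "L", "NL", "U", "X", "NL"]) -> True
```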
@ -177,28 +175,28 @@ Secondly, OTSL has more inherent structure and a significantly restricted vocabulary

## References

1. Auer, C., Dolfi, M., Carvalho, A., Ramis, C.B., Staar, P.W.J.: Delivering document conversion as a cloud service with high throughput and responsiveness. CoRR abs/2206.00785 (2022). https://doi.org/10.48550/arXiv.2206.00785 , https://doi.org/10.48550/arXiv.2206.00785
2. Chen, B., Peng, D., Zhang, J., Ren, Y., Jin, L.: Complex table structure recognition in the wild using transformer and identity matrix-based augmentation. In: Porwal, U., Fornés, A., Shafait, F. (eds.) Frontiers in Handwriting Recognition. pp. 545-561. Springer International Publishing, Cham (2022)
3. Chi, Z., Huang, H., Xu, H.D., Yu, H., Yin, W., Mao, X.L.: Complicated table structure recognition. arXiv preprint arXiv:1908.04729 (2019)
4. Deng, Y., Rosenberg, D., Mann, G.: Challenges in end-to-end neural scientific table recognition. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 894-901. IEEE (2019)

5. Kayal, P., Anand, M., Desai, H., Singh, M.: Tables to latex: structure and content extraction from scientific tables. International Journal on Document Analysis and Recognition (IJDAR) pp. 1-10 (2022)
6. Lee, E., Kwon, J., Yang, H., Park, J., Lee, S., Koo, H.I., Cho, N.I.: Table structure recognition based on grid shape graph. In: 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). pp. 1868-1873. IEEE (2022)
7. Li, M., Cui, L., Huang, S., Wei, F., Zhou, M., Li, Z.: Tablebank: A benchmark dataset for table detection and recognition (2019)
8. Livathinos, N., Berrospi, C., Lysak, M., Kuropiatnyk, V., Nassar, A., Carvalho, A., Dolfi, M., Auer, C., Dinkla, K., Staar, P.: Robust pdf document conversion using recurrent neural networks. Proceedings of the AAAI Conference on Artificial Intelligence 35 (17), 15137-15145 (May 2021), https://ojs.aaai.org/index.php/AAAI/article/view/17777
9. Nassar, A., Livathinos, N., Lysak, M., Staar, P.: Tableformer: Table structure understanding with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4614-4623 (June 2022)
10. Pfitzmann, B., Auer, C., Dolfi, M., Nassar, A.S., Staar, P.W.J.: Doclaynet: A large human-annotated dataset for document-layout segmentation. In: Zhang, A., Rangwala, H. (eds.) KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14 - 18, 2022. pp. 3743-3751. ACM (2022). https://doi.org/10.1145/3534678.3539043 , https://doi.org/10.1145/3534678.3539043
11. Prasad, D., Gadpal, A., Kapadni, K., Visave, M., Sultanpure, K.: Cascadetabnet: An approach for end to end table detection and structure recognition from image-based documents. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops. pp. 572-573 (2020)
12. Schreiber, S., Agne, S., Wolf, I., Dengel, A., Ahmed, S.: Deepdesrt: Deep learning for detection and structure recognition of tables in document images. In: 2017 14th IAPR international conference on document analysis and recognition (ICDAR). vol. 1, pp. 1162-1167. IEEE (2017)
13. Siddiqui, S.A., Fateh, I.A., Rizvi, S.T.R., Dengel, A., Ahmed, S.: Deeptabstr: Deep learning based table structure recognition. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 1403-1409 (2019). https://doi.org/10.1109/ICDAR.2019.00226
14. Smock, B., Pesala, R., Abraham, R.: PubTables-1M: Towards comprehensive table extraction from unstructured documents. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4634-4642 (June 2022)
15. Staar, P.W.J., Dolfi, M., Auer, C., Bekas, C.: Corpus conversion service: A machine learning platform to ingest documents at scale. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 774-782. KDD '18, Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3219819.3219834 , https://doi.org/10.1145/3219819.3219834
16. Wang, X.: Tabular Abstraction, Editing, and Formatting. Ph.D. thesis, CAN (1996), aAINN09397
17. Xue, W., Li, Q., Tao, D.: Res2tim: Reconstruct syntactic structures from table images. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 749-755. IEEE (2019)

18. Xue, W., Yu, B., Wang, W., Tao, D., Li, Q.: Tgrnet: A table graph reconstruction network for table structure recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1295-1304 (2021)
19. Ye, J., Qi, X., He, Y., Chen, Y., Gu, D., Gao, P., Xiao, R.: Pingan-vcgroup's solution for icdar 2021 competition on scientific literature parsing task b: Table recognition to html (2021). https://doi.org/10.48550/ARXIV.2105.01848 , https://arxiv.org/abs/2105.01848
20. Zhang, Z., Zhang, J., Du, J., Wang, F.: Split, embed and merge: An accurate table structure recognizer. Pattern Recognition 126 , 108565 (2022)
21. Zheng, X., Burdick, D., Popa, L., Zhong, X., Wang, N.X.R.: Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context. In: 2021 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 697-706 (2021). https://doi.org/10.1109/WACV48630.2021.00074
22. Zhong, X., ShafieiBavani, E., Jimeno Yepes, A.: Image-based table recognition: Data, model, and evaluation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision - ECCV 2020. pp. 564-580. Springer International Publishing, Cham (2020)
23. Zhong, X., Tang, J., Yepes, A.J.: Publaynet: largest dataset ever for document layout analysis. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 1015-1022. IEEE (2019)
@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "amt_handbook_sample",
"origin": {
"mimetype": "application/pdf",

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "code_and_formula",
"origin": {
"mimetype": "application/pdf",

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "csv-comma-in-cell",
"origin": {
"mimetype": "text/csv",

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "csv-comma",
"origin": {
"mimetype": "text/csv",

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "csv-inconsistent-header",
"origin": {
"mimetype": "text/csv",

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "csv-pipe",
"origin": {
"mimetype": "text/csv",

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "csv-semicolon",
"origin": {
"mimetype": "text/csv",

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "csv-tab",
"origin": {
"mimetype": "text/csv",

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "csv-too-few-columns",
"origin": {
"mimetype": "text/csv",

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "csv-too-many-columns",
"origin": {
"mimetype": "text/csv",

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "equations",
"origin": {
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@ -7,6 +7,9 @@ item-0 at level 0: unspecified: group _root_
item-6 at level 3: list: group list
item-7 at level 4: list_item: First item in unordered list
item-8 at level 4: list_item: Second item in unordered list
item-9 at level 3: ordered_list: group ordered list
item-9 at level 3: list: group ordered list
item-10 at level 4: list_item: First item in ordered list
item-11 at level 4: list_item: Second item in ordered list
item-12 at level 3: list: group ordered list start 42
item-13 at level 4: list_item: First item in ordered list with start
item-14 at level 4: list_item: Second item in ordered list with start
@ -1,10 +1,10 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "example_01",
"origin": {
"mimetype": "text/html",
"binary_hash": 13782069548509991617,
"binary_hash": 13726679883013609282,
"filename": "example_01.html"
},
"furniture": {
@ -58,7 +58,24 @@
],
"content_layer": "body",
"name": "ordered list",
"label": "ordered_list"
"label": "list"
},
{
"self_ref": "#/groups/2",
"parent": {
"$ref": "#/texts/2"
},
"children": [
{
"$ref": "#/texts/8"
},
{
"$ref": "#/texts/9"
}
],
"content_layer": "body",
"name": "ordered list start 42",
"label": "list"
}
],
"texts": [
@ -110,6 +127,9 @@
},
{
"$ref": "#/groups/1"
},
{
"$ref": "#/groups/2"
}
],
"content_layer": "body",
@ -143,7 +163,7 @@
"orig": "First item in unordered list",
"text": "First item in unordered list",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/5",
@ -157,7 +177,7 @@
"orig": "Second item in unordered list",
"text": "Second item in unordered list",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/6",
@ -171,7 +191,7 @@
"orig": "First item in ordered list",
"text": "First item in ordered list",
"enumerated": true,
"marker": "1."
"marker": ""
},
{
"self_ref": "#/texts/7",
@ -185,7 +205,35 @@
"orig": "Second item in ordered list",
"text": "Second item in ordered list",
"enumerated": true,
"marker": "2."
"marker": ""
},
{
"self_ref": "#/texts/8",
"parent": {
"$ref": "#/groups/2"
},
"children": [],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "First item in ordered list with start",
"text": "First item in ordered list with start",
"enumerated": true,
"marker": "42."
},
{
"self_ref": "#/texts/9",
"parent": {
"$ref": "#/groups/2"
},
"children": [],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "Second item in ordered list with start",
"text": "Second item in ordered list with start",
"enumerated": true,
"marker": "43."
}
],
"pictures": [
@ -13,3 +13,6 @@ Some background information here.

1. First item in ordered list
2. Second item in ordered list

42. First item in ordered list with start
43. Second item in ordered list with start
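The updated ground truth above reflects that list items now keep their concrete markers (including the non-default start value 42) in the DoclingDocument JSON, while items that use the default marker of their list group store an empty string. A small illustrative reader over such a JSON file (plain Python; not the docling-core serializer):

```python
import json

def list_item_markers(doc_json_path):
    """Yield (marker, text) for every list_item in a DoclingDocument JSON file.
    An empty marker means the item uses the default marker of its list group."""
    with open(doc_json_path, encoding="utf-8") as fh:
        doc = json.load(fh)
    for item in doc.get("texts", []):
        if item.get("label") == "list_item":
            yield item.get("marker", ""), item["text"]

# for marker, text in list_item_markers("example_01.json"):
#     print(marker or "<default>", text)
# -> "42." "First item in ordered list with start", "<default>" "First item in unordered list", ...
```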
@ -6,6 +6,6 @@ item-0 at level 0: unspecified: group _root_
item-5 at level 3: list: group list
item-6 at level 4: list_item: First item in unordered list
item-7 at level 4: list_item: Second item in unordered list
item-8 at level 3: ordered_list: group ordered list
item-8 at level 3: list: group ordered list
item-9 at level 4: list_item: First item in ordered list
item-10 at level 4: list_item: Second item in ordered list
@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "example_02",
"origin": {
"mimetype": "text/html",
@ -58,7 +58,7 @@
],
"content_layer": "body",
"name": "ordered list",
"label": "ordered_list"
"label": "list"
}
],
"texts": [
@ -140,7 +140,7 @@
"orig": "First item in unordered list",
"text": "First item in unordered list",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/5",
@ -154,7 +154,7 @@
"orig": "Second item in unordered list",
"text": "Second item in unordered list",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/6",
@ -168,7 +168,7 @@
"orig": "First item in ordered list",
"text": "First item in ordered list",
"enumerated": true,
"marker": "1."
"marker": ""
},
{
"self_ref": "#/texts/7",
@ -182,7 +182,7 @@
"orig": "Second item in ordered list",
"text": "Second item in ordered list",
"enumerated": true,
"marker": "2."
"marker": ""
}
],
"pictures": [],
@ -10,9 +10,9 @@ item-0 at level 0: unspecified: group _root_
item-9 at level 6: list_item: Nested item 1
item-10 at level 6: list_item: Nested item 2
item-11 at level 4: list_item: Second item in unordered list
item-12 at level 3: ordered_list: group ordered list
item-12 at level 3: list: group ordered list
item-13 at level 4: list_item: First item in ordered list
item-14 at level 5: ordered_list: group ordered list
item-14 at level 5: list: group ordered list
item-15 at level 6: list_item: Nested ordered item 1
item-16 at level 6: list_item: Nested ordered item 2
item-17 at level 4: list_item: Second item in ordered list
@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "example_03",
"origin": {
"mimetype": "text/html",
@ -75,7 +75,7 @@
],
"content_layer": "body",
"name": "ordered list",
"label": "ordered_list"
"label": "list"
},
{
"self_ref": "#/groups/3",
@ -92,7 +92,7 @@
],
"content_layer": "body",
"name": "ordered list",
"label": "ordered_list"
"label": "list"
}
],
"texts": [
@ -198,7 +198,7 @@
"orig": "First item in unordered list",
"text": "First item in unordered list",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/6",
@ -212,7 +212,7 @@
"orig": "Nested item 1",
"text": "Nested item 1",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/7",
@ -226,7 +226,7 @@
"orig": "Nested item 2",
"text": "Nested item 2",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/8",
@ -240,7 +240,7 @@
"orig": "Second item in unordered list",
"text": "Second item in unordered list",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/9",
@ -258,7 +258,7 @@
"orig": "First item in ordered list",
"text": "First item in ordered list",
"enumerated": true,
"marker": "1"
"marker": ""
},
{
"self_ref": "#/texts/10",
@ -272,7 +272,7 @@
"orig": "Nested ordered item 1",
"text": "Nested ordered item 1",
"enumerated": true,
"marker": "1."
"marker": ""
},
{
"self_ref": "#/texts/11",
@ -286,7 +286,7 @@
"orig": "Nested ordered item 2",
"text": "Nested ordered item 2",
"enumerated": true,
"marker": "2."
"marker": ""
},
{
"self_ref": "#/texts/12",
@ -300,7 +300,7 @@
"orig": "Second item in ordered list",
"text": "Second item in ordered list",
"enumerated": true,
"marker": "2."
"marker": ""
},
{
"self_ref": "#/texts/13",
@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "example_04",
"origin": {
"mimetype": "text/html",

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "example_05",
"origin": {
"mimetype": "text/html",

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "example_06",
"origin": {
"mimetype": "text/html",

@ -1,6 +1,6 @@
{
"schema_name": "DoclingDocument",
"version": "1.4.0",
"version": "1.5.0",
"name": "example_07",
"origin": {
"mimetype": "text/html",
@ -169,7 +169,7 @@
"orig": "Asia",
"text": "Asia",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/1",
@ -183,7 +183,7 @@
"orig": "China",
"text": "China",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/2",
@ -197,7 +197,7 @@
"orig": "Japan",
"text": "Japan",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/3",
@ -211,7 +211,7 @@
"orig": "Thailand",
"text": "Thailand",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/4",
@ -229,7 +229,7 @@
"orig": "Europe",
"text": "Europe",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/5",
@ -243,7 +243,7 @@
"orig": "UK",
"text": "UK",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/6",
@ -257,7 +257,7 @@
"orig": "Germany",
"text": "Germany",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/7",
@ -275,7 +275,7 @@
"orig": "Switzerland",
"text": "Switzerland",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/8",
@ -289,7 +289,7 @@
"orig": "Bern",
"text": "Bern",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/9",
@ -303,7 +303,7 @@
"orig": "Aargau",
"text": "Aargau",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/10",
@ -321,7 +321,7 @@
"orig": "Italy",
"text": "Italy",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/11",
@ -335,7 +335,7 @@
"orig": "Piedmont",
"text": "Piedmont",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/12",
@ -349,7 +349,7 @@
"orig": "Liguria",
"text": "Liguria",
"enumerated": false,
"marker": "-"
"marker": ""
},
{
"self_ref": "#/texts/13",
@ -363,7 +363,7 @@
"orig": "Africa",
"text": "Africa",
"enumerated": false,
"marker": "-"
"marker": ""
}
],
"pictures": [],
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "example_08",
 "origin": {
 "mimetype": "text/html",
@@ -11,8 +11,22 @@ Create your feature branch: `git checkout -b feature/AmazingFeature` .
 3. Commit your changes ( `git commit -m 'Add some AmazingFeature'` )
 4. Push to the branch ( `git push origin feature/AmazingFeature` )
 5. Open a Pull Request
+6. **Whole list item has same formatting**
+7. List item has *mixed or partial* formatting

-## *Second* section
+# *Whole heading is italic*

 - **First** : Lorem ipsum.
 - **Second** : Dolor `sit` amet.
+
+Some *`formatted_code`*
+
+## *Partially formatted* heading to\_escape `not_to_escape`
+
+[$$E=mc^2$$](https://en.wikipedia.org/wiki/Albert_Einstein)
+
+## Table Heading
+
+| Bold Heading | Italic Heading |
+|----------------|------------------|
+| data a | data b |
@@ -5,8 +5,14 @@ body:
 - $ref: '#/groups/0'
 - $ref: '#/groups/1'
 - $ref: '#/groups/2'
-- $ref: '#/texts/27'
+- $ref: '#/texts/32'
 - $ref: '#/groups/8'
+- $ref: '#/groups/11'
+- $ref: '#/texts/43'
+- $ref: '#/texts/47'
+- $ref: '#/texts/48'
+- $ref: '#/groups/13'
+- $ref: '#/tables/0'
 content_layer: body
 label: unspecified
 name: _root_
@@ -47,8 +53,10 @@ groups:
 - $ref: '#/texts/18'
 - $ref: '#/texts/22'
 - $ref: '#/texts/26'
+- $ref: '#/texts/27'
+- $ref: '#/texts/28'
 content_layer: body
-label: ordered_list
+label: list
 name: list
 parent:
 $ref: '#/body'
@@ -94,53 +102,216 @@
 self_ref: '#/groups/6'
 - children:
-- $ref: '#/texts/28'
 - $ref: '#/texts/29'
+- $ref: '#/texts/30'
+- $ref: '#/texts/31'
 label: inline
 parent:
-$ref: '#/texts/27'
+$ref: '#/texts/28'
 self_ref: '#/groups/7'
 - children:
-- $ref: '#/texts/30'
 - $ref: '#/texts/33'
+- $ref: '#/texts/36'
 label: list
 self_ref: '#/groups/8'
-- children:
-- $ref: '#/texts/31'
-- $ref: '#/texts/32'
-[...]
-self_ref: '#/groups/9'
 - children:
 - $ref: '#/texts/34'
 - $ref: '#/texts/35'
-- $ref: '#/texts/36'
-- $ref: '#/texts/37'
 label: inline
 parent:
 $ref: '#/texts/33'
+self_ref: '#/groups/9'
+- children:
+- $ref: '#/texts/37'
+- $ref: '#/texts/38'
+- $ref: '#/texts/39'
+- $ref: '#/texts/40'
+[...]
+$ref: '#/texts/36'
 self_ref: '#/groups/10'
+- children:
+- $ref: '#/texts/41'
+- $ref: '#/texts/42'
+[...]
+self_ref: '#/groups/11'
+- children:
+- $ref: '#/texts/44'
+- $ref: '#/texts/45'
+- $ref: '#/texts/46'
+[...]
+$ref: '#/texts/43'
+self_ref: '#/groups/12'
+- children: []
+[...]
+self_ref: '#/groups/13'
 key_value_items: []
 name: inline_and_formatting
 origin:
-binary_hash: 9342273634728023910
+binary_hash: 14550011543526094526
 filename: inline_and_formatting.md
 mimetype: text/markdown
 pages: {}
 pictures: []
 schema_name: DoclingDocument
-tables: []
+tables:
+- annotations: []
+  [...]
+  data:
+    grid: [...]
+    num_cols: 2
+    num_rows: 2
+    table_cells: [...]
+  label: table
+  self_ref: '#/tables/0'
 texts:
 - children: []
 content_layer: body
@@ -259,7 +430,7 @@ texts:
 label: list_item
-marker: '-'
+marker: ''
@@ -305,7 +476,7 @@ texts:
 label: list_item
-marker: '-'
+marker: ''
@@ -348,7 +519,7 @@ texts:
 label: list_item
-marker: '-'
+marker: ''
@@ -391,7 +562,7 @@ texts:
 label: list_item
-marker: '-'
+marker: ''
@@ -433,24 +604,51 @@ texts:
-marker: '-'
+marker: ''
 orig: Open a Pull Request
 self_ref: '#/texts/26'
+- children: []
+  enumerated: true
+  formatting:
+    bold: true
+  label: list_item
+  marker: ''
+  orig: Whole list item has same formatting
+  self_ref: '#/texts/27'
+  text: Whole list item has same formatting
 - children:
 - $ref: '#/groups/7'
-label: section_header
-level: 1
+enumerated: true
+label: list_item
+marker: ''
 parent:
-$ref: '#/body'
+$ref: '#/groups/2'
-self_ref: '#/texts/27'
+self_ref: '#/texts/28'
 text: ''
+- children: []
+  label: text
+  orig: List item has
+  self_ref: '#/texts/29'
@@ -460,69 +658,84 @@ texts:
-orig: Second
+orig: mixed or partial
-self_ref: '#/texts/28'
+self_ref: '#/texts/30'
-orig: section
+orig: formatting
-self_ref: '#/texts/29'
+self_ref: '#/texts/31'
+- children: []
+  formatting:
+    italic: true
+  label: title
+  orig: Whole heading is italic
+  self_ref: '#/texts/32'
 - children:
 - $ref: '#/groups/9'
 label: list_item
-marker: '-'
+marker: ''
-[...]
 self_ref: '#/texts/33'
+- children: []
+  formatting:
+    bold: true
+  orig: First
+  self_ref: '#/texts/34'
+- children: []
+  orig: ': Lorem ipsum.'
+  self_ref: '#/texts/35'
+- children:
+  - $ref: '#/groups/10'
+  label: list_item
+  marker: ''
+  self_ref: '#/texts/36'
@@ -536,7 +749,7 @@ texts:
-self_ref: '#/texts/34'
+self_ref: '#/texts/37'
 text: Second
@@ -545,7 +758,7 @@ texts:
-self_ref: '#/texts/35'
+self_ref: '#/texts/38'
 text: ': Dolor'
@@ -558,7 +771,7 @@ texts:
-self_ref: '#/texts/36'
+self_ref: '#/texts/39'
 text: sit
@@ -567,6 +780,102 @@ texts:
-self_ref: '#/texts/37'
+self_ref: '#/texts/40'
 text: amet.
-version: 1.4.0
+- children: []
+  orig: Some
+  self_ref: '#/texts/41'
+- code_language: unknown
+  formatting:
+    italic: true
+  label: code
+  orig: formatted_code
+  self_ref: '#/texts/42'
+- children:
+  - $ref: '#/groups/12'
+  label: section_header
+  level: 1
+  self_ref: '#/texts/43'
+- formatting:
+    italic: true
+  label: text
+  orig: Partially formatted
+  self_ref: '#/texts/44'
+- label: text
+  orig: heading to_escape
+  self_ref: '#/texts/45'
+- code_language: unknown
+  label: code
+  orig: not_to_escape
+  self_ref: '#/texts/46'
+- hyperlink: https://en.wikipedia.org/wiki/Albert_Einstein
+  label: text
+  orig: $$E=mc^2$$
+  self_ref: '#/texts/47'
+- label: section_header
+  level: 1
+  orig: Table Heading
+  self_ref: '#/texts/48'
+version: 1.5.0
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "ipa20180000016.xml",
 "origin": {
 "mimetype": "application/xml",
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "ipa20200022300.xml",
 "origin": {
 "mimetype": "application/xml",
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "lorem_ipsum",
 "origin": {
 "mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "multi_page",
 "origin": {
 "mimetype": "application/pdf",
20 tests/data/groundtruth/docling_v2/multi_page.md vendored
@@ -52,11 +52,11 @@ In addition to general-purpose word processors, specialized tools have emerged t

 The evolution of word processors wasn't just about hardware or software improvements-it was about the features that revolutionized how people wrote and edited. Some of these transformative features include:

-- Undo/Redo : Introduced in the 1980s, the ability to undo mistakes and redo actions made experimentation and error correction much easier.
-- Spell Check and Grammar Check : By the 1990s, these became standard, allowing users to spot errors automatically.
-- Templates : Pre-designed formats for documents, such as resumes, letters, and invoices, helped users save time.
-- Track Changes : A game-changer for collaboration, this feature allowed multiple users to suggest edits while maintaining the original text.
-- Real-Time Collaboration : Tools like Google Docs and Microsoft 365 enabled multiple users to edit the same document simultaneously, forever changing teamwork dynamics.
+1. Undo/Redo : Introduced in the 1980s, the ability to undo mistakes and redo actions made experimentation and error correction much easier.
+2. Spell Check and Grammar Check : By the 1990s, these became standard, allowing users to spot errors automatically.
+3. Templates : Pre-designed formats for documents, such as resumes, letters, and invoices, helped users save time.
+4. Track Changes : A game-changer for collaboration, this feature allowed multiple users to suggest edits while maintaining the original text.
+5. Real-Time Collaboration : Tools like Google Docs and Microsoft 365 enabled multiple users to edit the same document simultaneously, forever changing teamwork dynamics.

 ## The Cultural Impact of Word Processors

@@ -70,11 +70,11 @@ The word processor didn't just change workplaces-it changed culture. It democrat

 As we move further into the 21st century, the role of the word processor continues to evolve:

-- Artificial Intelligence : Modern word processors are leveraging AI to suggest content improvements. Tools like Grammarly, ProWritingAid, and even native features in Word now analyze tone, conciseness, and clarity. Some AI systems can even generate entire paragraphs or rewrite sentences.
-- Integration with Other Tools : Word processors are no longer standalone. They integrate with task managers, cloud storage, and project management platforms. For instance, Google Docs syncs with Google Drive, while Microsoft Word integrates seamlessly with OneDrive and Teams.
-- Voice Typing : Speech-to-text capabilities have made word processing more accessible, particularly for those with disabilities. Tools like Dragon NaturallySpeaking and built-in options in Google Docs and Microsoft Word have made dictation mainstream.
-- Multimedia Documents : Word processing has expanded beyond text. Modern tools allow users to embed images, videos, charts, and interactive elements, transforming simple documents into rich multimedia experiences.
-- Cross-Platform Accessibility : Thanks to cloud computing, documents can now be accessed and edited across devices. Whether you're on a desktop, tablet, or smartphone, you can continue working seamlessly.
+1. Artificial Intelligence : Modern word processors are leveraging AI to suggest content improvements. Tools like Grammarly, ProWritingAid, and even native features in Word now analyze tone, conciseness, and clarity. Some AI systems can even generate entire paragraphs or rewrite sentences.
+2. Integration with Other Tools : Word processors are no longer standalone. They integrate with task managers, cloud storage, and project management platforms. For instance, Google Docs syncs with Google Drive, while Microsoft Word integrates seamlessly with OneDrive and Teams.
+3. Voice Typing : Speech-to-text capabilities have made word processing more accessible, particularly for those with disabilities. Tools like Dragon NaturallySpeaking and built-in options in Google Docs and Microsoft Word have made dictation mainstream.
+4. Multimedia Documents : Word processing has expanded beyond text. Modern tools allow users to embed images, videos, charts, and interactive elements, transforming simple documents into rich multimedia experiences.
+5. Cross-Platform Accessibility : Thanks to cloud computing, documents can now be accessed and edited across devices. Whether you're on a desktop, tablet, or smartphone, you can continue working seamlessly.

 ## A Glimpse Into the Future

@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "pa20010031492.xml",
 "origin": {
 "mimetype": "application/xml",
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "pftaps057006474.txt",
 "origin": {
 "mimetype": "text/plain",
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "pg06442728.xml",
 "origin": {
 "mimetype": "application/xml",
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "picture_classification",
 "origin": {
 "mimetype": "application/pdf",
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "powerpoint_bad_text",
 "origin": {
 "mimetype": "application/vnd.ms-powerpoint",
@@ -11,7 +11,7 @@ item-0 at level 0: unspecified: group _root_
 item-10 at level 2: paragraph: And baz things
 item-11 at level 2: paragraph: A rectangle shape with this text inside.
 item-12 at level 1: chapter: group slide-2
-item-13 at level 2: ordered_list: group list
+item-13 at level 2: list: group list
 item-14 at level 3: list_item: List item4
 item-15 at level 3: list_item: List item5
 item-16 at level 3: list_item: List item6
@@ -25,7 +25,7 @@ item-0 at level 0: unspecified: group _root_
 item-24 at level 3: list_item: Item A
 item-25 at level 3: list_item: Item B
 item-26 at level 2: paragraph: Maybe a list?
-item-27 at level 2: ordered_list: group list
+item-27 at level 2: list: group list
 item-28 at level 3: list_item: List1
 item-29 at level 3: list_item: List2
 item-30 at level 3: list_item: List3
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "powerpoint_sample",
 "origin": {
 "mimetype": "application/vnd.ms-powerpoint",
@@ -137,7 +137,7 @@
 "name": "list",
-"label": "ordered_list"
+"label": "list"
@@ -197,7 +197,7 @@
 "name": "list",
-"label": "ordered_list"
+"label": "list"
@@ -578,7 +578,7 @@
 "orig": "I1",
-"marker": "-"
+"marker": ""
@@ -607,7 +607,7 @@
 "orig": "I2",
-"marker": "-"
+"marker": ""
@@ -636,7 +636,7 @@
 "orig": "I3",
-"marker": "-"
+"marker": ""
@@ -665,7 +665,7 @@
 "orig": "I4",
-"marker": "-"
+"marker": ""
@@ -721,7 +721,7 @@
 "orig": "Item A",
-"marker": "-"
+"marker": ""
@@ -750,7 +750,7 @@
 "orig": "Item B",
-"marker": "-"
+"marker": ""
@@ -893,7 +893,7 @@
 "orig": "l1 ",
-"marker": "-"
+"marker": ""
@@ -922,7 +922,7 @@
 "orig": "l2",
-"marker": "-"
+"marker": ""
@@ -951,7 +951,7 @@
 "orig": "l3",
-"marker": "-"
+"marker": ""
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "powerpoint_with_image",
 "origin": {
 "mimetype": "application/vnd.ms-powerpoint",
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "redp5110_sampled",
 "origin": {
 "mimetype": "application/pdf",
@@ -1295,7 +1295,7 @@
-"marker": "-"
+"marker": ""
@@ -1326,7 +1326,7 @@
-"marker": "-"
+"marker": ""
@@ -1357,7 +1357,7 @@
-"marker": "-"
+"marker": ""
@@ -1388,7 +1388,7 @@
-"marker": "-"
+"marker": ""
@@ -1683,7 +1683,7 @@
-"marker": "-"
+"marker": ""
@@ -1714,7 +1714,7 @@
-"marker": "-"
+"marker": ""
@@ -1745,7 +1745,7 @@
-"marker": "-"
+"marker": ""
@@ -1776,7 +1776,7 @@
-"marker": "-"
+"marker": ""
@@ -1807,7 +1807,7 @@
-"marker": "-"
+"marker": ""
@@ -1838,7 +1838,7 @@
-"marker": "-"
+"marker": ""
@@ -1869,7 +1869,7 @@
-"marker": "-"
+"marker": ""
@@ -1900,7 +1900,7 @@
-"marker": "-"
+"marker": ""
@@ -1931,7 +1931,7 @@
-"marker": "-"
+"marker": ""
@@ -2400,7 +2400,7 @@
-"marker": "-"
+"marker": ""
@@ -2431,7 +2431,7 @@
-"marker": "-"
+"marker": ""
@@ -2462,7 +2462,7 @@
-"marker": "-"
+"marker": ""
@@ -2668,7 +2668,7 @@
-"marker": "-"
+"marker": ""
@@ -2699,7 +2699,7 @@
-"marker": "-"
+"marker": ""
@@ -2759,7 +2759,7 @@
-"marker": "-"
+"marker": ""
@@ -3344,7 +3344,7 @@
-"marker": "-"
+"marker": ""
@@ -3375,7 +3375,7 @@
-"marker": "-"
+"marker": ""
@@ -3406,7 +3406,7 @@
-"marker": "-"
+"marker": ""
@@ -5992,7 +5992,7 @@
-"marker": "-"
+"marker": ""
@@ -6023,7 +6023,7 @@
-"marker": "-"
+"marker": ""
@@ -6054,7 +6054,7 @@
-"marker": "-"
+"marker": ""
@@ -6085,7 +6085,7 @@
-"marker": "-"
+"marker": ""
@@ -6116,7 +6116,7 @@
-"marker": "-"
+"marker": ""
@@ -7095,7 +7095,7 @@
-"marker": "-"
+"marker": ""
@@ -7126,7 +7126,7 @@
-"marker": "-"
+"marker": ""
@@ -7157,7 +7157,7 @@
-"marker": "-"
+"marker": ""
@@ -7188,7 +7188,7 @@
-"marker": "-"
+"marker": ""
@@ -7219,7 +7219,7 @@
-"marker": "-"
+"marker": ""
@@ -7559,7 +7559,7 @@
-"marker": "-"
+"marker": ""
@@ -7590,7 +7590,7 @@
-"marker": "-"
+"marker": ""
@@ -318,9 +318,9 @@ If a special register value is in the list of user profiles or it is a member of

 Here is an example of using the VERIFY\_GROUP\_FOR\_USER function:

-- There are user profiles for MGR, JANE, JUDY, and TONY.
-- The user profile JANE specifies a group profile of MGR.
-- If a user is connected to the server using user profile JANE, all of the following function invocations return a value of 1:
+1. There are user profiles for MGR, JANE, JUDY, and TONY.
+2. The user profile JANE specifies a group profile of MGR.
+3. If a user is connected to the server using user profile JANE, all of the following function invocations return a value of 1:

 ```
 VERIFY_GROUP_FOR_USER (CURRENT_USER, 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR', 'STEVE') The following function invocation returns a value of 0: VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JUDY', 'TONY')
@@ -334,7 +334,7 @@ CASE
 WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR', 'EMP' ) = 1 THEN EMPLOYEES . DATE_OF_BIRTH WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER = EMPLOYEES . USER_ID THEN EMPLOYEES . DATE_OF_BIRTH WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER <> EMPLOYEES . USER_ID THEN ( 9999 || '-' || MONTH ( EMPLOYEES . DATE_OF_BIRTH ) || '-' || DAY (EMPLOYEES.DATE_OF_BIRTH )) ELSE NULL END ENABLE ;
 ```

-- The other column to mask in this example is the TAX\_ID information. In this example, the rules to enforce include the following ones:
+2. The other column to mask in this example is the TAX\_ID information. In this example, the rules to enforce include the following ones:
 - -Human Resources can see the unmasked TAX\_ID of the employees.
 - -Employees can see only their own unmasked TAX\_ID.
 - -Managers see a masked version of TAX\_ID with the first five characters replaced with the X character (for example, XXX-XX-1234).
@@ -347,7 +347,7 @@ CREATE MASK HR_SCHEMA.MASK_TAX_ID_ON_EMPLOYEES ON HR_SCHEMA.EMPLOYEES AS EMPLOYE

 Example 3-9 Creating a mask on the TAX\_ID column

-- Figure 3-10 shows the masks that are created in the HR\_SCHEMA.
+3. Figure 3-10 shows the masks that are created in the HR\_SCHEMA.

 Figure 3-10 Column masks shown in System i Navigator

@@ -357,7 +357,7 @@ Figure 3-10 Column masks shown in System i Navigator

 Now that you have created the row permission and the two column masks, RCAC must be activated. The row permission and the two column masks are enabled (last clause in the scripts), but now you must activate RCAC on the table. To do so, complete the following steps:

-- Run the SQL statements that are shown in Example 3-10.
+1. Run the SQL statements that are shown in Example 3-10.

 ## Example 3-10 Activating RCAC on the EMPLOYEES table

@@ -372,14 +372,14 @@ ACTIVATE ROW ACCESS CONTROL

 ACTIVATE COLUMN ACCESS CONTROL;

-- Look at the definition of the EMPLOYEE table, as shown in Figure 3-11. To do this, from the main navigation pane of System i Navigator, click Schemas HR\_SCHEMA Tables , right-click the EMPLOYEES table, and click Definition .
+2. Look at the definition of the EMPLOYEE table, as shown in Figure 3-11. To do this, from the main navigation pane of System i Navigator, click Schemas HR\_SCHEMA Tables , right-click the EMPLOYEES table, and click Definition .

 Figure 3-11 Selecting the EMPLOYEES table from System i Navigator

 <!-- image -->

-- Figure 4-68 shows the Visual Explain of the same SQL statement, but with RCAC enabled. It is clear that the implementation of the SQL statement is more complex because the row permission rule becomes part of the WHERE clause.
-- Compare the advised indexes that are provided by the Optimizer without RCAC and with RCAC enabled. Figure 4-69 shows the index advice for the SQL statement without RCAC enabled. The index being advised is for the ORDER BY clause.
+2. Figure 4-68 shows the Visual Explain of the same SQL statement, but with RCAC enabled. It is clear that the implementation of the SQL statement is more complex because the row permission rule becomes part of the WHERE clause.
+3. Compare the advised indexes that are provided by the Optimizer without RCAC and with RCAC enabled. Figure 4-69 shows the index advice for the SQL statement without RCAC enabled. The index being advised is for the ORDER BY clause.

 Figure 4-68 Visual Explain with RCAC enabled

@ -1,6 +1,6 @@
|
|||||||
{
|
{
|
||||||
"schema_name": "DoclingDocument",
|
"schema_name": "DoclingDocument",
|
||||||
"version": "1.4.0",
|
"version": "1.5.0",
|
||||||
"name": "right_to_left_01",
|
"name": "right_to_left_01",
|
||||||
"origin": {
|
"origin": {
|
||||||
"mimetype": "application/pdf",
|
"mimetype": "application/pdf",
|
||||||
|
@ -1,6 +1,6 @@
|
|||||||
{
|
{
|
||||||
"schema_name": "DoclingDocument",
|
"schema_name": "DoclingDocument",
|
||||||
"version": "1.4.0",
|
"version": "1.5.0",
|
||||||
"name": "right_to_left_02",
|
"name": "right_to_left_02",
|
||||||
"origin": {
|
"origin": {
|
||||||
"mimetype": "application/pdf",
|
"mimetype": "application/pdf",
|
||||||
|
@ -1,6 +1,6 @@
|
|||||||
{
|
{
|
||||||
"schema_name": "DoclingDocument",
|
"schema_name": "DoclingDocument",
|
||||||
"version": "1.4.0",
|
"version": "1.5.0",
|
||||||
"name": "right_to_left_03",
|
"name": "right_to_left_03",
|
||||||
"origin": {
|
"origin": {
|
||||||
"mimetype": "application/pdf",
|
"mimetype": "application/pdf",
|
||||||
|
@ -1,6 +1,6 @@
|
|||||||
{
|
{
|
||||||
"schema_name": "DoclingDocument",
|
"schema_name": "DoclingDocument",
|
||||||
"version": "1.4.0",
|
"version": "1.5.0",
|
||||||
"name": "sample_sales_data",
|
"name": "sample_sales_data",
|
||||||
"origin": {
|
"origin": {
|
||||||
"mimetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
|
"mimetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
|
||||||
|
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "tablecell",
 "origin": {
 "mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@@ -82,7 +82,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/1",
@@ -103,7 +103,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/2",
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "test-01",
 "origin": {
 "mimetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "test_emf_docx",
 "origin": {
 "mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@@ -29,64 +29,62 @@ item-0 at level 0: unspecified: group _root_
 item-24 at level 3: list_item: A report must also be submitted ... d Infectious Disease Reporting System.
 item-25 at level 2: paragraph:
 item-26 at level 1: list: group list
-item-27 at level 2: list_item:
+item-27 at level 1: paragraph:
 item-28 at level 1: paragraph:
 item-29 at level 1: paragraph:
 item-30 at level 1: paragraph:
 item-31 at level 1: paragraph:
-item-32 at level 1: paragraph:
+item-32 at level 1: section: group textbox
-item-33 at level 1: section: group textbox
+item-33 at level 2: paragraph: Health Bureau:
-item-34 at level 2: paragraph: Health Bureau:
+item-34 at level 2: paragraph: Upon receiving a report from the ... rt to the Centers for Disease Control.
-item-35 at level 2: paragraph: Upon receiving a report from the ... rt to the Centers for Disease Control.
+item-35 at level 2: list: group list
-item-36 at level 2: list: group list
+item-36 at level 3: list_item: If necessary, provide health edu ... vidual to undergo specimen collection.
-item-37 at level 3: list_item: If necessary, provide health edu ... vidual to undergo specimen collection.
+item-37 at level 3: list_item: Implement appropriate epidemic p ... the Communicable Disease Control Act.
-item-38 at level 3: list_item: Implement appropriate epidemic p ... the Communicable Disease Control Act.
+item-38 at level 2: paragraph:
-item-39 at level 2: paragraph:
+item-39 at level 1: list: group list
-item-40 at level 1: list: group list
+item-40 at level 1: paragraph:
-item-41 at level 2: list_item:
+item-41 at level 1: section: group textbox
-item-42 at level 1: paragraph:
+item-42 at level 2: paragraph: Department of Education:
-item-43 at level 1: section: group textbox
-item-44 at level 2: paragraph: Department of Education:
 Collabo ... vention measures at all school levels.
+item-43 at level 1: paragraph:
+item-44 at level 1: paragraph:
 item-45 at level 1: paragraph:
 item-46 at level 1: paragraph:
 item-47 at level 1: paragraph:
 item-48 at level 1: paragraph:
 item-49 at level 1: paragraph:
-item-50 at level 1: paragraph:
+item-50 at level 1: section: group textbox
-item-51 at level 1: paragraph:
+item-51 at level 2: inline: group group
-item-52 at level 1: section: group textbox
+item-52 at level 3: paragraph: The Health Bureau will handle
-item-53 at level 2: inline: group group
+item-53 at level 3: paragraph: reporting and specimen collection
-item-54 at level 3: paragraph: The Health Bureau will handle
+item-54 at level 3: paragraph: .
-item-55 at level 3: paragraph: reporting and specimen collection
+item-55 at level 2: paragraph:
-item-56 at level 3: paragraph: .
+item-56 at level 1: paragraph:
-item-57 at level 2: paragraph:
+item-57 at level 1: paragraph:
 item-58 at level 1: paragraph:
-item-59 at level 1: paragraph:
+item-59 at level 1: section: group textbox
-item-60 at level 1: paragraph:
+item-60 at level 2: paragraph: Whether the epidemic has eased.
-item-61 at level 1: section: group textbox
+item-61 at level 2: paragraph:
-item-62 at level 2: paragraph: Whether the epidemic has eased.
+item-62 at level 1: paragraph:
-item-63 at level 2: paragraph:
+item-63 at level 1: section: group textbox
-item-64 at level 1: paragraph:
+item-64 at level 2: paragraph: Whether the test results are pos ... legally designated infectious disease.
-item-65 at level 1: section: group textbox
+item-65 at level 2: paragraph: No
-item-66 at level 2: paragraph: Whether the test results are pos ... legally designated infectious disease.
+item-66 at level 1: paragraph:
-item-67 at level 2: paragraph: No
+item-67 at level 1: paragraph:
-item-68 at level 1: paragraph:
+item-68 at level 1: section: group textbox
-item-69 at level 1: paragraph:
+item-69 at level 2: paragraph: Yes
-item-70 at level 1: section: group textbox
+item-70 at level 1: paragraph:
-item-71 at level 2: paragraph: Yes
+item-71 at level 1: section: group textbox
-item-72 at level 1: paragraph:
+item-72 at level 2: paragraph: Yes
-item-73 at level 1: section: group textbox
+item-73 at level 1: paragraph:
-item-74 at level 2: paragraph: Yes
+item-74 at level 1: paragraph:
-item-75 at level 1: paragraph:
+item-75 at level 1: section: group textbox
-item-76 at level 1: paragraph:
+item-76 at level 2: paragraph: Case closed.
-item-77 at level 1: section: group textbox
+item-77 at level 2: paragraph:
-item-78 at level 2: paragraph: Case closed.
+item-78 at level 2: paragraph: The Health Bureau will carry out ... ters for Disease Control if necessary.
-item-79 at level 2: paragraph:
+item-79 at level 1: paragraph:
-item-80 at level 2: paragraph: The Health Bureau will carry out ... ters for Disease Control if necessary.
+item-80 at level 1: section: group textbox
-item-81 at level 1: paragraph:
+item-81 at level 2: paragraph: No
-item-82 at level 1: section: group textbox
+item-82 at level 1: paragraph:
-item-83 at level 2: paragraph: No
+item-83 at level 1: paragraph:
 item-84 at level 1: paragraph:
-item-85 at level 1: paragraph:
-item-86 at level 1: paragraph:
434 tests/data/groundtruth/docling_v2/textbox.docx.json vendored
@ -1,6 +1,6 @@
|
|||||||
{
|
{
|
||||||
"schema_name": "DoclingDocument",
|
"schema_name": "DoclingDocument",
|
||||||
"version": "1.4.0",
|
"version": "1.5.0",
|
||||||
"name": "textbox",
|
"name": "textbox",
|
||||||
"origin": {
|
"origin": {
|
||||||
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||||
@ -65,6 +65,9 @@
|
|||||||
{
|
{
|
||||||
"$ref": "#/groups/6"
|
"$ref": "#/groups/6"
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/19"
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/20"
|
"$ref": "#/texts/20"
|
||||||
},
|
},
|
||||||
@ -77,9 +80,6 @@
|
|||||||
{
|
{
|
||||||
"$ref": "#/texts/23"
|
"$ref": "#/texts/23"
|
||||||
},
|
},
|
||||||
{
|
|
||||||
"$ref": "#/texts/24"
|
|
||||||
},
|
|
||||||
{
|
{
|
||||||
"$ref": "#/groups/7"
|
"$ref": "#/groups/7"
|
||||||
},
|
},
|
||||||
@ -87,11 +87,17 @@
|
|||||||
"$ref": "#/groups/9"
|
"$ref": "#/groups/9"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/31"
|
"$ref": "#/texts/29"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/groups/10"
|
"$ref": "#/groups/10"
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/31"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/32"
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/33"
|
"$ref": "#/texts/33"
|
||||||
},
|
},
|
||||||
@ -107,71 +113,65 @@
|
|||||||
{
|
{
|
||||||
"$ref": "#/texts/37"
|
"$ref": "#/texts/37"
|
||||||
},
|
},
|
||||||
{
|
|
||||||
"$ref": "#/texts/38"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"$ref": "#/texts/39"
|
|
||||||
},
|
|
||||||
{
|
{
|
||||||
"$ref": "#/groups/11"
|
"$ref": "#/groups/11"
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/42"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/43"
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/44"
|
"$ref": "#/texts/44"
|
||||||
},
|
},
|
||||||
{
|
|
||||||
"$ref": "#/texts/45"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"$ref": "#/texts/46"
|
|
||||||
},
|
|
||||||
{
|
{
|
||||||
"$ref": "#/groups/13"
|
"$ref": "#/groups/13"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/49"
|
"$ref": "#/texts/47"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/groups/14"
|
"$ref": "#/groups/14"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/52"
|
"$ref": "#/texts/50"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/53"
|
"$ref": "#/texts/51"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/groups/15"
|
"$ref": "#/groups/15"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/55"
|
"$ref": "#/texts/53"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/groups/16"
|
"$ref": "#/groups/16"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/57"
|
"$ref": "#/texts/55"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/58"
|
"$ref": "#/texts/56"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/groups/17"
|
"$ref": "#/groups/17"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/62"
|
"$ref": "#/texts/60"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/groups/18"
|
"$ref": "#/groups/18"
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/62"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/63"
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/64"
|
"$ref": "#/texts/64"
|
||||||
},
|
|
||||||
{
|
|
||||||
"$ref": "#/texts/65"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"$ref": "#/texts/66"
|
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"content_layer": "body",
|
"content_layer": "body",
|
||||||
@ -280,11 +280,7 @@
|
|||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/body"
|
"$ref": "#/body"
|
||||||
},
|
},
|
||||||
"children": [
|
"children": [],
|
||||||
{
|
|
||||||
"$ref": "#/texts/19"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"content_layer": "body",
|
"content_layer": "body",
|
||||||
"name": "list",
|
"name": "list",
|
||||||
"label": "list"
|
"label": "list"
|
||||||
@ -296,16 +292,16 @@
|
|||||||
},
|
},
|
||||||
"children": [
|
"children": [
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/25"
|
"$ref": "#/texts/24"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/26"
|
"$ref": "#/texts/25"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/groups/8"
|
"$ref": "#/groups/8"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/29"
|
"$ref": "#/texts/28"
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"content_layer": "body",
|
"content_layer": "body",
|
||||||
@ -319,10 +315,10 @@
|
|||||||
},
|
},
|
||||||
"children": [
|
"children": [
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/27"
|
"$ref": "#/texts/26"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/28"
|
"$ref": "#/texts/27"
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"content_layer": "body",
|
"content_layer": "body",
|
||||||
@ -334,11 +330,7 @@
|
|||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/body"
|
"$ref": "#/body"
|
||||||
},
|
},
|
||||||
"children": [
|
"children": [],
|
||||||
{
|
|
||||||
"$ref": "#/texts/30"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"content_layer": "body",
|
"content_layer": "body",
|
||||||
"name": "list",
|
"name": "list",
|
||||||
"label": "list"
|
"label": "list"
|
||||||
@ -350,7 +342,7 @@
|
|||||||
},
|
},
|
||||||
"children": [
|
"children": [
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/32"
|
"$ref": "#/texts/30"
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"content_layer": "body",
|
"content_layer": "body",
|
||||||
@ -367,7 +359,7 @@
|
|||||||
"$ref": "#/groups/12"
|
"$ref": "#/groups/12"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/43"
|
"$ref": "#/texts/41"
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"content_layer": "body",
|
"content_layer": "body",
|
||||||
@ -380,14 +372,14 @@
|
|||||||
"$ref": "#/groups/11"
|
"$ref": "#/groups/11"
|
||||||
},
|
},
|
||||||
"children": [
|
"children": [
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/38"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/39"
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/40"
|
"$ref": "#/texts/40"
|
||||||
},
|
|
||||||
{
|
|
||||||
"$ref": "#/texts/41"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"$ref": "#/texts/42"
|
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"content_layer": "body",
|
"content_layer": "body",
|
||||||
@ -401,10 +393,10 @@
|
|||||||
},
|
},
|
||||||
"children": [
|
"children": [
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/47"
|
"$ref": "#/texts/45"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/48"
|
"$ref": "#/texts/46"
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"content_layer": "body",
|
"content_layer": "body",
|
||||||
@ -418,10 +410,10 @@
|
|||||||
},
|
},
|
||||||
"children": [
|
"children": [
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/50"
|
"$ref": "#/texts/48"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/51"
|
"$ref": "#/texts/49"
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"content_layer": "body",
|
"content_layer": "body",
|
||||||
@ -435,7 +427,7 @@
|
|||||||
},
|
},
|
||||||
"children": [
|
"children": [
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/54"
|
"$ref": "#/texts/52"
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"content_layer": "body",
|
"content_layer": "body",
|
||||||
@ -449,7 +441,7 @@
|
|||||||
},
|
},
|
||||||
"children": [
|
"children": [
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/56"
|
"$ref": "#/texts/54"
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"content_layer": "body",
|
"content_layer": "body",
|
||||||
@ -462,14 +454,14 @@
|
|||||||
"$ref": "#/body"
|
"$ref": "#/body"
|
||||||
},
|
},
|
||||||
"children": [
|
"children": [
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/57"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/58"
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/59"
|
"$ref": "#/texts/59"
|
||||||
},
|
|
||||||
{
|
|
||||||
"$ref": "#/texts/60"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"$ref": "#/texts/61"
|
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"content_layer": "body",
|
"content_layer": "body",
|
||||||
@ -483,7 +475,7 @@
|
|||||||
},
|
},
|
||||||
"children": [
|
"children": [
|
||||||
{
|
{
|
||||||
"$ref": "#/texts/63"
|
"$ref": "#/texts/61"
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"content_layer": "body",
|
"content_layer": "body",
|
||||||
@ -592,7 +584,7 @@
|
|||||||
"script": "baseline"
|
"script": "baseline"
|
||||||
},
|
},
|
||||||
"enumerated": false,
|
"enumerated": false,
|
||||||
"marker": "-"
|
"marker": ""
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/6",
|
"self_ref": "#/texts/6",
|
||||||
@ -747,7 +739,7 @@
|
|||||||
"script": "baseline"
|
"script": "baseline"
|
||||||
},
|
},
|
||||||
"enumerated": false,
|
"enumerated": false,
|
||||||
"marker": "-"
|
"marker": ""
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/17",
|
"self_ref": "#/texts/17",
|
||||||
@ -768,7 +760,7 @@
|
|||||||
"script": "baseline"
|
"script": "baseline"
|
||||||
},
|
},
|
||||||
"enumerated": false,
|
"enumerated": false,
|
||||||
"marker": "-"
|
"marker": ""
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/18",
|
"self_ref": "#/texts/18",
|
||||||
@ -785,16 +777,14 @@
|
|||||||
{
|
{
|
||||||
"self_ref": "#/texts/19",
|
"self_ref": "#/texts/19",
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/6"
|
"$ref": "#/body"
|
||||||
},
|
},
|
||||||
"children": [],
|
"children": [],
|
||||||
"content_layer": "body",
|
"content_layer": "body",
|
||||||
"label": "list_item",
|
"label": "paragraph",
|
||||||
"prov": [],
|
"prov": [],
|
||||||
"orig": "",
|
"orig": "",
|
||||||
"text": "",
|
"text": ""
|
||||||
"enumerated": false,
|
|
||||||
"marker": "-"
|
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/20",
|
"self_ref": "#/texts/20",
|
||||||
@ -846,18 +836,6 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/24",
|
"self_ref": "#/texts/24",
|
||||||
"parent": {
|
|
||||||
"$ref": "#/body"
|
|
||||||
},
|
|
||||||
"children": [],
|
|
||||||
"content_layer": "body",
|
|
||||||
"label": "paragraph",
|
|
||||||
"prov": [],
|
|
||||||
"orig": "",
|
|
||||||
"text": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"self_ref": "#/texts/25",
|
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/7"
|
"$ref": "#/groups/7"
|
||||||
},
|
},
|
||||||
@ -876,7 +854,7 @@
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/26",
|
"self_ref": "#/texts/25",
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/7"
|
"$ref": "#/groups/7"
|
||||||
},
|
},
|
||||||
@ -895,7 +873,7 @@
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/27",
|
"self_ref": "#/texts/26",
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/8"
|
"$ref": "#/groups/8"
|
||||||
},
|
},
|
||||||
@ -913,10 +891,10 @@
|
|||||||
"script": "baseline"
|
"script": "baseline"
|
||||||
},
|
},
|
||||||
"enumerated": false,
|
"enumerated": false,
|
||||||
"marker": "-"
|
"marker": ""
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/28",
|
"self_ref": "#/texts/27",
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/8"
|
"$ref": "#/groups/8"
|
||||||
},
|
},
|
||||||
@ -934,10 +912,10 @@
|
|||||||
"script": "baseline"
|
"script": "baseline"
|
||||||
},
|
},
|
||||||
"enumerated": false,
|
"enumerated": false,
|
||||||
"marker": "-"
|
"marker": ""
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/29",
|
"self_ref": "#/texts/28",
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/7"
|
"$ref": "#/groups/7"
|
||||||
},
|
},
|
||||||
@ -949,21 +927,7 @@
|
|||||||
"text": ""
|
"text": ""
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/30",
|
"self_ref": "#/texts/29",
|
||||||
"parent": {
|
|
||||||
"$ref": "#/groups/9"
|
|
||||||
},
|
|
||||||
"children": [],
|
|
||||||
"content_layer": "body",
|
|
||||||
"label": "list_item",
|
|
||||||
"prov": [],
|
|
||||||
"orig": "",
|
|
||||||
"text": "",
|
|
||||||
"enumerated": false,
|
|
||||||
"marker": "-"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"self_ref": "#/texts/31",
|
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/body"
|
"$ref": "#/body"
|
||||||
},
|
},
|
||||||
@ -975,7 +939,7 @@
|
|||||||
"text": ""
|
"text": ""
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/32",
|
"self_ref": "#/texts/30",
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/10"
|
"$ref": "#/groups/10"
|
||||||
},
|
},
|
||||||
@ -993,6 +957,30 @@
|
|||||||
"script": "baseline"
|
"script": "baseline"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/31",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/32",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/33",
|
"self_ref": "#/texts/33",
|
||||||
"parent": {
|
"parent": {
|
||||||
@ -1055,30 +1043,6 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/38",
|
"self_ref": "#/texts/38",
|
||||||
"parent": {
|
|
||||||
"$ref": "#/body"
|
|
||||||
},
|
|
||||||
"children": [],
|
|
||||||
"content_layer": "body",
|
|
||||||
"label": "paragraph",
|
|
||||||
"prov": [],
|
|
||||||
"orig": "",
|
|
||||||
"text": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"self_ref": "#/texts/39",
|
|
||||||
"parent": {
|
|
||||||
"$ref": "#/body"
|
|
||||||
},
|
|
||||||
"children": [],
|
|
||||||
"content_layer": "body",
|
|
||||||
"label": "paragraph",
|
|
||||||
"prov": [],
|
|
||||||
"orig": "",
|
|
||||||
"text": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"self_ref": "#/texts/40",
|
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/12"
|
"$ref": "#/groups/12"
|
||||||
},
|
},
|
||||||
@ -1097,7 +1061,7 @@
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/41",
|
"self_ref": "#/texts/39",
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/12"
|
"$ref": "#/groups/12"
|
||||||
},
|
},
|
||||||
@ -1116,7 +1080,7 @@
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/42",
|
"self_ref": "#/texts/40",
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/12"
|
"$ref": "#/groups/12"
|
||||||
},
|
},
|
||||||
@ -1135,7 +1099,7 @@
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/43",
|
"self_ref": "#/texts/41",
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/11"
|
"$ref": "#/groups/11"
|
||||||
},
|
},
|
||||||
@ -1146,6 +1110,30 @@
|
|||||||
"orig": "",
|
"orig": "",
|
||||||
"text": ""
|
"text": ""
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/42",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/43",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/44",
|
"self_ref": "#/texts/44",
|
||||||
"parent": {
|
"parent": {
|
||||||
@ -1160,30 +1148,6 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/45",
|
"self_ref": "#/texts/45",
|
||||||
"parent": {
|
|
||||||
"$ref": "#/body"
|
|
||||||
},
|
|
||||||
"children": [],
|
|
||||||
"content_layer": "body",
|
|
||||||
"label": "paragraph",
|
|
||||||
"prov": [],
|
|
||||||
"orig": "",
|
|
||||||
"text": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"self_ref": "#/texts/46",
|
|
||||||
"parent": {
|
|
||||||
"$ref": "#/body"
|
|
||||||
},
|
|
||||||
"children": [],
|
|
||||||
"content_layer": "body",
|
|
||||||
"label": "paragraph",
|
|
||||||
"prov": [],
|
|
||||||
"orig": "",
|
|
||||||
"text": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"self_ref": "#/texts/47",
|
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/13"
|
"$ref": "#/groups/13"
|
||||||
},
|
},
|
||||||
@ -1202,7 +1166,7 @@
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/48",
|
"self_ref": "#/texts/46",
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/13"
|
"$ref": "#/groups/13"
|
||||||
},
|
},
|
||||||
@ -1214,7 +1178,7 @@
|
|||||||
"text": ""
|
"text": ""
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/49",
|
"self_ref": "#/texts/47",
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/body"
|
"$ref": "#/body"
|
||||||
},
|
},
|
||||||
@ -1226,7 +1190,7 @@
|
|||||||
"text": ""
|
"text": ""
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/50",
|
"self_ref": "#/texts/48",
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/14"
|
"$ref": "#/groups/14"
|
||||||
},
|
},
|
||||||
@ -1245,7 +1209,7 @@
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/51",
|
"self_ref": "#/texts/49",
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/14"
|
"$ref": "#/groups/14"
|
||||||
},
|
},
|
||||||
@ -1264,7 +1228,7 @@
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/52",
|
"self_ref": "#/texts/50",
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/body"
|
"$ref": "#/body"
|
||||||
},
|
},
|
||||||
@ -1275,6 +1239,37 @@
|
|||||||
"orig": "",
|
"orig": "",
|
||||||
"text": ""
|
"text": ""
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/51",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/52",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/15"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "Yes",
|
||||||
|
"text": "Yes",
|
||||||
|
"formatting": {
|
||||||
|
"bold": false,
|
||||||
|
"italic": false,
|
||||||
|
"underline": false,
|
||||||
|
"strikethrough": false,
|
||||||
|
"script": "baseline"
|
||||||
|
}
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/53",
|
"self_ref": "#/texts/53",
|
||||||
"parent": {
|
"parent": {
|
||||||
@ -1290,7 +1285,7 @@
|
|||||||
{
|
{
|
||||||
"self_ref": "#/texts/54",
|
"self_ref": "#/texts/54",
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/15"
|
"$ref": "#/groups/16"
|
||||||
},
|
},
|
||||||
"children": [],
|
"children": [],
|
||||||
"content_layer": "body",
|
"content_layer": "body",
|
||||||
@ -1321,48 +1316,17 @@
|
|||||||
{
|
{
|
||||||
"self_ref": "#/texts/56",
|
"self_ref": "#/texts/56",
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/16"
|
"$ref": "#/body"
|
||||||
},
|
},
|
||||||
"children": [],
|
"children": [],
|
||||||
"content_layer": "body",
|
"content_layer": "body",
|
||||||
"label": "paragraph",
|
"label": "paragraph",
|
||||||
"prov": [],
|
"prov": [],
|
||||||
"orig": "Yes",
|
"orig": "",
|
||||||
"text": "Yes",
|
"text": ""
|
||||||
"formatting": {
|
|
||||||
"bold": false,
|
|
||||||
"italic": false,
|
|
||||||
"underline": false,
|
|
||||||
"strikethrough": false,
|
|
||||||
"script": "baseline"
|
|
||||||
}
|
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/57",
|
"self_ref": "#/texts/57",
|
||||||
"parent": {
|
|
||||||
"$ref": "#/body"
|
|
||||||
},
|
|
||||||
"children": [],
|
|
||||||
"content_layer": "body",
|
|
||||||
"label": "paragraph",
|
|
||||||
"prov": [],
|
|
||||||
"orig": "",
|
|
||||||
"text": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"self_ref": "#/texts/58",
|
|
||||||
"parent": {
|
|
||||||
"$ref": "#/body"
|
|
||||||
},
|
|
||||||
"children": [],
|
|
||||||
"content_layer": "body",
|
|
||||||
"label": "paragraph",
|
|
||||||
"prov": [],
|
|
||||||
"orig": "",
|
|
||||||
"text": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"self_ref": "#/texts/59",
|
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/17"
|
"$ref": "#/groups/17"
|
||||||
},
|
},
|
||||||
@ -1381,7 +1345,7 @@
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/60",
|
"self_ref": "#/texts/58",
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/17"
|
"$ref": "#/groups/17"
|
||||||
},
|
},
|
||||||
@ -1393,7 +1357,7 @@
|
|||||||
"text": ""
|
"text": ""
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/61",
|
"self_ref": "#/texts/59",
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/17"
|
"$ref": "#/groups/17"
|
||||||
},
|
},
|
||||||
@ -1411,6 +1375,37 @@
|
|||||||
"script": "baseline"
|
"script": "baseline"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/60",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/61",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/18"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "No",
|
||||||
|
"text": "No",
|
||||||
|
"formatting": {
|
||||||
|
"bold": false,
|
||||||
|
"italic": false,
|
||||||
|
"underline": false,
|
||||||
|
"strikethrough": false,
|
||||||
|
"script": "baseline"
|
||||||
|
}
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/62",
|
"self_ref": "#/texts/62",
|
||||||
"parent": {
|
"parent": {
|
||||||
@ -1426,21 +1421,14 @@
|
|||||||
{
|
{
|
||||||
"self_ref": "#/texts/63",
|
"self_ref": "#/texts/63",
|
||||||
"parent": {
|
"parent": {
|
||||||
"$ref": "#/groups/18"
|
"$ref": "#/body"
|
||||||
},
|
},
|
||||||
"children": [],
|
"children": [],
|
||||||
"content_layer": "body",
|
"content_layer": "body",
|
||||||
"label": "paragraph",
|
"label": "paragraph",
|
||||||
"prov": [],
|
"prov": [],
|
||||||
"orig": "No",
|
"orig": "",
|
||||||
"text": "No",
|
"text": ""
|
||||||
"formatting": {
|
|
||||||
"bold": false,
|
|
||||||
"italic": false,
|
|
||||||
"underline": false,
|
|
||||||
"strikethrough": false,
|
|
||||||
"script": "baseline"
|
|
||||||
}
|
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"self_ref": "#/texts/64",
|
"self_ref": "#/texts/64",
|
||||||
@ -1453,30 +1441,6 @@
|
|||||||
"prov": [],
|
"prov": [],
|
||||||
"orig": "",
|
"orig": "",
|
||||||
"text": ""
|
"text": ""
|
||||||
},
|
|
||||||
{
|
|
||||||
"self_ref": "#/texts/65",
|
|
||||||
"parent": {
|
|
||||||
"$ref": "#/body"
|
|
||||||
},
|
|
||||||
"children": [],
|
|
||||||
"content_layer": "body",
|
|
||||||
"label": "paragraph",
|
|
||||||
"prov": [],
|
|
||||||
"orig": "",
|
|
||||||
"text": ""
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"self_ref": "#/texts/66",
|
|
||||||
"parent": {
|
|
||||||
"$ref": "#/body"
|
|
||||||
},
|
|
||||||
"children": [],
|
|
||||||
"content_layer": "body",
|
|
||||||
"label": "paragraph",
|
|
||||||
"prov": [],
|
|
||||||
"orig": "",
|
|
||||||
"text": ""
|
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"pictures": [],
|
"pictures": [],
|
||||||
|
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "unit_test_01",
 "origin": {
 "mimetype": "text/html",
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "unit_test_formatting",
 "origin": {
 "mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@@ -429,7 +429,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/14",
@@ -450,7 +450,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/15",
@@ -471,7 +471,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/16",
@@ -489,7 +489,7 @@
 "orig": "",
 "text": "",
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/17",
@@ -583,7 +583,7 @@
 "orig": "",
 "text": "",
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/22",
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "unit_test_headers",
 "origin": {
 "mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "unit_test_headers_numbered",
 "origin": {
 "mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "unit_test_lists",
 "origin": {
 "mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@@ -456,7 +456,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/9",
@@ -477,7 +477,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/10",
@@ -498,7 +498,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/11",
@@ -551,7 +551,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/14",
@@ -572,7 +572,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/15",
@@ -593,7 +593,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/16",
@@ -646,7 +646,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/19",
@@ -667,7 +667,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/20",
@@ -688,7 +688,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/21",
@@ -709,7 +709,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/22",
@@ -730,7 +730,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/23",
@@ -751,7 +751,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/24",
@@ -804,7 +804,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/27",
@@ -825,7 +825,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/28",
@@ -846,7 +846,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/29",
@@ -899,7 +899,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/32",
@@ -920,7 +920,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/33",
@@ -941,7 +941,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/34",
@@ -962,7 +962,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/35",
@@ -1021,7 +1021,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/38",
@@ -1042,7 +1042,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/39",
@@ -1063,7 +1063,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/40",
@@ -1084,7 +1084,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/41",
@@ -1105,7 +1105,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/42",
@@ -1126,7 +1126,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/43",
@@ -302,7 +302,7 @@ item-0 at level 0: unspecified: group _root_
 item-288 at level 4: list_item: Rubber duck
 item-289 at level 2: section_header: Notes
 item-290 at level 3: section_header: Citations
-item-291 at level 4: ordered_list: group ordered list
+item-291 at level 4: list: group ordered list
 item-292 at level 5: list_item: ^ "Duckling". The American Herit ... n Company. 2006. Retrieved 2015-05-22.
 item-293 at level 5: list_item: ^ "Duckling". Kernerman English ... Ltd. 2000–2006. Retrieved 2015-05-22.
 item-294 at level 5: list_item: ^ Dohner, Janet Vorwald (2001). ... University Press. ISBN 978-0300138139.
File diff suppressed because it is too large
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "word_image_anchors",
 "origin": {
 "mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "word_sample",
 "origin": {
 "mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
@@ -243,7 +243,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/7",
@@ -264,7 +264,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/8",
@@ -285,7 +285,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/9",
@@ -325,7 +325,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/11",
@@ -346,7 +346,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/12",
@@ -367,7 +367,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/13",
@@ -530,7 +530,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/21",
@@ -551,7 +551,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 },
 {
 "self_ref": "#/texts/22",
@@ -572,7 +572,7 @@
 "script": "baseline"
 },
 "enumerated": false,
-"marker": "-"
+"marker": ""
 }
 ],
 "pictures": [
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "word_tables",
 "origin": {
 "mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
4 tests/data/html/example_01.html vendored
@@ -13,5 +13,9 @@
 <li>First item in ordered list</li>
 <li>Second item in ordered list</li>
 </ol>
+<ol start="42">
+<li>First item in ordered list with start</li>
+<li>Second item in ordered list with start</li>
+</ol>
 </body>
 </html>
16 tests/data/md/inline_and_formatting.md vendored
@@ -11,8 +11,22 @@ Create your feature branch: `git checkout -b feature/AmazingFeature`.
 3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
 4. Push to the branch (`git push origin feature/AmazingFeature`)
 5. Open a Pull Request
+6. **Whole list item has same formatting**
+7. List item has *mixed or partial* formatting
 
-## *Second* section <!-- inline groups in headings not yet supported by serializers -->
+# *Whole heading is italic*
 
 - **First**: Lorem ipsum.
 - **Second**: Dolor `sit` amet.
+
+Some *`formatted_code`*
+
+## *Partially formatted* heading to_escape `not_to_escape`
+
+[$$E=mc^2$$](https://en.wikipedia.org/wiki/Albert_Einstein)
+
+## Table Heading
+
+| **Bold Heading** | *Italic Heading* |
+|------------------|------------------|
+| data a | data b |
@@ -1,6 +1,6 @@
 {
 "schema_name": "DoclingDocument",
-"version": "1.4.0",
+"version": "1.5.0",
 "name": "webp-test",
 "origin": {
 "mimetype": "application/pdf",
Some files were not shown because too many files have changed in this diff