Merge from main

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Christoph Auer 2025-03-14 13:52:36 +01:00
commit 412c013d95
66 changed files with 2678 additions and 760 deletions

.github/SECURITY.md

@ -20,4 +20,4 @@ After the initial reply to your report, the security team will keep you informed
## Security Alerts ## Security Alerts
We will send announcements of security vulnerabilities and steps to remediate on the [Docling announcements](https://github.com/DS4SD/docling/discussions/categories/announcements). We will send announcements of security vulnerabilities and steps to remediate on the [Docling announcements](https://github.com/docling-project/docling/discussions/categories/announcements).

@ -10,7 +10,7 @@ on:
jobs: jobs:
build-docs: build-docs:
if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'DS4SD/docling' && github.event.pull_request.head.repo.full_name != 'ds4sd/docling') }} if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'docling-project/docling' && github.event.pull_request.head.repo.full_name != 'docling-project/docling') }}
uses: ./.github/workflows/docs.yml uses: ./.github/workflows/docs.yml
with: with:
deploy: false deploy: false

@ -15,5 +15,5 @@ env:
jobs: jobs:
code-checks: code-checks:
if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'DS4SD/docling' && github.event.pull_request.head.repo.full_name != 'ds4sd/docling') }} if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'docling-project/docling' && github.event.pull_request.head.repo.full_name != 'docling-project/docling') }}
uses: ./.github/workflows/checks.yml uses: ./.github/workflows/checks.yml

File diff suppressed because it is too large.

@ -2,13 +2,13 @@
Our project welcomes external contributions. If you have an itch, please feel Our project welcomes external contributions. If you have an itch, please feel
free to scratch it. free to scratch it.
To contribute code or documentation, please submit a [pull request](https://github.com/DS4SD/docling/pulls). To contribute code or documentation, please submit a [pull request](https://github.com/docling-project/docling/pulls).
A good way to familiarize yourself with the codebase and contribution process is A good way to familiarize yourself with the codebase and contribution process is
to look for and tackle low-hanging fruit in the [issue tracker](https://github.com/DS4SD/docling/issues). to look for and tackle low-hanging fruit in the [issue tracker](https://github.com/docling-project/docling/issues).
Before embarking on a more ambitious contribution, please quickly [get in touch](#communication) with us. Before embarking on a more ambitious contribution, please quickly [get in touch](#communication) with us.
For general questions or support requests, please refer to the [discussion section](https://github.com/DS4SD/docling/discussions). For general questions or support requests, please refer to the [discussion section](https://github.com/docling-project/docling/discussions).
**Note: We appreciate your effort and want to avoid situations where a contribution **Note: We appreciate your effort and want to avoid situations where a contribution
requires extensive rework (by you or by us), sits in the backlog for a long time, or requires extensive rework (by you or by us), sits in the backlog for a long time, or
@ -16,14 +16,14 @@ cannot be accepted at all!**
### Proposing New Features ### Proposing New Features
If you would like to implement a new feature, please [raise an issue](https://github.com/DS4SD/docling/issues) If you would like to implement a new feature, please [raise an issue](https://github.com/docling-project/docling/issues)
before sending a pull request so the feature can be discussed. This is to avoid before sending a pull request so the feature can be discussed. This is to avoid
you spending valuable time working on a feature that the project developers you spending valuable time working on a feature that the project developers
are not interested in accepting into the codebase. are not interested in accepting into the codebase.
### Fixing Bugs ### Fixing Bugs
If you would like to fix a bug, please [raise an issue](https://github.com/DS4SD/docling/issues) before sending a If you would like to fix a bug, please [raise an issue](https://github.com/docling-project/docling/issues) before sending a
pull request so it can be tracked. pull request so it can be tracked.
### Merge Approval ### Merge Approval
@ -78,7 +78,7 @@ This project strictly adheres to using dependencies that are compatible with the
## Communication ## Communication
Please feel free to connect with us using the [discussion section](https://github.com/DS4SD/docling/discussions). Please feel free to connect with us using the [discussion section](https://github.com/docling-project/docling/discussions).

@ -1,6 +1,6 @@
<p align="center"> <p align="center">
<a href="https://github.com/ds4sd/docling"> <a href="https://github.com/docling-project/docling">
<img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/docs/assets/docling_processing.png" width="100%"/> <img loading="lazy" alt="Docling" src="https://github.com/docling-project/docling/raw/main/docs/assets/docling_processing.png" width="100%"/>
</a> </a>
</p> </p>
@ -11,7 +11,7 @@
</p> </p>
[![arXiv](https://img.shields.io/badge/arXiv-2408.09869-b31b1b.svg)](https://arxiv.org/abs/2408.09869) [![arXiv](https://img.shields.io/badge/arXiv-2408.09869-b31b1b.svg)](https://arxiv.org/abs/2408.09869)
[![Docs](https://img.shields.io/badge/docs-live-brightgreen)](https://ds4sd.github.io/docling/) [![Docs](https://img.shields.io/badge/docs-live-brightgreen)](https://docling-project.github.io/docling/)
[![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/) [![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/docling)](https://pypi.org/project/docling/) [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/docling)](https://pypi.org/project/docling/)
[![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/) [![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
@ -19,7 +19,7 @@
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/) [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev) [![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit) [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT) [![License MIT](https://img.shields.io/github/license/docling-project/docling)](https://opensource.org/licenses/MIT)
[![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling) [![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem. Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
@ -51,7 +51,7 @@ pip install docling
Works on macOS, Linux and Windows environments. Both x86_64 and arm64 architectures. Works on macOS, Linux and Windows environments. Both x86_64 and arm64 architectures.
More [detailed installation instructions](https://ds4sd.github.io/docling/installation/) are available in the docs. More [detailed installation instructions](https://docling-project.github.io/docling/installation/) are available in the docs.
## Getting started ## Getting started
@ -66,28 +66,28 @@ result = converter.convert(source)
print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]" print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]"
``` ```
More [advanced usage options](https://ds4sd.github.io/docling/usage/) are available in More [advanced usage options](https://docling-project.github.io/docling/usage/) are available in
the docs. the docs.
## Documentation ## Documentation
Check out Docling's [documentation](https://ds4sd.github.io/docling/), for details on Check out Docling's [documentation](https://docling-project.github.io/docling/), for details on
installation, usage, concepts, recipes, extensions, and more. installation, usage, concepts, recipes, extensions, and more.
## Examples ## Examples
Go hands-on with our [examples](https://ds4sd.github.io/docling/examples/), Go hands-on with our [examples](https://docling-project.github.io/docling/examples/),
demonstrating how to address different application use cases with Docling. demonstrating how to address different application use cases with Docling.
## Integrations ## Integrations
To further accelerate your AI application development, check out Docling's native To further accelerate your AI application development, check out Docling's native
[integrations](https://ds4sd.github.io/docling/integrations/) with popular frameworks [integrations](https://docling-project.github.io/docling/integrations/) with popular frameworks
and tools. and tools.
## Get help and support ## Get help and support
Please feel free to connect with us using the [discussion section](https://github.com/DS4SD/docling/discussions). Please feel free to connect with us using the [discussion section](https://github.com/docling-project/docling/discussions).
## Technical report ## Technical report
@ -95,7 +95,7 @@ For more details on Docling's inner workings, check out the [Docling Technical R
## Contributing ## Contributing
Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details. Please read [Contributing to Docling](https://github.com/docling-project/docling/blob/main/CONTRIBUTING.md) for details.
## References ## References
@ -123,6 +123,6 @@ For individual model usage, please refer to the model licenses found in the orig
Docling has been brought to you by IBM. Docling has been brought to you by IBM.
[supported_formats]: https://ds4sd.github.io/docling/usage/supported_formats/ [supported_formats]: https://docling-project.github.io/docling/usage/supported_formats/
[docling_document]: https://ds4sd.github.io/docling/concepts/docling_document/ [docling_document]: https://docling-project.github.io/docling/concepts/docling_document/
[integrations]: https://ds4sd.github.io/docling/integrations/ [integrations]: https://docling-project.github.io/docling/integrations/

@ -380,7 +380,7 @@ class AsciiDocBackend(DeclarativeDocumentBackend):
end_row_offset_idx=row_idx + row_span, end_row_offset_idx=row_idx + row_span,
start_col_offset_idx=col_idx, start_col_offset_idx=col_idx,
end_col_offset_idx=col_idx + col_span, end_col_offset_idx=col_idx + col_span,
col_header=False, column_header=row_idx == 0,
row_header=False, row_header=False,
) )
data.table_cells.append(cell) data.table_cells.append(cell)

@ -111,7 +111,7 @@ class CsvDocumentBackend(DeclarativeDocumentBackend):
end_row_offset_idx=row_idx + 1, end_row_offset_idx=row_idx + 1,
start_col_offset_idx=col_idx, start_col_offset_idx=col_idx,
end_col_offset_idx=col_idx + 1, end_col_offset_idx=col_idx + 1,
col_header=row_idx == 0, # First row as header column_header=row_idx == 0, # First row as header
row_header=False, row_header=False,
) )
table_data.table_cells.append(cell) table_data.table_cells.append(cell)
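
The header rule introduced in these table hunks can be tried in isolation. The following is a minimal sketch under stated assumptions, not the backend's actual code: `Cell` is a hypothetical stand-in that mirrors only the keyword arguments visible in the diff, and `rows_to_cells` is an invented helper name.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    # Hypothetical stand-in for the table cell model used in the diff.
    text: str
    start_row_offset_idx: int
    end_row_offset_idx: int
    start_col_offset_idx: int
    end_col_offset_idx: int
    column_header: bool
    row_header: bool


def rows_to_cells(rows: List[List[str]]) -> List[Cell]:
    """Flatten a rectangular grid into cells, flagging row 0 as the header."""
    cells = []
    for row_idx, row in enumerate(rows):
        for col_idx, text in enumerate(row):
            cells.append(
                Cell(
                    text=text,
                    start_row_offset_idx=row_idx,
                    end_row_offset_idx=row_idx + 1,
                    start_col_offset_idx=col_idx,
                    end_col_offset_idx=col_idx + 1,
                    column_header=row_idx == 0,  # first row as header
                    row_header=False,
                )
            )
    return cells


cells = rows_to_cells([["name", "qty"], ["apple", "3"]])
header_texts = [c.text for c in cells if c.column_header]
```

Only the cells of the first row end up in `header_texts`; every later row is treated as data.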

@ -0,0 +1,271 @@
# -*- coding: utf-8 -*-
"""
Adapted from https://github.com/xiilei/dwml/blob/master/dwml/latex_dict.py
On 23/01/2025
"""
from __future__ import unicode_literals
CHARS = ("{", "}", "_", "^", "#", "&", "$", "%", "~")
BLANK = ""
BACKSLASH = "\\"
ALN = "&"
CHR = {
# Unicode : Latex Math Symbols
# Top accents
"\u0300": "\\grave{{{0}}}",
"\u0301": "\\acute{{{0}}}",
"\u0302": "\\hat{{{0}}}",
"\u0303": "\\tilde{{{0}}}",
"\u0304": "\\bar{{{0}}}",
"\u0305": "\\overbar{{{0}}}",
"\u0306": "\\breve{{{0}}}",
"\u0307": "\\dot{{{0}}}",
"\u0308": "\\ddot{{{0}}}",
"\u0309": "\\ovhook{{{0}}}",
"\u030a": "\\ocirc{{{0}}}",
"\u030c": "\\check{{{0}}}",
"\u0310": "\\candra{{{0}}}",
"\u0312": "\\oturnedcomma{{{0}}}",
"\u0315": "\\ocommatopright{{{0}}}",
"\u031a": "\\droang{{{0}}}",
"\u0338": "\\not{{{0}}}",
"\u20d0": "\\leftharpoonaccent{{{0}}}",
"\u20d1": "\\rightharpoonaccent{{{0}}}",
"\u20d2": "\\vertoverlay{{{0}}}",
"\u20d6": "\\overleftarrow{{{0}}}",
"\u20d7": "\\vec{{{0}}}",
"\u20db": "\\dddot{{{0}}}",
"\u20dc": "\\ddddot{{{0}}}",
"\u20e1": "\\overleftrightarrow{{{0}}}",
"\u20e7": "\\annuity{{{0}}}",
"\u20e9": "\\widebridgeabove{{{0}}}",
"\u20f0": "\\asteraccent{{{0}}}",
# Bottom accents
"\u0330": "\\wideutilde{{{0}}}",
"\u0331": "\\underbar{{{0}}}",
"\u20e8": "\\threeunderdot{{{0}}}",
"\u20ec": "\\underrightharpoondown{{{0}}}",
"\u20ed": "\\underleftharpoondown{{{0}}}",
"\u20ee": "\\underleftarrow{{{0}}}",
"\u20ef": "\\underrightarrow{{{0}}}",
# Over | group
"\u23b4": "\\overbracket{{{0}}}",
"\u23dc": "\\overparen{{{0}}}",
"\u23de": "\\overbrace{{{0}}}",
# Under| group
"\u23b5": "\\underbracket{{{0}}}",
"\u23dd": "\\underparen{{{0}}}",
"\u23df": "\\underbrace{{{0}}}",
}
CHR_BO = {
# Big operators,
"\u2140": "\\Bbbsum",
"\u220f": "\\prod",
"\u2210": "\\coprod",
"\u2211": "\\sum",
"\u222b": "\\int",
"\u22c0": "\\bigwedge",
"\u22c1": "\\bigvee",
"\u22c2": "\\bigcap",
"\u22c3": "\\bigcup",
"\u2a00": "\\bigodot",
"\u2a01": "\\bigoplus",
"\u2a02": "\\bigotimes",
}
T = {
"\u2192": "\\rightarrow ",
# Greek letters
"\U0001d6fc": "\\alpha ",
"\U0001d6fd": "\\beta ",
"\U0001d6fe": "\\gamma ",
"\U0001d6ff": "\\delta ",
"\U0001d700": "\\epsilon ",
"\U0001d701": "\\zeta ",
"\U0001d702": "\\eta ",
"\U0001d703": "\\theta ",
"\U0001d704": "\\iota ",
"\U0001d705": "\\kappa ",
"\U0001d706": "\\lambda ",
"\U0001d707": "\\mu ",
"\U0001d708": "\\nu ",
"\U0001d709": "\\xi ",
"\U0001d70a": "\\omicron ",
"\U0001d70b": "\\pi ",
"\U0001d70c": "\\rho ",
"\U0001d70d": "\\varsigma ",
"\U0001d70e": "\\sigma ",
"\U0001d70f": "\\tau ",
"\U0001d710": "\\upsilon ",
"\U0001d711": "\\phi ",
"\U0001d712": "\\chi ",
"\U0001d713": "\\psi ",
"\U0001d714": "\\omega ",
"\U0001d715": "\\partial ",
"\U0001d716": "\\varepsilon ",
"\U0001d717": "\\vartheta ",
"\U0001d718": "\\varkappa ",
"\U0001d719": "\\varphi ",
"\U0001d71a": "\\varrho ",
"\U0001d71b": "\\varpi ",
# Relation symbols
"\u2190": "\\leftarrow ",
"\u2191": "\\uparrow ",
"\u2192": "\\rightarrow ",
"\u2193": "\\downarrow ",
"\u2194": "\\leftrightarrow ",
"\u2195": "\\updownarrow ",
"\u2196": "\\nwarrow ",
"\u2197": "\\nearrow ",
"\u2198": "\\searrow ",
"\u2199": "\\swarrow ",
"\u22ee": "\\vdots ",
"\u22ef": "\\cdots ",
"\u22f0": "\\adots ",
"\u22f1": "\\ddots ",
"\u2260": "\\ne ",
"\u2264": "\\leq ",
"\u2265": "\\geq ",
"\u2266": "\\leqq ",
"\u2267": "\\geqq ",
"\u2268": "\\lneqq ",
"\u2269": "\\gneqq ",
"\u226a": "\\ll ",
"\u226b": "\\gg ",
"\u2208": "\\in ",
"\u2209": "\\notin ",
"\u220b": "\\ni ",
"\u220c": "\\nni ",
# Ordinary symbols
"\u221e": "\\infty ",
# Binary relations
"\u00b1": "\\pm ",
"\u2213": "\\mp ",
# Italic, Latin, uppercase
"\U0001d434": "A",
"\U0001d435": "B",
"\U0001d436": "C",
"\U0001d437": "D",
"\U0001d438": "E",
"\U0001d439": "F",
"\U0001d43a": "G",
"\U0001d43b": "H",
"\U0001d43c": "I",
"\U0001d43d": "J",
"\U0001d43e": "K",
"\U0001d43f": "L",
"\U0001d440": "M",
"\U0001d441": "N",
"\U0001d442": "O",
"\U0001d443": "P",
"\U0001d444": "Q",
"\U0001d445": "R",
"\U0001d446": "S",
"\U0001d447": "T",
"\U0001d448": "U",
"\U0001d449": "V",
"\U0001d44a": "W",
"\U0001d44b": "X",
"\U0001d44c": "Y",
"\U0001d44d": "Z",
# Italic, Latin, lowercase
"\U0001d44e": "a",
"\U0001d44f": "b",
"\U0001d450": "c",
"\U0001d451": "d",
"\U0001d452": "e",
"\U0001d453": "f",
"\U0001d454": "g",
"\U0001d456": "i",
"\U0001d457": "j",
"\U0001d458": "k",
"\U0001d459": "l",
"\U0001d45a": "m",
"\U0001d45b": "n",
"\U0001d45c": "o",
"\U0001d45d": "p",
"\U0001d45e": "q",
"\U0001d45f": "r",
"\U0001d460": "s",
"\U0001d461": "t",
"\U0001d462": "u",
"\U0001d463": "v",
"\U0001d464": "w",
"\U0001d465": "x",
"\U0001d466": "y",
"\U0001d467": "z",
}
FUNC = {
"sin": "\\sin({fe})",
"cos": "\\cos({fe})",
"tan": "\\tan({fe})",
"arcsin": "\\arcsin({fe})",
"arccos": "\\arccos({fe})",
"arctan": "\\arctan({fe})",
"arccot": "\\arccot({fe})",
"sinh": "\\sinh({fe})",
"cosh": "\\cosh({fe})",
"tanh": "\\tanh({fe})",
"coth": "\\coth({fe})",
"sec": "\\sec({fe})",
"csc": "\\csc({fe})",
}
FUNC_PLACE = "{fe}"
BRK = "\\\\"
CHR_DEFAULT = {
"ACC_VAL": "\\hat{{{0}}}",
}
POS = {
"top": "\\overline{{{0}}}", # not sure
"bot": "\\underline{{{0}}}",
}
POS_DEFAULT = {
"BAR_VAL": "\\overline{{{0}}}",
}
SUB = "_{{{0}}}"
SUP = "^{{{0}}}"
F = {
"bar": "\\frac{{{num}}}{{{den}}}",
"skw": r"^{{{num}}}/_{{{den}}}",
"noBar": "\\genfrac{{}}{{}}{{0pt}}{{}}{{{num}}}{{{den}}}",
"lin": "{{{num}}}/{{{den}}}",
}
F_DEFAULT = "\\frac{{{num}}}{{{den}}}"
D = "\\left{left}{text}\\right{right}"
D_DEFAULT = {
"left": "(",
"right": ")",
"null": ".",
}
RAD = "\\sqrt[{deg}]{{{text}}}"
RAD_DEFAULT = "\\sqrt{{{text}}}"
ARR = "{text}"
LIM_FUNC = {
"lim": "\\lim_{{{lim}}}",
"max": "\\max_{{{lim}}}",
"min": "\\min_{{{lim}}}",
}
LIM_TO = ("\\rightarrow", "\\to")
LIM_UPP = "\\overset{{{lim}}}{{{text}}}"
M = "\\begin{{matrix}}{text}\\end{{matrix}}"
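
The triple-brace templates in this dictionary rely on `str.format` escaping: `{{` and `}}` emit literal braces, while `{0}` or `{num}` is the substitution slot. The sketch below is illustrative only; the constant names and the `is_valid_template` helper are assumptions made for this example and are not part of the module.

```python
# Same shape as the accent entries in CHR and the "bar" entry in F.
ACCENT_HAT = "\\hat{{{0}}}"
FRACTION = "\\frac{{{num}}}{{{den}}}"

hat_x = ACCENT_HAT.format("x")            # literal braces around the argument
half = FRACTION.format(num="1", den="2")


def is_valid_template(template: str) -> bool:
    """Return True if a positional template formats without raising."""
    try:
        template.format("x")
        return True
    except (ValueError, IndexError, KeyError):
        return False


balanced_ok = is_valid_template("\\check{{{0}}}")
# A fourth closing brace leaves a single unescaped "}" behind,
# which str.format rejects with ValueError.
unbalanced_ok = is_valid_template("\\check{{{0}}}}")
```

A check of this kind could run over every template value in a unit test, since an unbalanced entry would otherwise only fail when a document happens to contain that character.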

@ -0,0 +1,453 @@
"""
Office Math Markup Language (OMML)
Adapted from https://github.com/xiilei/dwml/blob/master/dwml/omml.py
On 23/01/2025
"""
import lxml.etree as ET
from pylatexenc.latexencode import UnicodeToLatexEncoder
from docling.backend.docx.latex.latex_dict import (
ALN,
ARR,
BACKSLASH,
BLANK,
BRK,
CHARS,
CHR,
CHR_BO,
CHR_DEFAULT,
D_DEFAULT,
F_DEFAULT,
FUNC,
FUNC_PLACE,
LIM_FUNC,
LIM_TO,
LIM_UPP,
POS,
POS_DEFAULT,
RAD,
RAD_DEFAULT,
SUB,
SUP,
D,
F,
M,
T,
)
OMML_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/math}"
def load(stream):
tree = ET.parse(stream)
for omath in tree.findall(OMML_NS + "oMath"):
yield oMath2Latex(omath)
def load_string(string):
root = ET.fromstring(string)
for omath in root.findall(OMML_NS + "oMath"):
yield oMath2Latex(omath)
def escape_latex(strs):
last = None
new_chr = []
strs = strs.replace(r"\\", "\\")
for c in strs:
if (c in CHARS) and (last != BACKSLASH):
new_chr.append(BACKSLASH + c)
else:
new_chr.append(c)
last = c
return BLANK.join(new_chr)
def get_val(key, default=None, store=CHR):
if key is not None:
return key if not store else store.get(key, key)
else:
return default
class Tag2Method(object):
def call_method(self, elm, stag=None):
getmethod = self.tag2meth.get
if stag is None:
stag = elm.tag.replace(OMML_NS, "")
method = getmethod(stag)
if method:
return method(self, elm)
else:
return None
def process_children_list(self, elm, include=None):
"""
process children of the elm,return iterable
"""
for _e in list(elm):
if OMML_NS not in _e.tag:
continue
stag = _e.tag.replace(OMML_NS, "")
if include and (stag not in include):
continue
t = self.call_method(_e, stag=stag)
if t is None:
t = self.process_unknow(_e, stag)
if t is None:
continue
yield (stag, t, _e)
def process_children_dict(self, elm, include=None):
"""
process children of the elm,return dict
"""
latex_chars = dict()
for stag, t, e in self.process_children_list(elm, include):
latex_chars[stag] = t
return latex_chars
def process_children(self, elm, include=None):
"""
process children of the elm,return string
"""
return BLANK.join(
(
t if not isinstance(t, Tag2Method) else str(t)
for stag, t, e in self.process_children_list(elm, include)
)
)
def process_unknow(self, elm, stag):
return None
class Pr(Tag2Method):
text = ""
__val_tags = ("chr", "pos", "begChr", "endChr", "type")
__innerdict = None # can't use the __dict__
""" common properties of element"""
def __init__(self, elm):
self.__innerdict = {}
self.text = self.process_children(elm)
def __str__(self):
return self.text
def __unicode__(self):
return self.__str__()
def __getattr__(self, name):
return self.__innerdict.get(name, None)
def do_brk(self, elm):
self.__innerdict["brk"] = BRK
return BRK
def do_common(self, elm):
stag = elm.tag.replace(OMML_NS, "")
if stag in self.__val_tags:
t = elm.get("{0}val".format(OMML_NS))
self.__innerdict[stag] = t
return None
tag2meth = {
"brk": do_brk,
"chr": do_common,
"pos": do_common,
"begChr": do_common,
"endChr": do_common,
"type": do_common,
}
class oMath2Latex(Tag2Method):
"""
Convert oMath element of omml to latex
"""
_t_dict = T
__direct_tags = ("box", "sSub", "sSup", "sSubSup", "num", "den", "deg", "e")
u = UnicodeToLatexEncoder(
replacement_latex_protection="braces-all",
unknown_char_policy="keep",
unknown_char_warning=False,
)
def __init__(self, element):
self._latex = self.process_children(element)
def __str__(self):
return self.latex.replace("  ", " ")
def __unicode__(self):
return self.__str__()
def process_unknow(self, elm, stag):
if stag in self.__direct_tags:
return self.process_children(elm)
elif stag[-2:] == "Pr":
return Pr(elm)
else:
return None
@property
def latex(self):
return self._latex
def do_acc(self, elm):
"""
the accent function
"""
c_dict = self.process_children_dict(elm)
latex_s = get_val(
c_dict["accPr"].chr, default=CHR_DEFAULT.get("ACC_VAL"), store=CHR
)
return latex_s.format(c_dict["e"])
def do_bar(self, elm):
"""
the bar function
"""
c_dict = self.process_children_dict(elm)
pr = c_dict["barPr"]
latex_s = get_val(pr.pos, default=POS_DEFAULT.get("BAR_VAL"), store=POS)
return pr.text + latex_s.format(c_dict["e"])
def do_d(self, elm):
"""
the delimiter object
"""
c_dict = self.process_children_dict(elm)
pr = c_dict["dPr"]
null = D_DEFAULT.get("null")
s_val = get_val(pr.begChr, default=D_DEFAULT.get("left"), store=T)
e_val = get_val(pr.endChr, default=D_DEFAULT.get("right"), store=T)
delim = pr.text + D.format(
left=null if not s_val else escape_latex(s_val),
text=c_dict["e"],
right=null if not e_val else escape_latex(e_val),
)
return delim
def do_spre(self, elm):
"""
the Pre-Sub-Superscript object -- not supported yet
"""
pass
def do_sub(self, elm):
text = self.process_children(elm)
return SUB.format(text)
def do_sup(self, elm):
text = self.process_children(elm)
return SUP.format(text)
def do_f(self, elm):
"""
the fraction object
"""
c_dict = self.process_children_dict(elm)
pr = c_dict["fPr"]
latex_s = get_val(pr.type, default=F_DEFAULT, store=F)
return pr.text + latex_s.format(num=c_dict.get("num"), den=c_dict.get("den"))
def do_func(self, elm):
"""
the Function-Apply object (Examples:sin cos)
"""
c_dict = self.process_children_dict(elm)
func_name = c_dict.get("fName")
return func_name.replace(FUNC_PLACE, c_dict.get("e"))
def do_fname(self, elm):
"""
the func name
"""
latex_chars = []
for stag, t, e in self.process_children_list(elm):
if stag == "r":
if FUNC.get(t):
latex_chars.append(FUNC[t])
else:
raise NotImplementedError("Not support func %s" % t)
else:
latex_chars.append(t)
t = BLANK.join(latex_chars)
return t if FUNC_PLACE in t else t + FUNC_PLACE # do_func will replace this
def do_groupchr(self, elm):
"""
the Group-Character object
"""
c_dict = self.process_children_dict(elm)
pr = c_dict["groupChrPr"]
latex_s = get_val(pr.chr)
return pr.text + latex_s.format(c_dict["e"])
def do_rad(self, elm):
"""
the radical object
"""
c_dict = self.process_children_dict(elm)
text = c_dict.get("e")
deg_text = c_dict.get("deg")
if deg_text:
return RAD.format(deg=deg_text, text=text)
else:
return RAD_DEFAULT.format(text=text)
def do_eqarr(self, elm):
"""
the Array object
"""
return ARR.format(
text=BRK.join(
[t for stag, t, e in self.process_children_list(elm, include=("e",))]
)
)
def do_limlow(self, elm):
"""
the Lower-Limit object
"""
t_dict = self.process_children_dict(elm, include=("e", "lim"))
latex_s = LIM_FUNC.get(t_dict["e"])
if not latex_s:
raise NotImplementedError("Not support lim %s" % t_dict["e"])
else:
return latex_s.format(lim=t_dict.get("lim"))
def do_limupp(self, elm):
"""
the Upper-Limit object
"""
t_dict = self.process_children_dict(elm, include=("e", "lim"))
return LIM_UPP.format(lim=t_dict.get("lim"), text=t_dict.get("e"))
def do_lim(self, elm):
"""
the lower limit of the limLow object and the upper limit of the limUpp function
"""
return self.process_children(elm).replace(LIM_TO[0], LIM_TO[1])
def do_m(self, elm):
"""
the Matrix object
"""
rows = []
for stag, t, e in self.process_children_list(elm):
if stag == "mPr":
pass
elif stag == "mr":
rows.append(t)
return M.format(text=BRK.join(rows))
def do_mr(self, elm):
"""
a single row of the matrix m
"""
return ALN.join(
[t for stag, t, e in self.process_children_list(elm, include=("e",))]
)
def do_nary(self, elm):
"""
the n-ary object
"""
res = []
bo = ""
for stag, t, e in self.process_children_list(elm):
if stag == "naryPr":
bo = get_val(t.chr, store=CHR_BO)
else:
res.append(t)
return bo + BLANK.join(res)
def process_unicode(self, s):
# s = s if isinstance(s,unicode) else unicode(s,'utf-8')
# print(s, self._t_dict.get(s, s), unicode_to_latex(s))
# _str.append( self._t_dict.get(s, s) )
out_latex_str = self.u.unicode_to_latex(s)
# print(s, out_latex_str)
if (
s.startswith("{") is False
and out_latex_str.startswith("{")
and s.endswith("}") is False
and out_latex_str.endswith("}")
):
out_latex_str = f" {out_latex_str[1:-1]} "
# print(s, out_latex_str)
if "ensuremath" in out_latex_str:
out_latex_str = out_latex_str.replace("\\ensuremath{", " ")
out_latex_str = out_latex_str.replace("}", " ")
# print(s, out_latex_str)
if out_latex_str.strip().startswith("\\text"):
out_latex_str = f" \\text{{{out_latex_str}}} "
# print(s, out_latex_str)
return out_latex_str
def do_r(self, elm):
"""
Get text from the 'r' element and try to convert it to LaTeX symbols
@todo text style support , (sty)
@todo \text (latex pure text support)
"""
_str = []
_base_str = []
for s in elm.findtext("./{0}t".format(OMML_NS)) or "":
out_latex_str = self.process_unicode(s)
_str.append(out_latex_str)
_base_str.append(s)
proc_str = escape_latex(BLANK.join(_str))
base_proc_str = BLANK.join(_base_str)
if "{" not in base_proc_str and "\\{" in proc_str:
proc_str = proc_str.replace("\\{", "{")
if "}" not in base_proc_str and "\\}" in proc_str:
proc_str = proc_str.replace("\\}", "}")
return proc_str
tag2meth = {
"acc": do_acc,
"r": do_r,
"bar": do_bar,
"sub": do_sub,
"sup": do_sup,
"f": do_f,
"func": do_func,
"fName": do_fname,
"groupChr": do_groupchr,
"d": do_d,
"rad": do_rad,
"eqArr": do_eqarr,
"limLow": do_limlow,
"limUpp": do_limupp,
"lim": do_lim,
"m": do_m,
"mr": do_mr,
"nary": do_nary,
}
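
The converter above routes each OMML element to a handler through the `tag2meth` table. The sketch below shows that dispatch pattern on its own, using the standard-library `xml.etree.ElementTree` instead of `lxml`. `MiniConverter` and its two handlers are hypothetical and cover only `r` and `f`; this is not the module's implementation.

```python
import xml.etree.ElementTree as ET

NS = "{http://schemas.openxmlformats.org/officeDocument/2006/math}"


class MiniConverter:
    """Toy tag-to-method dispatcher; handles only run (r) and fraction (f)."""

    def convert(self, elm) -> str:
        stag = elm.tag.replace(NS, "")
        method = self.tag2meth.get(stag)
        if method is None:
            # Unknown tags fall through to their children.
            return self.children(elm)
        return method(self, elm)

    def children(self, elm) -> str:
        return "".join(self.convert(child) for child in elm)

    def do_r(self, elm) -> str:
        # findtext returns None when there is no <m:t> child.
        return elm.findtext(NS + "t") or ""

    def do_f(self, elm) -> str:
        num = self.children(elm.find(NS + "num"))
        den = self.children(elm.find(NS + "den"))
        return "\\frac{{{num}}}{{{den}}}".format(num=num, den=den)

    # Plain functions stored in a class-level table, called as method(self, elm).
    tag2meth = {"r": do_r, "f": do_f}


xml = (
    '<m:oMath xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math">'
    "<m:f><m:num><m:r><m:t>a</m:t></m:r></m:num>"
    "<m:den><m:r><m:t>b</m:t></m:r></m:den></m:f>"
    "</m:oMath>"
)
latex = MiniConverter().convert(ET.fromstring(xml))
```

Keeping the table at class level means a new element type needs only one handler and one dictionary entry, which appears to be the design the adapted code follows.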

@ -134,7 +134,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
self.analyze_tag(cast(Tag, element), doc) self.analyze_tag(cast(Tag, element), doc)
except Exception as exc_child: except Exception as exc_child:
_log.error( _log.error(
f"Error processing child from tag{tag.name}: {exc_child}" f"Error processing child from tag {tag.name}: {repr(exc_child)}"
) )
raise exc_child raise exc_child
elif isinstance(element, NavigableString) and not isinstance( elif isinstance(element, NavigableString) and not isinstance(
@ -347,11 +347,11 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
content_layer=self.content_layer, content_layer=self.content_layer,
) )
self.level += 1 self.level += 1
self.walk(element, doc)
self.walk(element, doc) self.parents[self.level + 1] = None
self.level -= 1
self.parents[self.level + 1] = None else:
self.level -= 1 self.walk(element, doc)
elif element.text.strip(): elif element.text.strip():
text = element.text.strip() text = element.text.strip()
@ -457,7 +457,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
end_row_offset_idx=row_idx + row_span, end_row_offset_idx=row_idx + row_span,
start_col_offset_idx=col_idx, start_col_offset_idx=col_idx,
end_col_offset_idx=col_idx + col_span, end_col_offset_idx=col_idx + col_span,
col_header=col_header, column_header=col_header,
row_header=((not col_header) and html_cell.name == "th"), row_header=((not col_header) and html_cell.name == "th"),
) )
data.table_cells.append(table_cell) data.table_cells.append(table_cell)

@ -136,7 +136,7 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
end_row_offset_idx=trow_ind + row_span, end_row_offset_idx=trow_ind + row_span,
start_col_offset_idx=tcol_ind, start_col_offset_idx=tcol_ind,
end_col_offset_idx=tcol_ind + col_span, end_col_offset_idx=tcol_ind + col_span,
col_header=False, column_header=trow_ind == 0,
row_header=False, row_header=False,
) )
tcells.append(icell) tcells.append(icell)

@ -164,7 +164,7 @@ class MsExcelDocumentBackend(DeclarativeDocumentBackend):
end_row_offset_idx=excel_cell.row + excel_cell.row_span, end_row_offset_idx=excel_cell.row + excel_cell.row_span,
start_col_offset_idx=excel_cell.col, start_col_offset_idx=excel_cell.col,
end_col_offset_idx=excel_cell.col + excel_cell.col_span, end_col_offset_idx=excel_cell.col + excel_cell.col_span,
col_header=False, column_header=excel_cell.row == 0,
row_header=False, row_header=False,
) )
table_data.table_cells.append(cell) table_data.table_cells.append(cell)
@ -173,7 +173,7 @@ class MsExcelDocumentBackend(DeclarativeDocumentBackend):
return doc return doc
def _find_data_tables(self, sheet: Worksheet): def _find_data_tables(self, sheet: Worksheet) -> List[ExcelTable]:
""" """
Find all compact rectangular data tables in a sheet. Find all compact rectangular data tables in a sheet.
""" """
@ -340,47 +340,4 @@ class MsExcelDocumentBackend(DeclarativeDocumentBackend):
except: except:
_log.error("could not extract the image from excel sheets") _log.error("could not extract the image from excel sheets")
"""
for idx, chart in enumerate(sheet._charts): # type: ignore
try:
chart_path = f"chart_{idx + 1}.png"
_log.info(
f"Chart found, but dynamic rendering is required for: {chart_path}"
)
_log.info(f"Chart {idx + 1}:")
# Chart type
# _log.info(f"Type: {type(chart).__name__}")
print(f"Type: {type(chart).__name__}")
# Extract series data
for series_idx, series in enumerate(chart.series):
#_log.info(f"Series {series_idx + 1}:")
print(f"Series {series_idx + 1} type: {type(series).__name__}")
#print(f"x-values: {series.xVal}")
#print(f"y-values: {series.yVal}")
print(f"xval type: {type(series.xVal).__name__}")
xvals = []
for _ in series.xVal.numLit.pt:
print(f"xval type: {type(_).__name__}")
if hasattr(_, 'v'):
xvals.append(_.v)
print(f"x-values: {xvals}")
yvals = []
for _ in series.yVal:
if hasattr(_, 'v'):
yvals.append(_.v)
print(f"y-values: {yvals}")
except Exception as exc:
print(exc)
continue
"""
return doc return doc

@ -346,7 +346,7 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
end_row_offset_idx=row_idx + row_span, end_row_offset_idx=row_idx + row_span,
start_col_offset_idx=col_idx, start_col_offset_idx=col_idx,
end_col_offset_idx=col_idx + col_span, end_col_offset_idx=col_idx + col_span,
col_header=False, column_header=row_idx == 0,
row_header=False, row_header=False,
) )
if len(cell.text.strip()) > 0: if len(cell.text.strip()) > 0:

@ -26,6 +26,7 @@ from PIL import Image, UnidentifiedImageError
from typing_extensions import override from typing_extensions import override
from docling.backend.abstract_backend import DeclarativeDocumentBackend from docling.backend.abstract_backend import DeclarativeDocumentBackend
from docling.backend.docx.latex.omml import oMath2Latex
from docling.datamodel.base_models import InputFormat from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import InputDocument from docling.datamodel.document import InputDocument
@ -260,6 +261,25 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
else: else:
return label, None return label, None
def handle_equations_in_text(self, element, text):
only_texts = []
only_equations = []
texts_and_equations = []
for subt in element.iter():
tag_name = etree.QName(subt).localname
if tag_name == "t" and "math" not in subt.tag:
only_texts.append(subt.text)
texts_and_equations.append(subt.text)
elif "oMath" in subt.tag and "oMathPara" not in subt.tag:
latex_equation = str(oMath2Latex(subt))
only_equations.append(latex_equation)
texts_and_equations.append(latex_equation)
if "".join(only_texts) != text:
# Return a (text, equations) pair so the caller's two-value unpacking works
return text, []
return "".join(texts_and_equations), only_equations
def handle_text_elements( def handle_text_elements(
self, self,
element: BaseOxmlElement, element: BaseOxmlElement,
@ -268,9 +288,12 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
) -> None: ) -> None:
paragraph = Paragraph(element, docx_obj) paragraph = Paragraph(element, docx_obj)
if paragraph.text is None: raw_text = paragraph.text
text, equations = self.handle_equations_in_text(element=element, text=raw_text)
if text is None:
return return
text = paragraph.text.strip() text = text.strip()
# Common styles for bullet and numbered lists. # Common styles for bullet and numbered lists.
# "List Bullet", "List Number", "List Paragraph" # "List Bullet", "List Number", "List Paragraph"
@ -323,6 +346,45 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
elif "Heading" in p_style_id: elif "Heading" in p_style_id:
self.add_header(doc, p_level, text) self.add_header(doc, p_level, text)
elif len(equations) > 0:
if (raw_text is None or len(raw_text) == 0) and len(text) > 0:
# Standalone equation
level = self.get_level()
doc.add_text(
label=DocItemLabel.FORMULA,
parent=self.parents[level - 1],
text=text,
)
else:
# Inline equation
level = self.get_level()
inline_equation = doc.add_group(
label=GroupLabel.INLINE, parent=self.parents[level - 1]
)
text_tmp = text
for eq in equations:
if len(text_tmp) == 0:
break
pre_eq_text = text_tmp.split(eq, maxsplit=1)[0]
text_tmp = text_tmp.split(eq, maxsplit=1)[1]
if len(pre_eq_text) > 0:
doc.add_text(
label=DocItemLabel.PARAGRAPH,
parent=inline_equation,
text=pre_eq_text,
)
doc.add_text(
label=DocItemLabel.FORMULA,
parent=inline_equation,
text=eq,
)
if len(text_tmp) > 0:
doc.add_text(
label=DocItemLabel.PARAGRAPH,
parent=inline_equation,
text=text_tmp,
)
elif p_style_id in [ elif p_style_id in [
"Paragraph", "Paragraph",
"Normal", "Normal",
@ -539,7 +601,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
end_row_offset_idx=row.grid_cols_before + spanned_idx, end_row_offset_idx=row.grid_cols_before + spanned_idx,
start_col_offset_idx=col_idx, start_col_offset_idx=col_idx,
end_col_offset_idx=col_idx + cell.grid_span, end_col_offset_idx=col_idx + cell.grid_span,
col_header=False, column_header=row.grid_cols_before + row_idx == 0,
row_header=False, row_header=False,
) )
data.table_cells.append(table_cell) data.table_cells.append(table_cell)
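
Note on the inline-equation branch in `handle_text_elements` above: the loop peels each equation off the remaining paragraph text and emits alternating paragraph and formula items. The following is a minimal standalone sketch of that splitting logic, assuming plain strings instead of `DoclingDocument` items. The function name `split_inline_equations` and the `(label, text)` tuples are hypothetical and not part of Docling. Unlike the diff, which indexes `split(eq, maxsplit=1)[1]` directly, this sketch skips an equation that does not occur in the remaining text.

```python
from typing import List, Tuple


def split_inline_equations(text: str, equations: List[str]) -> List[Tuple[str, str]]:
    """Split text into ordered ("paragraph", ...) and ("formula", ...) segments.

    Hypothetical helper for illustration only; it mirrors the loop in the diff
    but works on plain strings rather than document items.
    """
    segments: List[Tuple[str, str]] = []
    remaining = text
    for eq in equations:
        if not remaining:
            break
        # partition() splits on the first occurrence only,
        # which matches split(eq, maxsplit=1) in the diff.
        before, found, after = remaining.partition(eq)
        if not found:
            # Assumption: an equation missing from the text is skipped
            # instead of raising, which differs from the diff.
            continue
        if before:
            segments.append(("paragraph", before))
        segments.append(("formula", eq))
        remaining = after
    if remaining:
        segments.append(("paragraph", remaining))
    return segments


segments = split_inline_equations(
    "Energy is E=mc^2 and force is F=ma.", ["E=mc^2", "F=ma"]
)
for label, chunk in segments:
    print(f"{label}: {chunk!r}")
```

Returning labelled tuples keeps the ordering logic testable without constructing a document; in the backend itself each tuple corresponds to one `doc.add_text(...)` call under the inline group.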

@ -121,7 +121,7 @@ def download(
"Using the CLI:", "Using the CLI:",
f"`docling --artifacts-path={output_dir} FILE`", f"`docling --artifacts-path={output_dir} FILE`",
"\n", "\n",
"Using Python: see the documentation at <https://ds4sd.github.io/docling/usage>.", "Using Python: see the documentation at <https://docling-project.github.io/docling/usage>.",
) )

@ -27,7 +27,7 @@ class OcrMacModel(BaseOcrModel):
"ocrmac is not correctly installed. " "ocrmac is not correctly installed. "
"Please install it via `pip install ocrmac` to use this OCR engine. " "Please install it via `pip install ocrmac` to use this OCR engine. "
"Alternatively, Docling has support for other OCR engines. See the documentation: " "Alternatively, Docling has support for other OCR engines. See the documentation: "
"https://ds4sd.github.io/docling/installation/" "https://docling-project.github.io/docling/installation/"
) )
try: try:
from ocrmac import ocrmac from ocrmac import ocrmac

@ -32,14 +32,14 @@ class TesseractOcrModel(BaseOcrModel):
"Note that tesserocr might have to be manually compiled for working with " "Note that tesserocr might have to be manually compiled for working with "
"your Tesseract installation. The Docling documentation provides examples for it. " "your Tesseract installation. The Docling documentation provides examples for it. "
"Alternatively, Docling has support for other OCR engines. See the documentation: " "Alternatively, Docling has support for other OCR engines. See the documentation: "
"https://ds4sd.github.io/docling/installation/" "https://docling-project.github.io/docling/installation/"
) )
missing_langs_errmsg = ( missing_langs_errmsg = (
"tesserocr is not correctly configured. No language models have been detected. " "tesserocr is not correctly configured. No language models have been detected. "
"Please ensure that the TESSDATA_PREFIX envvar points to tesseract languages dir. " "Please ensure that the TESSDATA_PREFIX envvar points to tesseract languages dir. "
"You can find more information how to setup other OCR engines in Docling " "You can find more information how to setup other OCR engines in Docling "
"documentation: " "documentation: "
"https://ds4sd.github.io/docling/installation/" "https://docling-project.github.io/docling/installation/"
) )
try: try:

@ -7,7 +7,7 @@ pydantic datatype, which can express several features common to documents, such
* Layout information (i.e. bounding boxes) for all items, if available * Layout information (i.e. bounding boxes) for all items, if available
* Provenance information * Provenance information
The definition of the Pydantic types is implemented in the module `docling_core.types.doc`, more details in [source code definitions](https://github.com/DS4SD/docling-core/tree/main/docling_core/types/doc). The definition of the Pydantic types is implemented in the module `docling_core.types.doc`, more details in [source code definitions](https://github.com/docling-project/docling-core/tree/main/docling_core/types/doc).
It also brings a set of document construction APIs to build up a `DoclingDocument` from scratch. It also brings a set of document construction APIs to build up a `DoclingDocument` from scratch.

@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/backend_xml_rag.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" "<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/backend_xml_rag.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
] ]
}, },
{ {
@ -36,7 +36,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"This is an example of using [Docling](https://ds4sd.github.io/docling/) for converting structured data (XML) into a unified document\n", "This is an example of using [Docling](https://docling-project.github.io/docling/) for converting structured data (XML) into a unified document\n",
"representation format, `DoclingDocument`, and leverage its rich structured content for RAG applications.\n", "representation format, `DoclingDocument`, and leverage its rich structured content for RAG applications.\n",
"\n", "\n",
"Data used in this example consist of patents from the [United States Patent and Trademark Office (USPTO)](https://www.uspto.gov/) and medical\n", "Data used in this example consist of patents from the [United States Patent and Trademark Office (USPTO)](https://www.uspto.gov/) and medical\n",

@ -103,7 +103,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"> 👉 **NOTE**: As you see above, using the `HybridChunker` can sometimes lead to a warning from the transformers library, however this is a \"false alarm\" — for details check [here](https://ds4sd.github.io/docling/faq/#hybridchunker-triggers-warning-token-indices-sequence-length-is-longer-than-the-specified-maximum-sequence-length-for-this-model)." "> 👉 **NOTE**: As you see above, using the `HybridChunker` can sometimes lead to a warning from the transformers library, however this is a \"false alarm\" — for details check [here](https://docling-project.github.io/docling/faq/#hybridchunker-triggers-warning-token-indices-sequence-length-is-longer-than-the-specified-maximum-sequence-length-for-this-model)."
] ]
}, },
{ {

@ -321,7 +321,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "docling-aMWN2FRM-py3.12", "display_name": "docling-hgXEfXco-py3.12",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },

@ -36,7 +36,7 @@
"## A recipe 🧑‍🍳 🐥 💚\n", "## A recipe 🧑‍🍳 🐥 💚\n",
"\n", "\n",
"This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system using:\n", "This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system using:\n",
"- [Docling](https://ds4sd.github.io/docling/) for document parsing and chunking\n", "- [Docling](https://docling-project.github.io/docling/) for document parsing and chunking\n",
"- [Azure AI Search](https://azure.microsoft.com/products/ai-services/ai-search/?msockid=0109678bea39665431e37323ebff6723) for vector indexing and retrieval\n", "- [Azure AI Search](https://azure.microsoft.com/products/ai-services/ai-search/?msockid=0109678bea39665431e37323ebff6723) for vector indexing and retrieval\n",
"- [Azure OpenAI](https://azure.microsoft.com/products/ai-services/openai-service?msockid=0109678bea39665431e37323ebff6723) for embeddings and chat completion\n", "- [Azure OpenAI](https://azure.microsoft.com/products/ai-services/openai-service?msockid=0109678bea39665431e37323ebff6723) for embeddings and chat completion\n",
"\n", "\n",

@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_haystack.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" "<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_haystack.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
] ]
}, },
{ {
@ -247,7 +247,7 @@
"name": "stderr", "name": "stderr",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.12/site-packages/huggingface_hub/inference/_client.py:2232: FutureWarning: `stop_sequences` is a deprecated argument for `text_generation` task and will be removed in version '0.28.0'. Use `stop` instead.\n", "/Users/pva/work/github.com/docling-project/docling/.venv/lib/python3.12/site-packages/huggingface_hub/inference/_client.py:2232: FutureWarning: `stop_sequences` is a deprecated argument for `text_generation` task and will be removed in version '0.28.0'. Use `stop` instead.\n",
" warnings.warn(\n" " warnings.warn(\n"
] ]
} }

@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_langchain.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" "<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_langchain.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
] ]
}, },
{ {
@ -168,7 +168,7 @@
"source": [ "source": [
"> Note: a message saying `\"Token indices sequence length is longer than the specified\n", "> Note: a message saying `\"Token indices sequence length is longer than the specified\n",
"maximum sequence length...\"` can be ignored in this case — details\n", "maximum sequence length...\"` can be ignored in this case — details\n",
"[here](https://github.com/DS4SD/docling-core/issues/119#issuecomment-2577418826)." "[here](https://github.com/docling-project/docling-core/issues/119#issuecomment-2577418826)."
] ]
}, },
{ {

@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_llamaindex.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" "<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_llamaindex.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
] ]
}, },
{ {

@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_weaviate.ipynb)" "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_weaviate.ipynb)"
] ]
}, },
{ {
@ -29,7 +29,7 @@
"\n", "\n",
"## A recipe 🧑‍🍳 🐥 💚\n", "## A recipe 🧑‍🍳 🐥 💚\n",
"\n", "\n",
"This is a code recipe that uses [Weaviate](https://weaviate.io/) to perform RAG over PDF documents parsed by [Docling](https://ds4sd.github.io/docling/).\n", "This is a code recipe that uses [Weaviate](https://weaviate.io/) to perform RAG over PDF documents parsed by [Docling](https://docling-project.github.io/docling/).\n",
"\n", "\n",
"In this notebook, we accomplish the following:\n", "In this notebook, we accomplish the following:\n",
"* Parse the top machine learning papers on [arXiv](https://arxiv.org/) using Docling\n", "* Parse the top machine learning papers on [arXiv](https://arxiv.org/) using Docling\n",

@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/hybrid_rag_qdrant\n", "<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/hybrid_rag_qdrant\n",
".ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" ".ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
] ]
}, },
@ -109,7 +109,7 @@
"name": "stderr", "name": "stderr",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.12/site-packages/huggingface_hub/utils/tqdm.py:155: UserWarning: Cannot enable progress bars: environment variable `HF_HUB_DISABLE_PROGRESS_BARS=1` is set and has priority.\n", "/Users/pva/work/github.com/docling-project/docling/.venv/lib/python3.12/site-packages/huggingface_hub/utils/tqdm.py:155: UserWarning: Cannot enable progress bars: environment variable `HF_HUB_DISABLE_PROGRESS_BARS=1` is set and has priority.\n",
" warnings.warn(\n" " warnings.warn(\n"
] ]
} }

@ -1,6 +1,6 @@
# FAQ # FAQ
This is a collection of FAQ collected from the user questions on <https://github.com/DS4SD/docling/discussions>. This is a collection of FAQ collected from the user questions on <https://github.com/docling-project/docling/discussions>.
??? question "Is Python 3.13 supported?" ??? question "Is Python 3.13 supported?"
@ -41,7 +41,7 @@ This is a collection of FAQ collected from the user questions on <https://github
] ]
``` ```
Source: Issue [#283](https://github.com/DS4SD/docling/issues/283#issuecomment-2465035868) Source: Issue [#283](https://github.com/docling-project/docling/issues/283#issuecomment-2465035868)
??? question "Are text styles (bold, underline, etc) supported?" ??? question "Are text styles (bold, underline, etc) supported?"
@ -74,7 +74,7 @@ This is a collection of FAQ collected from the user questions on <https://github
) )
``` ```
Source: Issue [#326](https://github.com/DS4SD/docling/issues/326) Source: Issue [#326](https://github.com/docling-project/docling/issues/326)
??? question " Which model weights are needed to run Docling?" ??? question " Which model weights are needed to run Docling?"
@ -84,7 +84,7 @@ This is a collection of FAQ collected from the user questions on <https://github
For processing PDF documents, Docling requires the model weights from <https://huggingface.co/ds4sd/docling-models>. For processing PDF documents, Docling requires the model weights from <https://huggingface.co/ds4sd/docling-models>.
When OCR is enabled, some engines also require model artifacts. For example EasyOCR, for which Docling has [special pipeline options](https://github.com/DS4SD/docling/blob/main/docling/datamodel/pipeline_options.py#L68) to control the runtime behavior. When OCR is enabled, some engines also require model artifacts. For example EasyOCR, for which Docling has [special pipeline options](https://github.com/docling-project/docling/blob/main/docling/datamodel/pipeline_options.py#L68) to control the runtime behavior.
??? question "SSL error downloading model weights" ??? question "SSL error downloading model weights"
@ -174,6 +174,6 @@ This is a collection of FAQ collected from the user questions on <https://github
print(f"Model max length: {tokenizer.model_max_length}") print(f"Model max length: {tokenizer.model_max_length}")
``` ```
Also see [docling#725](https://github.com/DS4SD/docling/issues/725). Also see [docling#725](https://github.com/docling-project/docling/issues/725).
Source: Issue [docling-core#119](https://github.com/DS4SD/docling-core/issues/119) Source: Issue [docling-core#119](https://github.com/docling-project/docling-core/issues/119)

@ -11,7 +11,7 @@
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/) [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev) [![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit) [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT) [![License MIT](https://img.shields.io/github/license/docling-project/docling)](https://opensource.org/licenses/MIT)
[![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling) [![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem. Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.

@ -5,7 +5,7 @@ Docling is available as a converter in [Haystack](https://haystack.deepset.ai/):
- 🧑🏽‍🍳 [Docling Haystack integration example][example] - 🧑🏽‍🍳 [Docling Haystack integration example][example]
- 📦 [Docling Haystack integration PyPI][pypi] - 📦 [Docling Haystack integration PyPI][pypi]
[github]: https://github.com/DS4SD/docling-haystack [github]: https://github.com/docling-project/docling-haystack
[docs]: https://haystack.deepset.ai/integrations/docling [docs]: https://haystack.deepset.ai/integrations/docling
[pypi]: https://pypi.org/project/docling-haystack [pypi]: https://pypi.org/project/docling-haystack
[example]: ../examples/rag_haystack.ipynb [example]: ../examples/rag_haystack.ipynb

@ -8,7 +8,7 @@ To get started, check out the [step-by-step guide in LangChain][guide].
- 📦 [LangChain Docling integration PyPI][pypi] - 📦 [LangChain Docling integration PyPI][pypi]
[docs]: https://python.langchain.com/docs/integrations/providers/docling/ [docs]: https://python.langchain.com/docs/integrations/providers/docling/
[github]: https://github.com/DS4SD/docling-langchain [github]: https://github.com/docling-project/docling-langchain
[guide]: https://python.langchain.com/docs/integrations/document_loaders/docling/ [guide]: https://python.langchain.com/docs/integrations/document_loaders/docling/
[example]: ../examples/rag_langchain.ipynb [example]: ../examples/rag_langchain.ipynb
[pypi]: https://pypi.org/project/langchain-docling/ [pypi]: https://pypi.org/project/langchain-docling/

@ -1,7 +1,7 @@
site_name: Docling site_name: Docling
site_url: https://ds4sd.github.io/docling/ site_url: https://docling-project.github.io/docling/
repo_name: DS4SD/docling repo_name: docling-project/docling
repo_url: https://github.com/DS4SD/docling repo_url: https://github.com/docling-project/docling
theme: theme:
name: material name: material

42
poetry.lock generated

@ -946,8 +946,8 @@ tabulate = ">=0.9.0,<1.0.0"
[package.source] [package.source]
type = "git" type = "git"
url = "https://github.com/DS4SD/docling-parse" url = "https://github.com/DS4SD/docling-parse"
reference = "cau/api-move-to-docling-core" reference = "main"
resolved_reference = "6d573965abf6e3492b5d41d5cfc52ccd77ab100c" resolved_reference = "a655bc9d59c287661111f6f3d351d61f2239bd86"
[[package]] [[package]]
name = "docutils" name = "docutils"
@ -1064,13 +1064,13 @@ devel = ["colorama", "json-spec", "jsonschema", "pylint", "pytest", "pytest-benc
[[package]] [[package]]
name = "filelock" name = "filelock"
version = "3.17.0" version = "3.18.0"
description = "A platform independent file lock." description = "A platform independent file lock."
optional = false optional = false
python-versions = ">=3.9" python-versions = ">=3.9"
files = [ files = [
{file = "filelock-3.17.0-py3-none-any.whl", hash = "sha256:533dc2f7ba78dc2f0f531fc6c4940addf7b70a481e269a5a3b93be94ffbe8338"}, {file = "filelock-3.18.0-py3-none-any.whl", hash = "sha256:c401f4f8377c4464e6db25fff06205fd89bdd83b65eb0488ed1b160f780e21de"},
{file = "filelock-3.17.0.tar.gz", hash = "sha256:ee4e77401ef576ebb38cd7f13b9b28893194acc20a8e68e18730ba9c0e54660e"}, {file = "filelock-3.18.0.tar.gz", hash = "sha256:adbc88eabb99d2fec8c9c1b229b171f18afa655400173ddc653d5d01501fb9f2"},
] ]
[package.extras] [package.extras]
@ -4413,20 +4413,20 @@ files = [
[[package]] [[package]]
name = "protobuf" name = "protobuf"
version = "6.30.0" version = "6.30.1"
description = "" description = ""
optional = false optional = false
python-versions = ">=3.9" python-versions = ">=3.9"
files = [ files = [
{file = "protobuf-6.30.0-cp310-abi3-win32.whl", hash = "sha256:7337d76d8efe65ee09ee566b47b5914c517190196f414e5418fa236dfd1aed3e"}, {file = "protobuf-6.30.1-cp310-abi3-win32.whl", hash = "sha256:ba0706f948d0195f5cac504da156d88174e03218d9364ab40d903788c1903d7e"},
{file = "protobuf-6.30.0-cp310-abi3-win_amd64.whl", hash = "sha256:9b33d51cc95a7ec4f407004c8b744330b6911a37a782e2629c67e1e8ac41318f"}, {file = "protobuf-6.30.1-cp310-abi3-win_amd64.whl", hash = "sha256:ed484f9ddd47f0f1bf0648806cccdb4fe2fb6b19820f9b79a5adf5dcfd1b8c5f"},
{file = "protobuf-6.30.0-cp39-abi3-macosx_10_9_universal2.whl", hash = "sha256:52d4bb6fe76005860e1d0b8bfa126f5c97c19cc82704961f60718f50be16942d"}, {file = "protobuf-6.30.1-cp39-abi3-macosx_10_9_universal2.whl", hash = "sha256:aa4f7dfaed0d840b03d08d14bfdb41348feaee06a828a8c455698234135b4075"},
{file = "protobuf-6.30.0-cp39-abi3-manylinux2014_aarch64.whl", hash = "sha256:7940ab4dfd60d514b2e1d3161549ea7aed5be37d53bafde16001ac470a3e202b"}, {file = "protobuf-6.30.1-cp39-abi3-manylinux2014_aarch64.whl", hash = "sha256:47cd320b7db63e8c9ac35f5596ea1c1e61491d8a8eb6d8b45edc44760b53a4f6"},
{file = "protobuf-6.30.0-cp39-abi3-manylinux2014_x86_64.whl", hash = "sha256:d79bf6a202a536b192b7e8d295d7eece0c86fbd9b583d147faf8cfeff46bf598"}, {file = "protobuf-6.30.1-cp39-abi3-manylinux2014_x86_64.whl", hash = "sha256:e3083660225fa94748ac2e407f09a899e6a28bf9c0e70c75def8d15706bf85fc"},
{file = "protobuf-6.30.0-cp39-cp39-win32.whl", hash = "sha256:bb35ad251d222f03d6c4652c072dfee156be0ef9578373929c1a7ead2bd5492c"}, {file = "protobuf-6.30.1-cp39-cp39-win32.whl", hash = "sha256:554d7e61cce2aa4c63ca27328f757a9f3867bce8ec213bf09096a8d16bcdcb6a"},
{file = "protobuf-6.30.0-cp39-cp39-win_amd64.whl", hash = "sha256:501810e0eba1d327e783fde47cc767a563b0f1c292f1a3546d4f2b8c3612d4d0"}, {file = "protobuf-6.30.1-cp39-cp39-win_amd64.whl", hash = "sha256:b510f55ce60f84dc7febc619b47215b900466e3555ab8cb1ba42deb4496d6cc0"},
{file = "protobuf-6.30.0-py3-none-any.whl", hash = "sha256:e5ef216ea061b262b8994cb6b7d6637a4fb27b3fb4d8e216a6040c0b93bd10d7"}, {file = "protobuf-6.30.1-py3-none-any.whl", hash = "sha256:3c25e51e1359f1f5fa3b298faa6016e650d148f214db2e47671131b9063c53be"},
{file = "protobuf-6.30.0.tar.gz", hash = "sha256:852b675d276a7d028f660da075af1841c768618f76b90af771a8e2c29e6f5965"}, {file = "protobuf-6.30.1.tar.gz", hash = "sha256:535fb4e44d0236893d5cf1263a0f706f1160b689a7ab962e9da8a9ce4050b780"},
] ]
[[package]] [[package]]
@ -4789,6 +4789,16 @@ files = [
[package.extras] [package.extras]
windows-terminal = ["colorama (>=0.4.6)"] windows-terminal = ["colorama (>=0.4.6)"]
[[package]]
name = "pylatexenc"
version = "2.10"
description = "Simple LaTeX parser providing latex-to-unicode and unicode-to-latex conversion"
optional = false
python-versions = "*"
files = [
{file = "pylatexenc-2.10.tar.gz", hash = "sha256:3dd8fd84eb46dc30bee1e23eaab8d8fb5a7f507347b23e5f38ad9675c84f40d3"},
]
[[package]] [[package]]
name = "pylint" name = "pylint"
version = "2.17.7" version = "2.17.7"
@ -7806,4 +7816,4 @@ vlm = ["accelerate", "transformers", "transformers"]
[metadata] [metadata]
lock-version = "2.0" lock-version = "2.0"
python-versions = "^3.9" python-versions = "^3.9"
content-hash = "da6afcbfeefb3a45560d4098c5a1345333fc833fd13e6408aacb06c6d18317f0" content-hash = "86d3894f8f998af4b7f766ec5060f9f64d532d9b6611d4836271bc0fdfd796c7"

@ -2,23 +2,43 @@
name = "docling" name = "docling"
version = "2.26.0" # DO NOT EDIT, updated automatically version = "2.26.0" # DO NOT EDIT, updated automatically
description = "SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications." description = "SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications."
authors = ["Christoph Auer <cau@zurich.ibm.com>", "Michele Dolfi <dol@zurich.ibm.com>", "Maxim Lysak <mly@zurich.ibm.com>", "Nikos Livathinos <nli@zurich.ibm.com>", "Ahmed Nassar <ahn@zurich.ibm.com>", "Panos Vagenas <pva@zurich.ibm.com>", "Peter Staar <taa@zurich.ibm.com>"] authors = [
"Christoph Auer <cau@zurich.ibm.com>",
"Michele Dolfi <dol@zurich.ibm.com>",
"Maxim Lysak <mly@zurich.ibm.com>",
"Nikos Livathinos <nli@zurich.ibm.com>",
"Ahmed Nassar <ahn@zurich.ibm.com>",
"Panos Vagenas <pva@zurich.ibm.com>",
"Peter Staar <taa@zurich.ibm.com>",
]
license = "MIT" license = "MIT"
readme = "README.md" readme = "README.md"
repository = "https://github.com/DS4SD/docling" repository = "https://github.com/docling-project/docling"
homepage = "https://github.com/DS4SD/docling" homepage = "https://github.com/docling-project/docling"
keywords= ["docling", "convert", "document", "pdf", "docx", "html", "markdown", "layout model", "segmentation", "table structure", "table former"] keywords = [
classifiers = [ "docling",
"License :: OSI Approved :: MIT License", "convert",
"Operating System :: MacOS :: MacOS X", "document",
"Operating System :: POSIX :: Linux", "pdf",
"Development Status :: 5 - Production/Stable", "docx",
"Intended Audience :: Developers", "html",
"Intended Audience :: Science/Research", "markdown",
"Topic :: Scientific/Engineering :: Artificial Intelligence", "layout model",
"Programming Language :: Python :: 3" "segmentation",
] "table structure",
packages = [{include = "docling"}] "table former",
]
classifiers = [
"License :: OSI Approved :: MIT License",
"Operating System :: MacOS :: MacOS X",
"Operating System :: POSIX :: Linux",
"Development Status :: 5 - Production/Stable",
"Intended Audience :: Developers",
"Intended Audience :: Science/Research",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
"Programming Language :: Python :: 3",
]
packages = [{ include = "docling" }]
[tool.poetry.dependencies] [tool.poetry.dependencies]
###################### ######################
@ -28,7 +48,7 @@ python = "^3.9"
pydantic = "^2.0.0" pydantic = "^2.0.0"
docling-core = {extras = ["chunking"], version = "^2.23.0"} docling-core = {extras = ["chunking"], version = "^2.23.0"}
docling-ibm-models = "^3.4.0" docling-ibm-models = "^3.4.0"
docling-parse = {git = "https://github.com/DS4SD/docling-parse", rev = "cau/api-move-to-docling-core"} docling-parse = {git = "https://github.com/DS4SD/docling-parse", rev = "main"}
filetype = "^1.2.0" filetype = "^1.2.0"
pypdfium2 = "^4.30.0" pypdfium2 = "^4.30.0"
pydantic-settings = "^2.3.0" pydantic-settings = "^2.3.0"
@ -40,7 +60,7 @@ certifi = ">=2024.7.4"
rtree = "^1.3.0" rtree = "^1.3.0"
scipy = [ scipy = [
{ version = "^1.6.0", markers = "python_version >= '3.10'" }, { version = "^1.6.0", markers = "python_version >= '3.10'" },
{ version = ">=1.6.0,<1.14.0", markers = "python_version < '3.10'" } { version = ">=1.6.0,<1.14.0", markers = "python_version < '3.10'" },
] ]
typer = "^0.12.5" typer = "^0.12.5"
python-docx = "^1.1.2" python-docx = "^1.1.2"
@ -56,21 +76,22 @@ onnxruntime = [
# 1.19.2 is the last version with python3.9 support, # 1.19.2 is the last version with python3.9 support,
# see https://github.com/microsoft/onnxruntime/releases/tag/v1.20.0 # see https://github.com/microsoft/onnxruntime/releases/tag/v1.20.0
{ version = ">=1.7.0,<1.20.0", optional = true, markers = "python_version < '3.10'" }, { version = ">=1.7.0,<1.20.0", optional = true, markers = "python_version < '3.10'" },
{ version = "^1.7.0", optional = true, markers = "python_version >= '3.10'" } { version = "^1.7.0", optional = true, markers = "python_version >= '3.10'" },
] ]
transformers = [ transformers = [
{markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^4.46.0", optional = true }, { markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^4.46.0", optional = true },
{markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~4.42.0", optional = true } { markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~4.42.0", optional = true },
] ]
accelerate = [ accelerate = [
{markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^1.2.1", optional = true }, { markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^1.2.1", optional = true },
] ]
pillow = ">=10.0.0,<12.0.0" pillow = ">=10.0.0,<12.0.0"
tqdm = "^4.65.0" tqdm = "^4.65.0"
pylatexenc = "^2.10"
[tool.poetry.group.dev.dependencies] [tool.poetry.group.dev.dependencies]
black = {extras = ["jupyter"], version = "^24.4.2"} black = { extras = ["jupyter"], version = "^24.4.2" }
pytest = "^7.2.2" pytest = "^7.2.2"
pre-commit = "^3.7.1" pre-commit = "^3.7.1"
mypy = "^1.10.1" mypy = "^1.10.1"
@ -93,7 +114,7 @@ types-tqdm = "^4.67.0.20241221"
mkdocs-material = "^9.5.40" mkdocs-material = "^9.5.40"
mkdocs-jupyter = "^0.25.0" mkdocs-jupyter = "^0.25.0"
mkdocs-click = "^0.8.1" mkdocs-click = "^0.8.1"
mkdocstrings = {extras = ["python"], version = "^0.27.0"} mkdocstrings = { extras = ["python"], version = "^0.27.0" }
griffe-pydantic = "^1.1.0" griffe-pydantic = "^1.1.0"
[tool.poetry.group.examples.dependencies] [tool.poetry.group.examples.dependencies]
@ -108,8 +129,8 @@ optional = true
[tool.poetry.group.constraints.dependencies] [tool.poetry.group.constraints.dependencies]
numpy = [ numpy = [
{ version = ">=1.24.4,<3.0.0", markers = 'python_version >= "3.10"' }, { version = ">=1.24.4,<3.0.0", markers = 'python_version >= "3.10"' },
{ version = ">=1.24.4,<2.1.0", markers = 'python_version < "3.10"' }, { version = ">=1.24.4,<2.1.0", markers = 'python_version < "3.10"' },
] ]
[tool.poetry.group.mac_intel] [tool.poetry.group.mac_intel]
@ -117,12 +138,12 @@ optional = true
[tool.poetry.group.mac_intel.dependencies] [tool.poetry.group.mac_intel.dependencies]
torch = [ torch = [
{markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^2.2.2"}, { markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^2.2.2" },
{markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~2.2.2"} { markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~2.2.2" },
] ]
torchvision = [ torchvision = [
{markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^0"}, { markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^0" },
{markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~0.17.2"} { markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~0.17.2" },
] ]
[tool.poetry.extras] [tool.poetry.extras]
@ -147,7 +168,7 @@ include = '\.pyi?$'
[tool.isort] [tool.isort]
profile = "black" profile = "black"
line_length = 88 line_length = 88
py_version=39 py_version = 39
[tool.mypy] [tool.mypy]
pretty = true pretty = true
@ -158,18 +179,19 @@ python_version = "3.10"
[[tool.mypy.overrides]] [[tool.mypy.overrides]]
module = [ module = [
"docling_parse.*", "docling_parse.*",
"pypdfium2.*", "pypdfium2.*",
"networkx.*", "networkx.*",
"scipy.*", "scipy.*",
"filetype.*", "filetype.*",
"tesserocr.*", "tesserocr.*",
"docling_ibm_models.*", "docling_ibm_models.*",
"easyocr.*", "easyocr.*",
"ocrmac.*", "ocrmac.*",
"lxml.*", "lxml.*",
"huggingface_hub.*", "huggingface_hub.*",
"transformers.*", "transformers.*",
"pylatexenc.*",
] ]
ignore_missing_imports = true ignore_missing_imports = true

Binary file not shown.

@ -51,7 +51,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "1", "text": "1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -63,7 +63,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "2", "text": "2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -75,7 +75,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "3", "text": "3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -87,7 +87,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "4", "text": "4",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -296,7 +296,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "1", "text": "1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -308,7 +308,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "2", "text": "2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -320,7 +320,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "3", "text": "3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -332,7 +332,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "4", "text": "4",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }

@ -51,7 +51,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Index", "text": "Index",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -63,7 +63,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Customer Id", "text": "Customer Id",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -75,7 +75,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "First Name", "text": "First Name",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -87,7 +87,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "Last Name", "text": "Last Name",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -99,7 +99,7 @@
"start_col_offset_idx": 4, "start_col_offset_idx": 4,
"end_col_offset_idx": 5, "end_col_offset_idx": 5,
"text": "Company", "text": "Company",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -111,7 +111,7 @@
"start_col_offset_idx": 5, "start_col_offset_idx": 5,
"end_col_offset_idx": 6, "end_col_offset_idx": 6,
"text": "City", "text": "City",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -123,7 +123,7 @@
"start_col_offset_idx": 6, "start_col_offset_idx": 6,
"end_col_offset_idx": 7, "end_col_offset_idx": 7,
"text": "Country", "text": "Country",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -135,7 +135,7 @@
"start_col_offset_idx": 7, "start_col_offset_idx": 7,
"end_col_offset_idx": 8, "end_col_offset_idx": 8,
"text": "Phone 1", "text": "Phone 1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -147,7 +147,7 @@
"start_col_offset_idx": 8, "start_col_offset_idx": 8,
"end_col_offset_idx": 9, "end_col_offset_idx": 9,
"text": "Phone 2", "text": "Phone 2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -159,7 +159,7 @@
"start_col_offset_idx": 9, "start_col_offset_idx": 9,
"end_col_offset_idx": 10, "end_col_offset_idx": 10,
"text": "Email", "text": "Email",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -171,7 +171,7 @@
"start_col_offset_idx": 10, "start_col_offset_idx": 10,
"end_col_offset_idx": 11, "end_col_offset_idx": 11,
"text": "Subscription Date", "text": "Subscription Date",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -183,7 +183,7 @@
"start_col_offset_idx": 11, "start_col_offset_idx": 11,
"end_col_offset_idx": 12, "end_col_offset_idx": 12,
"text": "Website", "text": "Website",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -920,7 +920,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Index", "text": "Index",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -932,7 +932,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Customer Id", "text": "Customer Id",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -944,7 +944,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "First Name", "text": "First Name",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -956,7 +956,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "Last Name", "text": "Last Name",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -968,7 +968,7 @@
"start_col_offset_idx": 4, "start_col_offset_idx": 4,
"end_col_offset_idx": 5, "end_col_offset_idx": 5,
"text": "Company", "text": "Company",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -980,7 +980,7 @@
"start_col_offset_idx": 5, "start_col_offset_idx": 5,
"end_col_offset_idx": 6, "end_col_offset_idx": 6,
"text": "City", "text": "City",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -992,7 +992,7 @@
"start_col_offset_idx": 6, "start_col_offset_idx": 6,
"end_col_offset_idx": 7, "end_col_offset_idx": 7,
"text": "Country", "text": "Country",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1004,7 +1004,7 @@
"start_col_offset_idx": 7, "start_col_offset_idx": 7,
"end_col_offset_idx": 8, "end_col_offset_idx": 8,
"text": "Phone 1", "text": "Phone 1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1016,7 +1016,7 @@
"start_col_offset_idx": 8, "start_col_offset_idx": 8,
"end_col_offset_idx": 9, "end_col_offset_idx": 9,
"text": "Phone 2", "text": "Phone 2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1028,7 +1028,7 @@
"start_col_offset_idx": 9, "start_col_offset_idx": 9,
"end_col_offset_idx": 10, "end_col_offset_idx": 10,
"text": "Email", "text": "Email",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1040,7 +1040,7 @@
"start_col_offset_idx": 10, "start_col_offset_idx": 10,
"end_col_offset_idx": 11, "end_col_offset_idx": 11,
"text": "Subscription Date", "text": "Subscription Date",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1052,7 +1052,7 @@
"start_col_offset_idx": 11, "start_col_offset_idx": 11,
"end_col_offset_idx": 12, "end_col_offset_idx": 12,
"text": "Website", "text": "Website",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }

@ -51,7 +51,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "1", "text": "1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -63,7 +63,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "2", "text": "2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -75,7 +75,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "3", "text": "3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -284,7 +284,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "1", "text": "1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -296,7 +296,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "2", "text": "2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -308,7 +308,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "3", "text": "3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },

@ -51,7 +51,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Index", "text": "Index",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -63,7 +63,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Customer Id", "text": "Customer Id",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -75,7 +75,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "First Name", "text": "First Name",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -87,7 +87,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "Last Name", "text": "Last Name",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -99,7 +99,7 @@
"start_col_offset_idx": 4, "start_col_offset_idx": 4,
"end_col_offset_idx": 5, "end_col_offset_idx": 5,
"text": "Company", "text": "Company",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -111,7 +111,7 @@
"start_col_offset_idx": 5, "start_col_offset_idx": 5,
"end_col_offset_idx": 6, "end_col_offset_idx": 6,
"text": "City", "text": "City",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -123,7 +123,7 @@
"start_col_offset_idx": 6, "start_col_offset_idx": 6,
"end_col_offset_idx": 7, "end_col_offset_idx": 7,
"text": "Country", "text": "Country",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -135,7 +135,7 @@
"start_col_offset_idx": 7, "start_col_offset_idx": 7,
"end_col_offset_idx": 8, "end_col_offset_idx": 8,
"text": "Phone 1", "text": "Phone 1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -147,7 +147,7 @@
"start_col_offset_idx": 8, "start_col_offset_idx": 8,
"end_col_offset_idx": 9, "end_col_offset_idx": 9,
"text": "Phone 2", "text": "Phone 2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -159,7 +159,7 @@
"start_col_offset_idx": 9, "start_col_offset_idx": 9,
"end_col_offset_idx": 10, "end_col_offset_idx": 10,
"text": "Email", "text": "Email",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -171,7 +171,7 @@
"start_col_offset_idx": 10, "start_col_offset_idx": 10,
"end_col_offset_idx": 11, "end_col_offset_idx": 11,
"text": "Subscription Date", "text": "Subscription Date",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -183,7 +183,7 @@
"start_col_offset_idx": 11, "start_col_offset_idx": 11,
"end_col_offset_idx": 12, "end_col_offset_idx": 12,
"text": "Website", "text": "Website",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -920,7 +920,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Index", "text": "Index",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -932,7 +932,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Customer Id", "text": "Customer Id",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -944,7 +944,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "First Name", "text": "First Name",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -956,7 +956,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "Last Name", "text": "Last Name",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -968,7 +968,7 @@
"start_col_offset_idx": 4, "start_col_offset_idx": 4,
"end_col_offset_idx": 5, "end_col_offset_idx": 5,
"text": "Company", "text": "Company",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -980,7 +980,7 @@
"start_col_offset_idx": 5, "start_col_offset_idx": 5,
"end_col_offset_idx": 6, "end_col_offset_idx": 6,
"text": "City", "text": "City",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -992,7 +992,7 @@
"start_col_offset_idx": 6, "start_col_offset_idx": 6,
"end_col_offset_idx": 7, "end_col_offset_idx": 7,
"text": "Country", "text": "Country",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1004,7 +1004,7 @@
"start_col_offset_idx": 7, "start_col_offset_idx": 7,
"end_col_offset_idx": 8, "end_col_offset_idx": 8,
"text": "Phone 1", "text": "Phone 1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1016,7 +1016,7 @@
"start_col_offset_idx": 8, "start_col_offset_idx": 8,
"end_col_offset_idx": 9, "end_col_offset_idx": 9,
"text": "Phone 2", "text": "Phone 2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1028,7 +1028,7 @@
"start_col_offset_idx": 9, "start_col_offset_idx": 9,
"end_col_offset_idx": 10, "end_col_offset_idx": 10,
"text": "Email", "text": "Email",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1040,7 +1040,7 @@
"start_col_offset_idx": 10, "start_col_offset_idx": 10,
"end_col_offset_idx": 11, "end_col_offset_idx": 11,
"text": "Subscription Date", "text": "Subscription Date",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1052,7 +1052,7 @@
"start_col_offset_idx": 11, "start_col_offset_idx": 11,
"end_col_offset_idx": 12, "end_col_offset_idx": 12,
"text": "Website", "text": "Website",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }

@ -51,7 +51,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Index", "text": "Index",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -63,7 +63,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Customer Id", "text": "Customer Id",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -75,7 +75,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "First Name", "text": "First Name",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -87,7 +87,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "Last Name", "text": "Last Name",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -99,7 +99,7 @@
"start_col_offset_idx": 4, "start_col_offset_idx": 4,
"end_col_offset_idx": 5, "end_col_offset_idx": 5,
"text": "Company", "text": "Company",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -111,7 +111,7 @@
"start_col_offset_idx": 5, "start_col_offset_idx": 5,
"end_col_offset_idx": 6, "end_col_offset_idx": 6,
"text": "City", "text": "City",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -123,7 +123,7 @@
"start_col_offset_idx": 6, "start_col_offset_idx": 6,
"end_col_offset_idx": 7, "end_col_offset_idx": 7,
"text": "Country", "text": "Country",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -135,7 +135,7 @@
"start_col_offset_idx": 7, "start_col_offset_idx": 7,
"end_col_offset_idx": 8, "end_col_offset_idx": 8,
"text": "Phone 1", "text": "Phone 1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -147,7 +147,7 @@
"start_col_offset_idx": 8, "start_col_offset_idx": 8,
"end_col_offset_idx": 9, "end_col_offset_idx": 9,
"text": "Phone 2", "text": "Phone 2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -159,7 +159,7 @@
"start_col_offset_idx": 9, "start_col_offset_idx": 9,
"end_col_offset_idx": 10, "end_col_offset_idx": 10,
"text": "Email", "text": "Email",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -171,7 +171,7 @@
"start_col_offset_idx": 10, "start_col_offset_idx": 10,
"end_col_offset_idx": 11, "end_col_offset_idx": 11,
"text": "Subscription Date", "text": "Subscription Date",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -183,7 +183,7 @@
"start_col_offset_idx": 11, "start_col_offset_idx": 11,
"end_col_offset_idx": 12, "end_col_offset_idx": 12,
"text": "Website", "text": "Website",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -920,7 +920,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Index", "text": "Index",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -932,7 +932,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Customer Id", "text": "Customer Id",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -944,7 +944,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "First Name", "text": "First Name",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -956,7 +956,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "Last Name", "text": "Last Name",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -968,7 +968,7 @@
"start_col_offset_idx": 4, "start_col_offset_idx": 4,
"end_col_offset_idx": 5, "end_col_offset_idx": 5,
"text": "Company", "text": "Company",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -980,7 +980,7 @@
"start_col_offset_idx": 5, "start_col_offset_idx": 5,
"end_col_offset_idx": 6, "end_col_offset_idx": 6,
"text": "City", "text": "City",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -992,7 +992,7 @@
"start_col_offset_idx": 6, "start_col_offset_idx": 6,
"end_col_offset_idx": 7, "end_col_offset_idx": 7,
"text": "Country", "text": "Country",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1004,7 +1004,7 @@
"start_col_offset_idx": 7, "start_col_offset_idx": 7,
"end_col_offset_idx": 8, "end_col_offset_idx": 8,
"text": "Phone 1", "text": "Phone 1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1016,7 +1016,7 @@
"start_col_offset_idx": 8, "start_col_offset_idx": 8,
"end_col_offset_idx": 9, "end_col_offset_idx": 9,
"text": "Phone 2", "text": "Phone 2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1028,7 +1028,7 @@
"start_col_offset_idx": 9, "start_col_offset_idx": 9,
"end_col_offset_idx": 10, "end_col_offset_idx": 10,
"text": "Email", "text": "Email",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1040,7 +1040,7 @@
"start_col_offset_idx": 10, "start_col_offset_idx": 10,
"end_col_offset_idx": 11, "end_col_offset_idx": 11,
"text": "Subscription Date", "text": "Subscription Date",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1052,7 +1052,7 @@
"start_col_offset_idx": 11, "start_col_offset_idx": 11,
"end_col_offset_idx": 12, "end_col_offset_idx": 12,
"text": "Website", "text": "Website",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }

@ -51,7 +51,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Index", "text": "Index",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -63,7 +63,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Customer Id", "text": "Customer Id",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -75,7 +75,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "First Name", "text": "First Name",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -87,7 +87,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "Last Name", "text": "Last Name",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -99,7 +99,7 @@
"start_col_offset_idx": 4, "start_col_offset_idx": 4,
"end_col_offset_idx": 5, "end_col_offset_idx": 5,
"text": "Company", "text": "Company",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -111,7 +111,7 @@
"start_col_offset_idx": 5, "start_col_offset_idx": 5,
"end_col_offset_idx": 6, "end_col_offset_idx": 6,
"text": "City", "text": "City",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -123,7 +123,7 @@
"start_col_offset_idx": 6, "start_col_offset_idx": 6,
"end_col_offset_idx": 7, "end_col_offset_idx": 7,
"text": "Country", "text": "Country",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -135,7 +135,7 @@
"start_col_offset_idx": 7, "start_col_offset_idx": 7,
"end_col_offset_idx": 8, "end_col_offset_idx": 8,
"text": "Phone 1", "text": "Phone 1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -147,7 +147,7 @@
"start_col_offset_idx": 8, "start_col_offset_idx": 8,
"end_col_offset_idx": 9, "end_col_offset_idx": 9,
"text": "Phone 2", "text": "Phone 2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -159,7 +159,7 @@
"start_col_offset_idx": 9, "start_col_offset_idx": 9,
"end_col_offset_idx": 10, "end_col_offset_idx": 10,
"text": "Email", "text": "Email",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -171,7 +171,7 @@
"start_col_offset_idx": 10, "start_col_offset_idx": 10,
"end_col_offset_idx": 11, "end_col_offset_idx": 11,
"text": "Subscription Date", "text": "Subscription Date",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -183,7 +183,7 @@
"start_col_offset_idx": 11, "start_col_offset_idx": 11,
"end_col_offset_idx": 12, "end_col_offset_idx": 12,
"text": "Website", "text": "Website",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -920,7 +920,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Index", "text": "Index",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -932,7 +932,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Customer Id", "text": "Customer Id",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -944,7 +944,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "First Name", "text": "First Name",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -956,7 +956,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "Last Name", "text": "Last Name",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -968,7 +968,7 @@
"start_col_offset_idx": 4, "start_col_offset_idx": 4,
"end_col_offset_idx": 5, "end_col_offset_idx": 5,
"text": "Company", "text": "Company",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -980,7 +980,7 @@
"start_col_offset_idx": 5, "start_col_offset_idx": 5,
"end_col_offset_idx": 6, "end_col_offset_idx": 6,
"text": "City", "text": "City",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -992,7 +992,7 @@
"start_col_offset_idx": 6, "start_col_offset_idx": 6,
"end_col_offset_idx": 7, "end_col_offset_idx": 7,
"text": "Country", "text": "Country",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1004,7 +1004,7 @@
"start_col_offset_idx": 7, "start_col_offset_idx": 7,
"end_col_offset_idx": 8, "end_col_offset_idx": 8,
"text": "Phone 1", "text": "Phone 1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1016,7 +1016,7 @@
"start_col_offset_idx": 8, "start_col_offset_idx": 8,
"end_col_offset_idx": 9, "end_col_offset_idx": 9,
"text": "Phone 2", "text": "Phone 2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1028,7 +1028,7 @@
"start_col_offset_idx": 9, "start_col_offset_idx": 9,
"end_col_offset_idx": 10, "end_col_offset_idx": 10,
"text": "Email", "text": "Email",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1040,7 +1040,7 @@
"start_col_offset_idx": 10, "start_col_offset_idx": 10,
"end_col_offset_idx": 11, "end_col_offset_idx": 11,
"text": "Subscription Date", "text": "Subscription Date",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1052,7 +1052,7 @@
"start_col_offset_idx": 11, "start_col_offset_idx": 11,
"end_col_offset_idx": 12, "end_col_offset_idx": 12,
"text": "Website", "text": "Website",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }


@ -51,7 +51,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "1", "text": "1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -63,7 +63,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "2", "text": "2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -75,7 +75,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "3", "text": "3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -87,7 +87,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "4", "text": "4",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -284,7 +284,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "1", "text": "1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -296,7 +296,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "2", "text": "2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -308,7 +308,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "3", "text": "3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -320,7 +320,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "4", "text": "4",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }


@ -51,7 +51,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "1", "text": "1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -63,7 +63,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "2", "text": "2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -75,7 +75,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "3", "text": "3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -87,7 +87,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "4", "text": "4",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -308,7 +308,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "1", "text": "1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -320,7 +320,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "2", "text": "2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -332,7 +332,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "3", "text": "3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -344,7 +344,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "4", "text": "4",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },


@ -0,0 +1,40 @@
item-0 at level 0: unspecified: group _root_
item-1 at level 1: inline: group group
item-2 at level 2: paragraph: This is a word document and this is an inline equation:
item-3 at level 2: formula: A= \pi r^{2}
item-4 at level 2: paragraph: . If instead, I want an equation by line, I can do this:
item-5 at level 1: paragraph:
item-6 at level 1: formula: a^{2}+b^{2}=c^{2} \text{ \texttimes } 23
item-7 at level 1: paragraph: And that is an equation by itself. Cheers!
item-8 at level 1: paragraph:
item-9 at level 1: paragraph: This is another equation:
item-10 at level 1: formula: f\left(x\right)=a_{0}+\sum_{n=1} ... })+b_{n}\sin(\frac{n \pi x}{L})\right)
item-11 at level 1: paragraph:
item-12 at level 1: paragraph: This is text. This is text. This ... s is text. This is text. This is text.
item-13 at level 1: paragraph:
item-14 at level 1: paragraph:
item-15 at level 1: inline: group group
item-16 at level 2: paragraph: This is a word document and this is an inline equation:
item-17 at level 2: formula: A= \pi r^{2}
item-18 at level 2: paragraph: . If instead, I want an equation by line, I can do this:
item-19 at level 1: paragraph:
item-20 at level 1: formula: \left(x+a\right)^{n}=\sum_{k=0}^ ... ac{}{}{0pt}{}{n}{k}\right)x^{k}a^{n-k}
item-21 at level 1: paragraph:
item-22 at level 1: paragraph: And that is an equation by itself. Cheers!
item-23 at level 1: paragraph:
item-24 at level 1: paragraph: This is another equation:
item-25 at level 1: paragraph:
item-26 at level 1: formula: \left(1+x\right)^{n}=1+\frac{nx} ... ght)x^{2}}{2!}+ \text{ \textellipsis }
item-27 at level 1: paragraph:
item-28 at level 1: paragraph: This is text. This is text. This ... s is text. This is text. This is text.
item-29 at level 1: paragraph:
item-30 at level 1: paragraph:
item-31 at level 1: inline: group group
item-32 at level 2: paragraph: This is a word document and this is an inline equation:
item-33 at level 2: formula: A= \pi r^{2}
item-34 at level 2: paragraph: . If instead, I want an equation by line, I can do this:
item-35 at level 1: paragraph:
item-36 at level 1: formula: e^{x}=1+\frac{x}{1!}+\frac{x^{2} ... xtellipsis } , - \infty < x < \infty
item-37 at level 1: paragraph:
item-38 at level 1: paragraph: And that is an equation by itself. Cheers!
item-39 at level 1: paragraph:


@ -0,0 +1,616 @@
{
"schema_name": "DoclingDocument",
"version": "1.2.0",
"name": "equations",
"origin": {
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"binary_hash": 11121138535595486899,
"filename": "equations.docx"
},
"furniture": {
"self_ref": "#/furniture",
"children": [],
"content_layer": "furniture",
"name": "_root_",
"label": "unspecified"
},
"body": {
"self_ref": "#/body",
"children": [
{
"$ref": "#/groups/0"
},
{
"$ref": "#/texts/3"
},
{
"$ref": "#/texts/4"
},
{
"$ref": "#/texts/5"
},
{
"$ref": "#/texts/6"
},
{
"$ref": "#/texts/7"
},
{
"$ref": "#/texts/8"
},
{
"$ref": "#/texts/9"
},
{
"$ref": "#/texts/10"
},
{
"$ref": "#/texts/11"
},
{
"$ref": "#/texts/12"
},
{
"$ref": "#/groups/1"
},
{
"$ref": "#/texts/16"
},
{
"$ref": "#/texts/17"
},
{
"$ref": "#/texts/18"
},
{
"$ref": "#/texts/19"
},
{
"$ref": "#/texts/20"
},
{
"$ref": "#/texts/21"
},
{
"$ref": "#/texts/22"
},
{
"$ref": "#/texts/23"
},
{
"$ref": "#/texts/24"
},
{
"$ref": "#/texts/25"
},
{
"$ref": "#/texts/26"
},
{
"$ref": "#/texts/27"
},
{
"$ref": "#/groups/2"
},
{
"$ref": "#/texts/31"
},
{
"$ref": "#/texts/32"
},
{
"$ref": "#/texts/33"
},
{
"$ref": "#/texts/34"
},
{
"$ref": "#/texts/35"
}
],
"content_layer": "body",
"name": "_root_",
"label": "unspecified"
},
"groups": [
{
"self_ref": "#/groups/0",
"parent": {
"$ref": "#/body"
},
"children": [
{
"$ref": "#/texts/0"
},
{
"$ref": "#/texts/1"
},
{
"$ref": "#/texts/2"
}
],
"content_layer": "body",
"name": "group",
"label": "inline"
},
{
"self_ref": "#/groups/1",
"parent": {
"$ref": "#/body"
},
"children": [
{
"$ref": "#/texts/13"
},
{
"$ref": "#/texts/14"
},
{
"$ref": "#/texts/15"
}
],
"content_layer": "body",
"name": "group",
"label": "inline"
},
{
"self_ref": "#/groups/2",
"parent": {
"$ref": "#/body"
},
"children": [
{
"$ref": "#/texts/28"
},
{
"$ref": "#/texts/29"
},
{
"$ref": "#/texts/30"
}
],
"content_layer": "body",
"name": "group",
"label": "inline"
}
],
"texts": [
{
"self_ref": "#/texts/0",
"parent": {
"$ref": "#/groups/0"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "This is a word document and this is an inline equation: ",
"text": "This is a word document and this is an inline equation: "
},
{
"self_ref": "#/texts/1",
"parent": {
"$ref": "#/groups/0"
},
"children": [],
"content_layer": "body",
"label": "formula",
"prov": [],
"orig": "A= \\pi r^{2} ",
"text": "A= \\pi r^{2} "
},
{
"self_ref": "#/texts/2",
"parent": {
"$ref": "#/groups/0"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": ". If instead, I want an equation by line, I can do this:",
"text": ". If instead, I want an equation by line, I can do this:"
},
{
"self_ref": "#/texts/3",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/4",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "formula",
"prov": [],
"orig": "a^{2}+b^{2}=c^{2} \\text{ \\texttimes } 23",
"text": "a^{2}+b^{2}=c^{2} \\text{ \\texttimes } 23"
},
{
"self_ref": "#/texts/5",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "And that is an equation by itself. Cheers!",
"text": "And that is an equation by itself. Cheers!"
},
{
"self_ref": "#/texts/6",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/7",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "This is another equation:",
"text": "This is another equation:"
},
{
"self_ref": "#/texts/8",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "formula",
"prov": [],
"orig": "f\\left(x\\right)=a_{0}+\\sum_{n=1}^{ \\infty }\\left(a_{n}\\cos(\\frac{n \\pi x}{L})+b_{n}\\sin(\\frac{n \\pi x}{L})\\right)",
"text": "f\\left(x\\right)=a_{0}+\\sum_{n=1}^{ \\infty }\\left(a_{n}\\cos(\\frac{n \\pi x}{L})+b_{n}\\sin(\\frac{n \\pi x}{L})\\right)"
},
{
"self_ref": "#/texts/9",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/10",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text.",
"text": "This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text."
},
{
"self_ref": "#/texts/11",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/12",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/13",
"parent": {
"$ref": "#/groups/1"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "This is a word document and this is an inline equation: ",
"text": "This is a word document and this is an inline equation: "
},
{
"self_ref": "#/texts/14",
"parent": {
"$ref": "#/groups/1"
},
"children": [],
"content_layer": "body",
"label": "formula",
"prov": [],
"orig": "A= \\pi r^{2} ",
"text": "A= \\pi r^{2} "
},
{
"self_ref": "#/texts/15",
"parent": {
"$ref": "#/groups/1"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": ". If instead, I want an equation by line, I can do this:",
"text": ". If instead, I want an equation by line, I can do this:"
},
{
"self_ref": "#/texts/16",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/17",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "formula",
"prov": [],
"orig": "\\left(x+a\\right)^{n}=\\sum_{k=0}^{n}\\left(\\genfrac{}{}{0pt}{}{n}{k}\\right)x^{k}a^{n-k}",
"text": "\\left(x+a\\right)^{n}=\\sum_{k=0}^{n}\\left(\\genfrac{}{}{0pt}{}{n}{k}\\right)x^{k}a^{n-k}"
},
{
"self_ref": "#/texts/18",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/19",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "And that is an equation by itself. Cheers!",
"text": "And that is an equation by itself. Cheers!"
},
{
"self_ref": "#/texts/20",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/21",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "This is another equation:",
"text": "This is another equation:"
},
{
"self_ref": "#/texts/22",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/23",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "formula",
"prov": [],
"orig": "\\left(1+x\\right)^{n}=1+\\frac{nx}{1!}+\\frac{n\\left(n-1\\right)x^{2}}{2!}+ \\text{ \\textellipsis }",
"text": "\\left(1+x\\right)^{n}=1+\\frac{nx}{1!}+\\frac{n\\left(n-1\\right)x^{2}}{2!}+ \\text{ \\textellipsis }"
},
{
"self_ref": "#/texts/24",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/25",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text.",
"text": "This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text."
},
{
"self_ref": "#/texts/26",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/27",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/28",
"parent": {
"$ref": "#/groups/2"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "This is a word document and this is an inline equation: ",
"text": "This is a word document and this is an inline equation: "
},
{
"self_ref": "#/texts/29",
"parent": {
"$ref": "#/groups/2"
},
"children": [],
"content_layer": "body",
"label": "formula",
"prov": [],
"orig": "A= \\pi r^{2} ",
"text": "A= \\pi r^{2} "
},
{
"self_ref": "#/texts/30",
"parent": {
"$ref": "#/groups/2"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": ". If instead, I want an equation by line, I can do this:",
"text": ". If instead, I want an equation by line, I can do this:"
},
{
"self_ref": "#/texts/31",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/32",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "formula",
"prov": [],
"orig": "e^{x}=1+\\frac{x}{1!}+\\frac{x^{2}}{2!}+\\frac{x^{3}}{3!}+ \\text{ \\textellipsis } , - \\infty < x < \\infty",
"text": "e^{x}=1+\\frac{x}{1!}+\\frac{x^{2}}{2!}+\\frac{x^{3}}{3!}+ \\text{ \\textellipsis } , - \\infty < x < \\infty"
},
{
"self_ref": "#/texts/33",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
},
{
"self_ref": "#/texts/34",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "And that is an equation by itself. Cheers!",
"text": "And that is an equation by itself. Cheers!"
},
{
"self_ref": "#/texts/35",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "paragraph",
"prov": [],
"orig": "",
"text": ""
}
],
"pictures": [],
"tables": [],
"key_value_items": [],
"form_items": [],
"pages": {}
}
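
Reviewer note: the fixture above stores the document tree as flat `groups` and `texts` arrays linked by `$ref` pointers such as `#/texts/3`. The sketch below shows one way to walk such a structure depth-first and print an outline in the style of the `item-N at level L` listing. It is a minimal illustration, not Docling's own code: the helper names `resolve_ref`, `walk`, and `outline` and the `sample` dict are hypothetical, and the listing's truncation of long text with `...` is not reproduced.

```python
from typing import Iterator, Optional, Tuple


def resolve_ref(doc: dict, ref: str) -> dict:
    """Follow a local reference such as '#/texts/3' into the document dict."""
    node = doc
    for part in ref.removeprefix("#/").split("/"):
        # Arrays ("groups", "texts") are indexed by integer, objects by key.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def walk(doc: dict, node: Optional[dict] = None, level: int = 0) -> Iterator[Tuple[int, dict]]:
    """Yield (level, node) pairs depth-first, starting at the body root."""
    node = doc["body"] if node is None else node
    yield level, node
    for child in node.get("children", []):
        yield from walk(doc, resolve_ref(doc, child["$ref"]), level + 1)


def outline(doc: dict) -> list:
    """Build one outline line per node: groups show their name, texts their text."""
    lines = []
    for index, (level, node) in enumerate(walk(doc)):
        if "text" in node:
            description = f"{node['label']}: {node['text']}"
        else:
            description = f"{node['label']}: group {node['name']}"
        lines.append(f"item-{index} at level {level}: {description}")
    return lines


# Hypothetical, heavily reduced document shaped like the fixture above.
sample = {
    "body": {
        "self_ref": "#/body",
        "children": [{"$ref": "#/groups/0"}, {"$ref": "#/texts/2"}],
        "name": "_root_",
        "label": "unspecified",
    },
    "groups": [
        {
            "self_ref": "#/groups/0",
            "children": [{"$ref": "#/texts/0"}, {"$ref": "#/texts/1"}],
            "name": "group",
            "label": "inline",
        }
    ],
    "texts": [
        {"self_ref": "#/texts/0", "children": [], "label": "paragraph", "text": "An inline equation:"},
        {"self_ref": "#/texts/1", "children": [], "label": "formula", "text": "A= \\pi r^{2}"},
        {"self_ref": "#/texts/2", "children": [], "label": "paragraph", "text": "Cheers!"},
    ],
}

if __name__ == "__main__":
    print("\n".join(outline(sample)))
```

Resolving each pointer on demand keeps the sketch short; a real consumer would more likely rely on the library's own document model than on raw dictionaries.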


@ -0,0 +1,29 @@
This is a word document and this is an inline equation: $A= \pi r^{2} $ . If instead, I want an equation by line, I can do this:
$$a^{2}+b^{2}=c^{2} \text{ \texttimes } 23$$
And that is an equation by itself. Cheers!
This is another equation:
$$f\left(x\right)=a_{0}+\sum_{n=1}^{ \infty }\left(a_{n}\cos(\frac{n \pi x}{L})+b_{n}\sin(\frac{n \pi x}{L})\right)$$
This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text.
This is a word document and this is an inline equation: $A= \pi r^{2} $ . If instead, I want an equation by line, I can do this:
$$\left(x+a\right)^{n}=\sum_{k=0}^{n}\left(\genfrac{}{}{0pt}{}{n}{k}\right)x^{k}a^{n-k}$$
And that is an equation by itself. Cheers!
This is another equation:
$$\left(1+x\right)^{n}=1+\frac{nx}{1!}+\frac{n\left(n-1\right)x^{2}}{2!}+ \text{ \textellipsis }$$
This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text.
This is a word document and this is an inline equation: $A= \pi r^{2} $ . If instead, I want an equation by line, I can do this:
$$e^{x}=1+\frac{x}{1!}+\frac{x^{2}}{2!}+\frac{x^{3}}{3!}+ \text{ \textellipsis } , - \infty < x < \infty$$
And that is an equation by itself. Cheers!
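
Reviewer note: in the Markdown fixture above, a formula inside an `inline` group appears wrapped in single dollar signs on the same line as its neighbouring text, a formula directly under the body appears in double dollar signs on its own line, and empty paragraphs do not appear at all. The sketch below reproduces that pattern on a reduced dictionary. It is an assumption-laden illustration rather than Docling's serializer: `resolve_ref`, `render_item`, `to_markdown`, and `sample` are hypothetical names, and the whitespace handling (stripping each piece, joining inline pieces with one space) is a guess that does not match the fixture byte for byte.

```python
def resolve_ref(doc: dict, ref: str) -> dict:
    """Follow a local reference such as '#/texts/3' into the document dict."""
    node = doc
    for part in ref.removeprefix("#/").split("/"):
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def render_item(item: dict, inline: bool) -> str:
    """Render one text item; formulas get $...$ inline or $$...$$ as a block."""
    text = item["text"].strip()
    if not text:
        return ""
    if item["label"] == "formula":
        return f"${text}$" if inline else f"$${text}$$"
    return text


def to_markdown(doc: dict) -> str:
    """Render body children in order, skipping items with empty text."""
    lines = []
    for child in doc["body"]["children"]:
        node = resolve_ref(doc, child["$ref"])
        if node.get("label") == "inline":
            # An inline group becomes a single line built from its children.
            parts = [
                render_item(resolve_ref(doc, c["$ref"]), inline=True)
                for c in node["children"]
            ]
            line = " ".join(p for p in parts if p)
        else:
            line = render_item(node, inline=False)
        if line:
            lines.append(line)
    return "\n".join(lines)


# Hypothetical reduced document: one inline group, one empty paragraph,
# one block-level formula.
sample = {
    "body": {
        "children": [
            {"$ref": "#/groups/0"},
            {"$ref": "#/texts/2"},
            {"$ref": "#/texts/3"},
        ]
    },
    "groups": [
        {
            "label": "inline",
            "name": "group",
            "children": [{"$ref": "#/texts/0"}, {"$ref": "#/texts/1"}],
        }
    ],
    "texts": [
        {"label": "paragraph", "text": "An inline equation: "},
        {"label": "formula", "text": "A= \\pi r^{2} "},
        {"label": "paragraph", "text": ""},
        {"label": "formula", "text": "a^{2}+b^{2}=c^{2}"},
    ],
}

if __name__ == "__main__":
    print(to_markdown(sample))
```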


@ -344,7 +344,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Header 1", "text": "Header 1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -356,7 +356,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Header 2", "text": "Header 2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -368,7 +368,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Header 3", "text": "Header 3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -493,7 +493,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Header 1", "text": "Header 1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -505,7 +505,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Header 2", "text": "Header 2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -517,7 +517,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Header 3", "text": "Header 3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }


@ -68,7 +68,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Header 1", "text": "Header 1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -80,7 +80,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Header 2 & 3 (colspan)", "text": "Header 2 & 3 (colspan)",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -181,7 +181,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Header 1", "text": "Header 1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -193,7 +193,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Header 2 & 3 (colspan)", "text": "Header 2 & 3 (colspan)",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -205,7 +205,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Header 2 & 3 (colspan)", "text": "Header 2 & 3 (colspan)",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }


@ -68,7 +68,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Header 1", "text": "Header 1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -80,7 +80,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Header 2 & 3 (colspan)", "text": "Header 2 & 3 (colspan)",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -181,7 +181,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Header 1", "text": "Header 1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -193,7 +193,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Header 2 & 3 (colspan)", "text": "Header 2 & 3 (colspan)",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -205,7 +205,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Header 2 & 3 (colspan)", "text": "Header 2 & 3 (colspan)",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }


@ -0,0 +1,22 @@
item-0 at level 0: unspecified: group _root_
item-1 at level 1: list: group list
item-2 at level 2: list_item: Asia
item-3 at level 3: list: group list
item-4 at level 4: list_item: China
item-5 at level 4: list_item: Japan
item-6 at level 4: list_item: Thailand
item-7 at level 2: list_item: Europe
item-8 at level 3: list: group list
item-9 at level 4: list_item: UK
item-10 at level 4: list_item: Germany
item-11 at level 4: list_item: Switzerland
item-12 at level 5: list: group list
item-13 at level 6: list: group list
item-14 at level 7: list_item: Bern
item-15 at level 7: list_item: Aargau
item-16 at level 4: list_item: Italy
item-17 at level 5: list: group list
item-18 at level 6: list: group list
item-19 at level 7: list_item: Piedmont
item-20 at level 7: list_item: Liguria
item-21 at level 2: list_item: Africa


@ -0,0 +1,374 @@
{
"schema_name": "DoclingDocument",
"version": "1.2.0",
"name": "example_07",
"origin": {
"mimetype": "text/html",
"binary_hash": 623628706615267627,
"filename": "example_07.html"
},
"furniture": {
"self_ref": "#/furniture",
"children": [],
"content_layer": "furniture",
"name": "_root_",
"label": "unspecified"
},
"body": {
"self_ref": "#/body",
"children": [
{
"$ref": "#/groups/0"
}
],
"content_layer": "body",
"name": "_root_",
"label": "unspecified"
},
"groups": [
{
"self_ref": "#/groups/0",
"parent": {
"$ref": "#/body"
},
"children": [
{
"$ref": "#/texts/0"
},
{
"$ref": "#/texts/4"
},
{
"$ref": "#/texts/13"
}
],
"content_layer": "body",
"name": "list",
"label": "list"
},
{
"self_ref": "#/groups/1",
"parent": {
"$ref": "#/texts/0"
},
"children": [
{
"$ref": "#/texts/1"
},
{
"$ref": "#/texts/2"
},
{
"$ref": "#/texts/3"
}
],
"content_layer": "body",
"name": "list",
"label": "list"
},
{
"self_ref": "#/groups/2",
"parent": {
"$ref": "#/texts/4"
},
"children": [
{
"$ref": "#/texts/5"
},
{
"$ref": "#/texts/6"
},
{
"$ref": "#/texts/7"
},
{
"$ref": "#/texts/10"
}
],
"content_layer": "body",
"name": "list",
"label": "list"
},
{
"self_ref": "#/groups/3",
"parent": {
"$ref": "#/texts/7"
},
"children": [
{
"$ref": "#/groups/4"
}
],
"content_layer": "body",
"name": "list",
"label": "list"
},
{
"self_ref": "#/groups/4",
"parent": {
"$ref": "#/groups/3"
},
"children": [
{
"$ref": "#/texts/8"
},
{
"$ref": "#/texts/9"
}
],
"content_layer": "body",
"name": "list",
"label": "list"
},
{
"self_ref": "#/groups/5",
"parent": {
"$ref": "#/texts/10"
},
"children": [
{
"$ref": "#/groups/6"
}
],
"content_layer": "body",
"name": "list",
"label": "list"
},
{
"self_ref": "#/groups/6",
"parent": {
"$ref": "#/groups/5"
},
"children": [
{
"$ref": "#/texts/11"
},
{
"$ref": "#/texts/12"
}
],
"content_layer": "body",
"name": "list",
"label": "list"
}
],
"texts": [
{
"self_ref": "#/texts/0",
"parent": {
"$ref": "#/groups/0"
},
"children": [
{
"$ref": "#/groups/1"
}
],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "Asia",
"text": "Asia",
"enumerated": false,
"marker": "-"
},
{
"self_ref": "#/texts/1",
"parent": {
"$ref": "#/groups/1"
},
"children": [],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "China",
"text": "China",
"enumerated": false,
"marker": "-"
},
{
"self_ref": "#/texts/2",
"parent": {
"$ref": "#/groups/1"
},
"children": [],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "Japan",
"text": "Japan",
"enumerated": false,
"marker": "-"
},
{
"self_ref": "#/texts/3",
"parent": {
"$ref": "#/groups/1"
},
"children": [],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "Thailand",
"text": "Thailand",
"enumerated": false,
"marker": "-"
},
{
"self_ref": "#/texts/4",
"parent": {
"$ref": "#/groups/0"
},
"children": [
{
"$ref": "#/groups/2"
}
],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "Europe",
"text": "Europe",
"enumerated": false,
"marker": "-"
},
{
"self_ref": "#/texts/5",
"parent": {
"$ref": "#/groups/2"
},
"children": [],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "UK",
"text": "UK",
"enumerated": false,
"marker": "-"
},
{
"self_ref": "#/texts/6",
"parent": {
"$ref": "#/groups/2"
},
"children": [],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "Germany",
"text": "Germany",
"enumerated": false,
"marker": "-"
},
{
"self_ref": "#/texts/7",
"parent": {
"$ref": "#/groups/2"
},
"children": [
{
"$ref": "#/groups/3"
}
],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "Switzerland",
"text": "Switzerland",
"enumerated": false,
"marker": "-"
},
{
"self_ref": "#/texts/8",
"parent": {
"$ref": "#/groups/4"
},
"children": [],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "Bern",
"text": "Bern",
"enumerated": false,
"marker": "-"
},
{
"self_ref": "#/texts/9",
"parent": {
"$ref": "#/groups/4"
},
"children": [],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "Aargau",
"text": "Aargau",
"enumerated": false,
"marker": "-"
},
{
"self_ref": "#/texts/10",
"parent": {
"$ref": "#/groups/2"
},
"children": [
{
"$ref": "#/groups/5"
}
],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "Italy",
"text": "Italy",
"enumerated": false,
"marker": "-"
},
{
"self_ref": "#/texts/11",
"parent": {
"$ref": "#/groups/6"
},
"children": [],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "Piedmont",
"text": "Piedmont",
"enumerated": false,
"marker": "-"
},
{
"self_ref": "#/texts/12",
"parent": {
"$ref": "#/groups/6"
},
"children": [],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "Liguria",
"text": "Liguria",
"enumerated": false,
"marker": "-"
},
{
"self_ref": "#/texts/13",
"parent": {
"$ref": "#/groups/0"
},
"children": [],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "Africa",
"text": "Africa",
"enumerated": false,
"marker": "-"
}
],
"pictures": [],
"tables": [],
"key_value_items": [],
"form_items": [],
"pages": {}
}
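
Reviewer note: the fixture above stores the nested list as flat `groups` and `texts` arrays linked by `$ref` pointers such as `#/texts/4`. The sketch below is one way to walk those pointers and rebuild the nesting. It is not Docling's own serializer. `resolve`, `render`, `SAMPLE`, and the four-space indent are hypothetical choices made for illustration, and `SAMPLE` is a hand-trimmed subset of the fixture.

```python
from typing import Any

# Hand-trimmed subset of the fixture above (illustrative only).
SAMPLE: dict[str, Any] = {
    "body": {"self_ref": "#/body", "label": "unspecified",
             "children": [{"$ref": "#/groups/0"}]},
    "groups": [
        {"self_ref": "#/groups/0", "label": "list",
         "children": [{"$ref": "#/texts/0"}, {"$ref": "#/texts/2"}]},
        {"self_ref": "#/groups/1", "label": "list",
         "children": [{"$ref": "#/texts/1"}]},
    ],
    "texts": [
        {"self_ref": "#/texts/0", "label": "list_item", "text": "Asia",
         "marker": "-", "children": [{"$ref": "#/groups/1"}]},
        {"self_ref": "#/texts/1", "label": "list_item", "text": "China",
         "marker": "-", "children": []},
        {"self_ref": "#/texts/2", "label": "list_item", "text": "Africa",
         "marker": "-", "children": []},
    ],
}


def resolve(doc: dict[str, Any], ref: str) -> dict[str, Any]:
    """Follow a local pointer such as '#/texts/4' or '#/body'."""
    node: Any = doc
    for part in ref.lstrip("#/").split("/"):
        # Array segments are numeric indices; object segments are keys.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def render(doc: dict[str, Any], ref: str = "#/body", depth: int = 0) -> list[str]:
    """Return indented list lines; only list items increase the depth."""
    node = resolve(doc, ref)
    lines: list[str] = []
    is_item = node.get("label") == "list_item"
    if is_item:
        lines.append("    " * depth + f"{node['marker']} {node['text']}")
    for child in node.get("children", []):
        # Groups are transparent containers, so they keep the current depth.
        lines.extend(render(doc, child["$ref"], depth + 1 if is_item else depth))
    return lines


if __name__ == "__main__":
    print("\n".join(render(SAMPLE)))
```

Because groups do not add depth in this sketch, a group nested directly inside another group (as with `#/groups/3` and `#/groups/4` above) indents its items by one level only, which may differ from what the real exporter emits.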

View File

@ -0,0 +1,14 @@
- Asia
- China
- Japan
- Thailand
- Europe
- UK
- Germany
- Switzerland
- Bern
- Aargau
- Italy
- Piedmont
- Liguria
- Africa

View File

@ -960,7 +960,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "Class1", "text": "Class1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -972,7 +972,7 @@
"start_col_offset_idx": 4, "start_col_offset_idx": 4,
"end_col_offset_idx": 7, "end_col_offset_idx": 7,
"text": "Class2", "text": "Class2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1385,7 +1385,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "Class1", "text": "Class1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1397,7 +1397,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "Class1", "text": "Class1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1409,7 +1409,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "Class1", "text": "Class1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1421,7 +1421,7 @@
"start_col_offset_idx": 4, "start_col_offset_idx": 4,
"end_col_offset_idx": 7, "end_col_offset_idx": 7,
"text": "Class2", "text": "Class2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1433,7 +1433,7 @@
"start_col_offset_idx": 4, "start_col_offset_idx": 4,
"end_col_offset_idx": 7, "end_col_offset_idx": 7,
"text": "Class2", "text": "Class2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1445,7 +1445,7 @@
"start_col_offset_idx": 4, "start_col_offset_idx": 4,
"end_col_offset_idx": 7, "end_col_offset_idx": 7,
"text": "Class2", "text": "Class2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }

View File

@ -176,7 +176,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Tab1", "text": "Tab1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -188,7 +188,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Tab2", "text": "Tab2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -200,7 +200,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Tab3", "text": "Tab3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -289,7 +289,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Tab1", "text": "Tab1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -301,7 +301,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Tab2", "text": "Tab2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -313,7 +313,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Tab3", "text": "Tab3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }

View File

@ -136,7 +136,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "first ", "text": "first ",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -148,7 +148,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "second ", "text": "second ",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -160,7 +160,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "third", "text": "third",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -393,7 +393,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "first ", "text": "first ",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -405,7 +405,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "second ", "text": "second ",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -417,7 +417,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "third", "text": "third",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }
@ -675,7 +675,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "col-1", "text": "col-1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -687,7 +687,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "col-2", "text": "col-2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -699,7 +699,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "col-3", "text": "col-3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -711,7 +711,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "col-4", "text": "col-4",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1112,7 +1112,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "col-1", "text": "col-1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1124,7 +1124,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "col-2", "text": "col-2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1136,7 +1136,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "col-3", "text": "col-3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1148,7 +1148,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "col-4", "text": "col-4",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }
@ -1578,7 +1578,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "col-1", "text": "col-1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1590,7 +1590,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "col-2", "text": "col-2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1602,7 +1602,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "col-3", "text": "col-3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1763,7 +1763,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "col-1", "text": "col-1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1775,7 +1775,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "col-2", "text": "col-2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1787,7 +1787,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "col-3", "text": "col-3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }
@ -1969,7 +1969,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "col-1", "text": "col-1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1981,7 +1981,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "col-2", "text": "col-2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1993,7 +1993,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "col-3", "text": "col-3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -2154,7 +2154,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "col-1", "text": "col-1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -2166,7 +2166,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "col-2", "text": "col-2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -2178,7 +2178,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "col-3", "text": "col-3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }
@ -2360,7 +2360,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "first ", "text": "first ",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -2372,7 +2372,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "header", "text": "header",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -2545,7 +2545,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "first ", "text": "first ",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -2557,7 +2557,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "header", "text": "header",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -2569,7 +2569,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "header", "text": "header",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }
@ -2583,7 +2583,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "first ", "text": "first ",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -2827,7 +2827,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "first (f)", "text": "first (f)",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -2839,7 +2839,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "header (f)", "text": "header (f)",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -3012,7 +3012,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "first (f)", "text": "first (f)",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -3024,7 +3024,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "header (f)", "text": "header (f)",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -3036,7 +3036,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "header (f)", "text": "header (f)",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }
@ -3050,7 +3050,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "first (f)", "text": "first (f)",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },

View File

@ -7914,7 +7914,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Duck\n", "text": "Duck\n",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -7950,7 +7950,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Scientific classification \n", "text": "Scientific classification \n",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -8130,7 +8130,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Subfamilies\n", "text": "Subfamilies\n",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -8159,7 +8159,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Duck\n", "text": "Duck\n",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -8171,7 +8171,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Duck\n", "text": "Duck\n",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }
@ -8237,7 +8237,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Scientific classification \n", "text": "Scientific classification \n",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -8249,7 +8249,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Scientific classification \n", "text": "Scientific classification \n",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }
@ -8445,7 +8445,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Subfamilies\n", "text": "Subfamilies\n",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -8457,7 +8457,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Subfamilies\n", "text": "Subfamilies\n",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }
@ -8513,7 +8513,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Authority control databases ", "text": "Authority control databases ",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -8578,7 +8578,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Authority control databases ", "text": "Authority control databases ",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -8590,7 +8590,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Authority control databases ", "text": "Authority control databases ",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }

View File

@ -490,7 +490,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "", "text": "",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -502,7 +502,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Food", "text": "Food",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -514,7 +514,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Calories per portion", "text": "Calories per portion",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -639,7 +639,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "", "text": "",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -651,7 +651,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Food", "text": "Food",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -663,7 +663,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Calories per portion", "text": "Calories per portion",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }

View File

@ -71,19 +71,19 @@
</head> </head>
<h2>Test with tables</h2> <h2>Test with tables</h2>
<p>A uniform table</p> <p>A uniform table</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td></tr><tr><td>Cell 1.0</td><td>Cell 1.1</td><td>Cell 1.2</td></tr><tr><td>Cell 2.0</td><td>Cell 2.1</td><td>Cell 2.2</td></tr></tbody></table> <table><tbody><tr><th>Header 0.0</th><th>Header 0.1</th><th>Header 0.2</th></tr><tr><td>Cell 1.0</td><td>Cell 1.1</td><td>Cell 1.2</td></tr><tr><td>Cell 2.0</td><td>Cell 2.1</td><td>Cell 2.2</td></tr></tbody></table>
<p></p> <p></p>
<p>A non-uniform table with horizontal spans</p> <p>A non-uniform table with horizontal spans</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td></tr><tr><td>Cell 1.0</td><td colspan="2">Merged Cell 1.1 1.2</td></tr><tr><td>Cell 2.0</td><td colspan="2">Merged Cell 2.1 2.2</td></tr></tbody></table> <table><tbody><tr><th>Header 0.0</th><th>Header 0.1</th><th>Header 0.2</th></tr><tr><td>Cell 1.0</td><td colspan="2">Merged Cell 1.1 1.2</td></tr><tr><td>Cell 2.0</td><td colspan="2">Merged Cell 2.1 2.2</td></tr></tbody></table>
<p></p> <p></p>
<p>A non-uniform table with horizontal spans in inner columns</p> <p>A non-uniform table with horizontal spans in inner columns</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td><td>Header 0.3</td></tr><tr><td>Cell 1.0</td><td colspan="2">Merged Cell 1.1 1.2</td><td>Cell 1.3</td></tr><tr><td>Cell 2.0</td><td colspan="2">Merged Cell 2.1 2.2</td><td>Cell 2.3</td></tr></tbody></table> <table><tbody><tr><th>Header 0.0</th><th>Header 0.1</th><th>Header 0.2</th><th>Header 0.3</th></tr><tr><td>Cell 1.0</td><td colspan="2">Merged Cell 1.1 1.2</td><td>Cell 1.3</td></tr><tr><td>Cell 2.0</td><td colspan="2">Merged Cell 2.1 2.2</td><td>Cell 2.3</td></tr></tbody></table>
<p></p> <p></p>
<p>A non-uniform table with vertical spans</p> <p>A non-uniform table with vertical spans</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td></tr><tr><td>Cell 1.0</td><td rowspan="2">Merged Cell 1.1 2.1</td><td>Cell 1.2</td></tr><tr><td>Cell 2.0</td><td>Cell 2.2</td></tr><tr><td>Cell 3.0</td><td rowspan="2">Merged Cell 3.1 4.1</td><td>Cell 3.2</td></tr><tr><td>Cell 4.0</td><td>Cell 4.2</td></tr></tbody></table> <table><tbody><tr><th>Header 0.0</th><th>Header 0.1</th><th>Header 0.2</th></tr><tr><td>Cell 1.0</td><td rowspan="2">Merged Cell 1.1 2.1</td><td>Cell 1.2</td></tr><tr><td>Cell 2.0</td><td>Cell 2.2</td></tr><tr><td>Cell 3.0</td><td rowspan="2">Merged Cell 3.1 4.1</td><td>Cell 3.2</td></tr><tr><td>Cell 4.0</td><td>Cell 4.2</td></tr></tbody></table>
<p></p> <p></p>
<p>A non-uniform table with all kinds of spans and empty cells</p> <p>A non-uniform table with all kinds of spans and empty cells</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td><td></td><td></td></tr><tr><td>Cell 1.0</td><td rowspan="2">Merged Cell 1.1 2.1</td><td>Cell 1.2</td><td></td><td></td></tr><tr><td>Cell 2.0</td><td>Cell 2.2</td><td></td><td></td></tr><tr><td>Cell 3.0</td><td rowspan="2">Merged Cell 3.1 4.1</td><td>Cell 3.2</td><td rowspan="3"></td><td></td></tr><tr><td>Cell 4.0</td><td>Cell 4.2</td><td rowspan="2">Merged Cell 4.4 5.4</td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td></td><td></td></tr><tr><td colspan="5"></td></tr><tr><td></td><td></td><td></td><td></td><td>Cell 8.4</td></tr></tbody></table> <table><tbody><tr><th>Header 0.0</th><th>Header 0.1</th><th>Header 0.2</th><th></th><th></th></tr><tr><td>Cell 1.0</td><td rowspan="2">Merged Cell 1.1 2.1</td><td>Cell 1.2</td><td></td><td></td></tr><tr><td>Cell 2.0</td><td>Cell 2.2</td><td></td><td></td></tr><tr><td>Cell 3.0</td><td rowspan="2">Merged Cell 3.1 4.1</td><td>Cell 3.2</td><td rowspan="3"></td><td></td></tr><tr><td>Cell 4.0</td><td>Cell 4.2</td><td rowspan="2">Merged Cell 4.4 5.4</td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td></td><td></td></tr><tr><td colspan="5"></td></tr><tr><td></td><td></td><td></td><td></td><td>Cell 8.4</td></tr></tbody></table>
<p></p> <p></p>
<p></p> <p></p>
</html> </html>
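
Reviewer note: the HTML fixture above switches the first row from `<td>` to `<th>`, matching the `column_header` flag flipping to `true` in the JSON fixtures. A minimal sketch of that mapping follows, assuming each cell is a dict with `text` and `column_header` keys as in the JSON hunks. `row_to_html` is a hypothetical helper, not the exporter's actual code, and it ignores `colspan`/`rowspan`.

```python
from html import escape
from typing import Any


def row_to_html(cells: list[dict[str, Any]]) -> str:
    """Render one table row, using <th> for column-header cells."""
    parts = []
    for cell in cells:
        # Header cells become <th>; everything else stays <td>.
        tag = "th" if cell.get("column_header") else "td"
        parts.append(f"<{tag}>{escape(cell['text'])}</{tag}>")
    return "<tr>" + "".join(parts) + "</tr>"


if __name__ == "__main__":
    header = [
        {"text": "Header 0.0", "column_header": True},
        {"text": "Header 0.1", "column_header": True},
    ]
    print(row_to_html(header))
```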

View File

@ -261,7 +261,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Header 0.0", "text": "Header 0.0",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -273,7 +273,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Header 0.1", "text": "Header 0.1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -285,7 +285,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Header 0.2", "text": "Header 0.2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -374,7 +374,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Header 0.0", "text": "Header 0.0",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -386,7 +386,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Header 0.1", "text": "Header 0.1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -398,7 +398,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Header 0.2", "text": "Header 0.2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }
@ -504,7 +504,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Header 0.0", "text": "Header 0.0",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -516,7 +516,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Header 0.1", "text": "Header 0.1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -528,7 +528,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Header 0.2", "text": "Header 0.2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -593,7 +593,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Header 0.0", "text": "Header 0.0",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -605,7 +605,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Header 0.1", "text": "Header 0.1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -617,7 +617,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Header 0.2", "text": "Header 0.2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }
@ -723,7 +723,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Header 0.0", "text": "Header 0.0",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -735,7 +735,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Header 0.1", "text": "Header 0.1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -747,7 +747,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Header 0.2", "text": "Header 0.2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -759,7 +759,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "Header 0.3", "text": "Header 0.3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -848,7 +848,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Header 0.0", "text": "Header 0.0",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -860,7 +860,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Header 0.1", "text": "Header 0.1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -872,7 +872,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Header 0.2", "text": "Header 0.2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -884,7 +884,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "Header 0.3", "text": "Header 0.3",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }
@ -1014,7 +1014,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Header 0.0", "text": "Header 0.0",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1026,7 +1026,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Header 0.1", "text": "Header 0.1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1038,7 +1038,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Header 0.2", "text": "Header 0.2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1175,7 +1175,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Header 0.0", "text": "Header 0.0",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1187,7 +1187,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Header 0.1", "text": "Header 0.1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1199,7 +1199,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Header 0.2", "text": "Header 0.2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }
@ -1381,7 +1381,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Header 0.0", "text": "Header 0.0",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1393,7 +1393,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Header 0.1", "text": "Header 0.1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1405,7 +1405,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Header 0.2", "text": "Header 0.2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1417,7 +1417,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "", "text": "",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1429,7 +1429,7 @@
"start_col_offset_idx": 4, "start_col_offset_idx": 4,
"end_col_offset_idx": 5, "end_col_offset_idx": 5,
"text": "", "text": "",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1818,7 +1818,7 @@
"start_col_offset_idx": 0, "start_col_offset_idx": 0,
"end_col_offset_idx": 1, "end_col_offset_idx": 1,
"text": "Header 0.0", "text": "Header 0.0",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1830,7 +1830,7 @@
"start_col_offset_idx": 1, "start_col_offset_idx": 1,
"end_col_offset_idx": 2, "end_col_offset_idx": 2,
"text": "Header 0.1", "text": "Header 0.1",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1842,7 +1842,7 @@
"start_col_offset_idx": 2, "start_col_offset_idx": 2,
"end_col_offset_idx": 3, "end_col_offset_idx": 3,
"text": "Header 0.2", "text": "Header 0.2",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1854,7 +1854,7 @@
"start_col_offset_idx": 3, "start_col_offset_idx": 3,
"end_col_offset_idx": 4, "end_col_offset_idx": 4,
"text": "", "text": "",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
}, },
@ -1866,7 +1866,7 @@
"start_col_offset_idx": 4, "start_col_offset_idx": 4,
"end_col_offset_idx": 5, "end_col_offset_idx": 5,
"text": "", "text": "",
"column_header": false, "column_header": true,
"row_header": false, "row_header": false,
"row_section": false "row_section": false
} }

View File

@ -0,0 +1,40 @@
<html>
<body>
<ul>
<li>Asia
<ul>
<li>China</li>
<li>Japan</li>
<li>Thailand</li>
</ul>
</li>
<li>Europe
<ul>
<li>UK</li>
<li>Germany</li>
<li>Switzerland
<ul>
<li style="list-style-type: none;">
<ul>
<li>Bern</li>
<li>Aargau</li>
</ul>
</li>
</ul>
</li>
<li>Italy
<ul>
<li style="list-style-type: none;">
<ul>
<li>Piedmont</li>
<li>Liguria</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li>Africa</li>
</ul>
</body>
</html>

View File

@ -59,7 +59,11 @@ def test_e2e_valid_csv_conversions():
             pred_itxt, str(gt_path) + ".itxt"
         ), "export to indented-text"

-        assert verify_document(doc, str(gt_path) + ".json"), "export to json"
+        assert verify_document(
+            pred_doc=doc,
+            gtfile=str(gt_path) + ".json",
+            generate=GENERATE,
+        ), "export to json"


 def test_e2e_invalid_csv_conversions():

View File

@ -91,4 +91,8 @@ def test_e2e_docx_conversions():
         if docx_path.name == "word_tables.docx":
             pred_html: str = doc.export_to_html()
-            assert verify_export(pred_html, str(gt_path) + ".html"), "export to html"
+            assert verify_export(
+                pred_text=pred_html,
+                gtfile=str(gt_path) + ".html",
+                generate=GENERATE,
+            ), "export to html"
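
Note on the change above: the test now passes a `generate` flag through to the verification helper. The helper's body is not part of this diff, so the following is only a minimal sketch of how such a golden-file check could behave. It assumes that a true `generate` (or a missing ground-truth file) makes the helper write the prediction as the new reference instead of comparing. `verify_export_sketch` is a hypothetical name, not the project's implementation.

```python
from pathlib import Path


def verify_export_sketch(pred_text: str, gtfile: str, generate: bool = False) -> bool:
    """Golden-file check (hypothetical sketch).

    When `generate` is true, or the ground-truth file does not exist yet,
    the prediction is written out as the new reference and the check passes.
    Otherwise the prediction must match the stored reference exactly.
    """
    gt_path = Path(gtfile)
    if generate or not gt_path.exists():
        # Assumption: regeneration simply overwrites the reference file.
        gt_path.write_text(pred_text, encoding="utf-8")
        return True
    return pred_text == gt_path.read_text(encoding="utf-8")


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        gt = str(Path(tmp) / "word_tables.docx.html")
        # First run with generate=True writes the reference file.
        print(verify_export_sketch("<table></table>", gt, generate=True))
        # A later run compares against the stored reference.
        print(verify_export_sketch("<table></table>", gt))
        print(verify_export_sketch("<table><tr></tr></table>", gt))
```

Under these assumptions, a run with the flag enabled refreshes the reference files, while a normal run fails as soon as the export drifts from the stored output.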

View File

@ -179,7 +179,7 @@ def test_guess_format(tmp_path):
# Non-Docling JSON # Non-Docling JSON
# TODO: Docling JSON is currently the single supported JSON flavor and the pipeline # TODO: Docling JSON is currently the single supported JSON flavor and the pipeline
# will try to validate *any* JSON (based on suffix/MIME) as Docling JSON; proper # will try to validate *any* JSON (based on suffix/MIME) as Docling JSON; proper
# disambiguation seen as part of https://github.com/DS4SD/docling/issues/802 # disambiguation seen as part of https://github.com/docling-project/docling/issues/802
test_str = "{}" test_str = "{}"
stream = DocumentStream(name="test.json", stream=BytesIO(f"{test_str}".encode())) stream = DocumentStream(name="test.json", stream=BytesIO(f"{test_str}".encode()))
assert dci._guess_format(stream) == InputFormat.JSON_DOCLING assert dci._guess_format(stream) == InputFormat.JSON_DOCLING