Mirror of https://github.com/DS4SD/docling.git (synced 2025-07-30 14:04:27 +00:00)
Merge branch 'docling-project:main' into main
commit 9d95f0211e
2
.github/SECURITY.md
vendored
@@ -20,4 +20,4 @@ After the initial reply to your report, the security team will keep you informed

 ## Security Alerts

-We will send announcements of security vulnerabilities and steps to remediate on the [Docling announcements](https://github.com/DS4SD/docling/discussions/categories/announcements).
+We will send announcements of security vulnerabilities and steps to remediate on the [Docling announcements](https://github.com/docling-project/docling/discussions/categories/announcements).
2
.github/workflows/ci-docs.yml
vendored
@@ -10,7 +10,7 @@ on:

 jobs:
   build-docs:
-    if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'DS4SD/docling' && github.event.pull_request.head.repo.full_name != 'ds4sd/docling') }}
+    if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'docling-project/docling' && github.event.pull_request.head.repo.full_name != 'docling-project/docling') }}
     uses: ./.github/workflows/docs.yml
     with:
       deploy: false
2
.github/workflows/ci.yml
vendored
@@ -15,5 +15,5 @@ env:

 jobs:
   code-checks:
-    if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'DS4SD/docling' && github.event.pull_request.head.repo.full_name != 'ds4sd/docling') }}
+    if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'docling-project/docling' && github.event.pull_request.head.repo.full_name != 'docling-project/docling') }}
     uses: ./.github/workflows/checks.yml
1086
CHANGELOG.md
File diff suppressed because it is too large
@@ -2,13 +2,13 @@
 Our project welcomes external contributions. If you have an itch, please feel
 free to scratch it.

-To contribute code or documentation, please submit a [pull request](https://github.com/DS4SD/docling/pulls).
+To contribute code or documentation, please submit a [pull request](https://github.com/docling-project/docling/pulls).

 A good way to familiarize yourself with the codebase and contribution process is
-to look for and tackle low-hanging fruit in the [issue tracker](https://github.com/DS4SD/docling/issues).
+to look for and tackle low-hanging fruit in the [issue tracker](https://github.com/docling-project/docling/issues).
 Before embarking on a more ambitious contribution, please quickly [get in touch](#communication) with us.

-For general questions or support requests, please refer to the [discussion section](https://github.com/DS4SD/docling/discussions).
+For general questions or support requests, please refer to the [discussion section](https://github.com/docling-project/docling/discussions).

 **Note: We appreciate your effort and want to avoid situations where a contribution
 requires extensive rework (by you or by us), sits in the backlog for a long time, or
@@ -16,14 +16,14 @@ cannot be accepted at all!**

 ### Proposing New Features

-If you would like to implement a new feature, please [raise an issue](https://github.com/DS4SD/docling/issues)
+If you would like to implement a new feature, please [raise an issue](https://github.com/docling-project/docling/issues)
 before sending a pull request so the feature can be discussed. This is to avoid
 you spending valuable time working on a feature that the project developers
 are not interested in accepting into the codebase.

 ### Fixing Bugs

-If you would like to fix a bug, please [raise an issue](https://github.com/DS4SD/docling/issues) before sending a
+If you would like to fix a bug, please [raise an issue](https://github.com/docling-project/docling/issues) before sending a
 pull request so it can be tracked.

 ### Merge Approval
@@ -78,7 +78,7 @@ This project strictly adheres to using dependencies that are compatible with the

 ## Communication

-Please feel free to connect with us using the [discussion section](https://github.com/DS4SD/docling/discussions).
+Please feel free to connect with us using the [discussion section](https://github.com/docling-project/docling/discussions).
28
README.md
@@ -1,6 +1,6 @@
 <p align="center">
-  <a href="https://github.com/ds4sd/docling">
-    <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/docs/assets/docling_processing.png" width="100%"/>
+  <a href="https://github.com/docling-project/docling">
+    <img loading="lazy" alt="Docling" src="https://github.com/docling-project/docling/raw/main/docs/assets/docling_processing.png" width="100%"/>
   </a>
 </p>
@@ -11,7 +11,7 @@
 </p>

 [](https://arxiv.org/abs/2408.09869)
-[](https://ds4sd.github.io/docling/)
+[](https://docling-project.github.io/docling/)
 [](https://pypi.org/project/docling/)
 [](https://pypi.org/project/docling/)
 [](https://python-poetry.org/)
@@ -19,7 +19,7 @@
 [](https://pycqa.github.io/isort/)
 [](https://pydantic.dev)
 [](https://github.com/pre-commit/pre-commit)
-[](https://opensource.org/licenses/MIT)
+[](https://opensource.org/licenses/MIT)
 [](https://pepy.tech/projects/docling)

 Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
@@ -51,7 +51,7 @@ pip install docling

 Works on macOS, Linux and Windows environments. Both x86_64 and arm64 architectures.

-More [detailed installation instructions](https://ds4sd.github.io/docling/installation/) are available in the docs.
+More [detailed installation instructions](https://docling-project.github.io/docling/installation/) are available in the docs.

 ## Getting started
@@ -66,28 +66,28 @@ result = converter.convert(source)
 print(result.document.export_to_markdown())  # output: "## Docling Technical Report[...]"
 ```

-More [advanced usage options](https://ds4sd.github.io/docling/usage/) are available in
+More [advanced usage options](https://docling-project.github.io/docling/usage/) are available in
 the docs.

 ## Documentation

-Check out Docling's [documentation](https://ds4sd.github.io/docling/), for details on
+Check out Docling's [documentation](https://docling-project.github.io/docling/), for details on
 installation, usage, concepts, recipes, extensions, and more.

 ## Examples

-Go hands-on with our [examples](https://ds4sd.github.io/docling/examples/),
+Go hands-on with our [examples](https://docling-project.github.io/docling/examples/),
 demonstrating how to address different application use cases with Docling.

 ## Integrations

 To further accelerate your AI application development, check out Docling's native
-[integrations](https://ds4sd.github.io/docling/integrations/) with popular frameworks
+[integrations](https://docling-project.github.io/docling/integrations/) with popular frameworks
 and tools.

 ## Get help and support

-Please feel free to connect with us using the [discussion section](https://github.com/DS4SD/docling/discussions).
+Please feel free to connect with us using the [discussion section](https://github.com/docling-project/docling/discussions).

 ## Technical report
@@ -95,7 +95,7 @@ For more details on Docling's inner workings, check out the [Docling Technical R

 ## Contributing

-Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.
+Please read [Contributing to Docling](https://github.com/docling-project/docling/blob/main/CONTRIBUTING.md) for details.

 ## References
@@ -123,6 +123,6 @@ For individual model usage, please refer to the model licenses found in the orig

 Docling has been brought to you by IBM.

-[supported_formats]: https://ds4sd.github.io/docling/usage/supported_formats/
-[docling_document]: https://ds4sd.github.io/docling/concepts/docling_document/
-[integrations]: https://ds4sd.github.io/docling/integrations/
+[supported_formats]: https://docling-project.github.io/docling/usage/supported_formats/
+[docling_document]: https://docling-project.github.io/docling/concepts/docling_document/
+[integrations]: https://docling-project.github.io/docling/integrations/
@@ -380,7 +380,7 @@ class AsciiDocBackend(DeclarativeDocumentBackend):
     end_row_offset_idx=row_idx + row_span,
     start_col_offset_idx=col_idx,
     end_col_offset_idx=col_idx + col_span,
-    col_header=False,
+    column_header=row_idx == 0,
     row_header=False,
 )
 data.table_cells.append(cell)
@@ -111,7 +111,7 @@ class CsvDocumentBackend(DeclarativeDocumentBackend):
     end_row_offset_idx=row_idx + 1,
     start_col_offset_idx=col_idx,
     end_col_offset_idx=col_idx + 1,
-    col_header=row_idx == 0,  # First row as header
+    column_header=row_idx == 0,  # First row as header
     row_header=False,
 )
 table_data.table_cells.append(cell)
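Note on the CSV hunk: the renamed keyword flags every cell of the first row as a column header. The rule can be sketched in isolation as follows. `Cell` is a hypothetical stand-in for the table-cell model (its real field set is not shown in this diff), and `cells_from_csv` is an illustrative helper, not a function from the backend.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """Hypothetical stand-in for the table-cell model used by the backends."""

    text: str
    start_row_offset_idx: int
    end_row_offset_idx: int
    start_col_offset_idx: int
    end_col_offset_idx: int
    column_header: bool
    row_header: bool = False


def cells_from_csv(content: str) -> List[Cell]:
    """Build one Cell per CSV field; cells in row 0 are flagged as headers."""
    cells = []
    for row_idx, row in enumerate(csv.reader(io.StringIO(content))):
        for col_idx, text in enumerate(row):
            cells.append(
                Cell(
                    text=text,
                    start_row_offset_idx=row_idx,
                    end_row_offset_idx=row_idx + 1,
                    start_col_offset_idx=col_idx,
                    end_col_offset_idx=col_idx + 1,
                    column_header=row_idx == 0,  # first row as header
                )
            )
    return cells


cells = cells_from_csv("name,qty\nbolt,4\n")
headers = [c.text for c in cells if c.column_header]
```

Here `headers` collects only the fields of the first row; every later row keeps `column_header=False`.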
0
docling/backend/docx/__init__.py
Normal file
0
docling/backend/docx/latex/__init__.py
Normal file
271
docling/backend/docx/latex/latex_dict.py
Normal file
@@ -0,0 +1,271 @@
+# -*- coding: utf-8 -*-
+
+"""
+Adapted from https://github.com/xiilei/dwml/blob/master/dwml/latex_dict.py
+On 23/01/2025
+"""
+
+from __future__ import unicode_literals
+
+CHARS = ("{", "}", "_", "^", "#", "&", "$", "%", "~")
+
+BLANK = ""
+BACKSLASH = "\\"
+ALN = "&"
+
+CHR = {
+    # Unicode : Latex Math Symbols
+    # Top accents
+    "\u0300": "\\grave{{{0}}}",
+    "\u0301": "\\acute{{{0}}}",
+    "\u0302": "\\hat{{{0}}}",
+    "\u0303": "\\tilde{{{0}}}",
+    "\u0304": "\\bar{{{0}}}",
+    "\u0305": "\\overbar{{{0}}}",
+    "\u0306": "\\breve{{{0}}}",
+    "\u0307": "\\dot{{{0}}}",
+    "\u0308": "\\ddot{{{0}}}",
+    "\u0309": "\\ovhook{{{0}}}",
+    "\u030a": "\\ocirc{{{0}}}",
+    "\u030c": "\\check{{{0}}}",
+    "\u0310": "\\candra{{{0}}}",
+    "\u0312": "\\oturnedcomma{{{0}}}",
+    "\u0315": "\\ocommatopright{{{0}}}",
+    "\u031a": "\\droang{{{0}}}",
+    "\u0338": "\\not{{{0}}}",
+    "\u20d0": "\\leftharpoonaccent{{{0}}}",
+    "\u20d1": "\\rightharpoonaccent{{{0}}}",
+    "\u20d2": "\\vertoverlay{{{0}}}",
+    "\u20d6": "\\overleftarrow{{{0}}}",
+    "\u20d7": "\\vec{{{0}}}",
+    "\u20db": "\\dddot{{{0}}}",
+    "\u20dc": "\\ddddot{{{0}}}",
+    "\u20e1": "\\overleftrightarrow{{{0}}}",
+    "\u20e7": "\\annuity{{{0}}}",
+    "\u20e9": "\\widebridgeabove{{{0}}}",
+    "\u20f0": "\\asteraccent{{{0}}}",
+    # Bottom accents
+    "\u0330": "\\wideutilde{{{0}}}",
+    "\u0331": "\\underbar{{{0}}}",
+    "\u20e8": "\\threeunderdot{{{0}}}",
+    "\u20ec": "\\underrightharpoondown{{{0}}}",
+    "\u20ed": "\\underleftharpoondown{{{0}}}",
+    "\u20ee": "\\underleftarrow{{{0}}}",
+    "\u20ef": "\\underrightarrow{{{0}}}",
+    # Over | group
+    "\u23b4": "\\overbracket{{{0}}}",
+    "\u23dc": "\\overparen{{{0}}}",
+    "\u23de": "\\overbrace{{{0}}}",
+    # Under | group
+    "\u23b5": "\\underbracket{{{0}}}",
+    "\u23dd": "\\underparen{{{0}}}",
+    "\u23df": "\\underbrace{{{0}}}",
+}
+
+CHR_BO = {
+    # Big operators,
+    "\u2140": "\\Bbbsum",
+    "\u220f": "\\prod",
+    "\u2210": "\\coprod",
+    "\u2211": "\\sum",
+    "\u222b": "\\int",
+    "\u22c0": "\\bigwedge",
+    "\u22c1": "\\bigvee",
+    "\u22c2": "\\bigcap",
+    "\u22c3": "\\bigcup",
+    "\u2a00": "\\bigodot",
+    "\u2a01": "\\bigoplus",
+    "\u2a02": "\\bigotimes",
+}
+
+T = {
+    "\u2192": "\\rightarrow ",
+    # Greek letters
+    "\U0001d6fc": "\\alpha ",
+    "\U0001d6fd": "\\beta ",
+    "\U0001d6fe": "\\gamma ",
+    "\U0001d6ff": "\\delta ",
+    "\U0001d700": "\\epsilon ",
+    "\U0001d701": "\\zeta ",
+    "\U0001d702": "\\eta ",
+    "\U0001d703": "\\theta ",
+    "\U0001d704": "\\iota ",
+    "\U0001d705": "\\kappa ",
+    "\U0001d706": "\\lambda ",
+    "\U0001d707": "\\mu ",
+    "\U0001d708": "\\nu ",
+    "\U0001d709": "\\xi ",
+    "\U0001d70a": "\\omicron ",
+    "\U0001d70b": "\\pi ",
+    "\U0001d70c": "\\rho ",
+    "\U0001d70d": "\\varsigma ",
+    "\U0001d70e": "\\sigma ",
+    "\U0001d70f": "\\tau ",
+    "\U0001d710": "\\upsilon ",
+    "\U0001d711": "\\phi ",
+    "\U0001d712": "\\chi ",
+    "\U0001d713": "\\psi ",
+    "\U0001d714": "\\omega ",
+    "\U0001d715": "\\partial ",
+    "\U0001d716": "\\varepsilon ",
+    "\U0001d717": "\\vartheta ",
+    "\U0001d718": "\\varkappa ",
+    "\U0001d719": "\\varphi ",
+    "\U0001d71a": "\\varrho ",
+    "\U0001d71b": "\\varpi ",
+    # Relation symbols
+    "\u2190": "\\leftarrow ",
+    "\u2191": "\\uparrow ",
+    "\u2192": "\\rightarrow ",
+    "\u2193": "\\downarrow ",
+    "\u2194": "\\leftrightarrow ",
+    "\u2195": "\\updownarrow ",
+    "\u2196": "\\nwarrow ",
+    "\u2197": "\\nearrow ",
+    "\u2198": "\\searrow ",
+    "\u2199": "\\swarrow ",
+    "\u22ee": "\\vdots ",
+    "\u22ef": "\\cdots ",
+    "\u22f0": "\\adots ",
+    "\u22f1": "\\ddots ",
+    "\u2260": "\\ne ",
+    "\u2264": "\\leq ",
+    "\u2265": "\\geq ",
+    "\u2266": "\\leqq ",
+    "\u2267": "\\geqq ",
+    "\u2268": "\\lneqq ",
+    "\u2269": "\\gneqq ",
+    "\u226a": "\\ll ",
+    "\u226b": "\\gg ",
+    "\u2208": "\\in ",
+    "\u2209": "\\notin ",
+    "\u220b": "\\ni ",
+    "\u220c": "\\nni ",
+    # Ordinary symbols
+    "\u221e": "\\infty ",
+    # Binary relations
+    "\u00b1": "\\pm ",
+    "\u2213": "\\mp ",
+    # Italic, Latin, uppercase
+    "\U0001d434": "A",
+    "\U0001d435": "B",
+    "\U0001d436": "C",
+    "\U0001d437": "D",
+    "\U0001d438": "E",
+    "\U0001d439": "F",
+    "\U0001d43a": "G",
+    "\U0001d43b": "H",
+    "\U0001d43c": "I",
+    "\U0001d43d": "J",
+    "\U0001d43e": "K",
+    "\U0001d43f": "L",
+    "\U0001d440": "M",
+    "\U0001d441": "N",
+    "\U0001d442": "O",
+    "\U0001d443": "P",
+    "\U0001d444": "Q",
+    "\U0001d445": "R",
+    "\U0001d446": "S",
+    "\U0001d447": "T",
+    "\U0001d448": "U",
+    "\U0001d449": "V",
+    "\U0001d44a": "W",
+    "\U0001d44b": "X",
+    "\U0001d44c": "Y",
+    "\U0001d44d": "Z",
+    # Italic, Latin, lowercase
+    "\U0001d44e": "a",
+    "\U0001d44f": "b",
+    "\U0001d450": "c",
+    "\U0001d451": "d",
+    "\U0001d452": "e",
+    "\U0001d453": "f",
+    "\U0001d454": "g",
+    "\U0001d456": "i",
+    "\U0001d457": "j",
+    "\U0001d458": "k",
+    "\U0001d459": "l",
+    "\U0001d45a": "m",
+    "\U0001d45b": "n",
+    "\U0001d45c": "o",
+    "\U0001d45d": "p",
+    "\U0001d45e": "q",
+    "\U0001d45f": "r",
+    "\U0001d460": "s",
+    "\U0001d461": "t",
+    "\U0001d462": "u",
+    "\U0001d463": "v",
+    "\U0001d464": "w",
+    "\U0001d465": "x",
+    "\U0001d466": "y",
+    "\U0001d467": "z",
+}
+
+FUNC = {
+    "sin": "\\sin({fe})",
+    "cos": "\\cos({fe})",
+    "tan": "\\tan({fe})",
+    "arcsin": "\\arcsin({fe})",
+    "arccos": "\\arccos({fe})",
+    "arctan": "\\arctan({fe})",
+    "arccot": "\\arccot({fe})",
+    "sinh": "\\sinh({fe})",
+    "cosh": "\\cosh({fe})",
+    "tanh": "\\tanh({fe})",
+    "coth": "\\coth({fe})",
+    "sec": "\\sec({fe})",
+    "csc": "\\csc({fe})",
+}
+
+FUNC_PLACE = "{fe}"
+
+BRK = "\\\\"
+
+CHR_DEFAULT = {
+    "ACC_VAL": "\\hat{{{0}}}",
+}
+
+POS = {
+    "top": "\\overline{{{0}}}",  # not sure
+    "bot": "\\underline{{{0}}}",
+}
+
+POS_DEFAULT = {
+    "BAR_VAL": "\\overline{{{0}}}",
+}
+
+SUB = "_{{{0}}}"
+
+SUP = "^{{{0}}}"
+
+F = {
+    "bar": "\\frac{{{num}}}{{{den}}}",
+    "skw": r"^{{{num}}}/_{{{den}}}",
+    "noBar": "\\genfrac{{}}{{}}{{0pt}}{{}}{{{num}}}{{{den}}}",
+    "lin": "{{{num}}}/{{{den}}}",
+}
+F_DEFAULT = "\\frac{{{num}}}{{{den}}}"
+
+D = "\\left{left}{text}\\right{right}"
+
+D_DEFAULT = {
+    "left": "(",
+    "right": ")",
+    "null": ".",
+}
+
+RAD = "\\sqrt[{deg}]{{{text}}}"
+RAD_DEFAULT = "\\sqrt{{{text}}}"
+ARR = "{text}"
+
+LIM_FUNC = {
+    "lim": "\\lim_{{{lim}}}",
+    "max": "\\max_{{{lim}}}",
+    "min": "\\min_{{{lim}}}",
+}
+
+LIM_TO = ("\\rightarrow", "\\to")
+
+LIM_UPP = "\\overset{{{lim}}}{{{text}}}"
+
+M = "\\begin{{matrix}}{text}\\end{{matrix}}"
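Note on `latex_dict.py`: most values are `str.format` templates, so a doubled brace stands for a literal brace and `{{{num}}}` renders as `{`, the value of `num`, then `}`. A minimal sketch of how such templates expand, with the template strings copied from the file and the sample arguments chosen only for illustration:

```python
# Template strings copied from latex_dict.py; the arguments are made up.
F_DEFAULT = "\\frac{{{num}}}{{{den}}}"
RAD = "\\sqrt[{deg}]{{{text}}}"
RAD_DEFAULT = "\\sqrt{{{text}}}"
SUB = "_{{{0}}}"
SUP = "^{{{0}}}"

# Named fields fill the fraction and radical templates.
frac = F_DEFAULT.format(num="a", den="b")
cube_root = RAD.format(deg="3", text="x")
square_root = RAD_DEFAULT.format(text="x")

# Positional field 0 fills the sub/superscript templates.
power = "x" + SUP.format("2") + SUB.format("i")
```

A single unescaped `{` or `}` in such a template makes `str.format` raise `ValueError`, so the braces in every entry have to balance.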
453
docling/backend/docx/latex/omml.py
Normal file
@@ -0,0 +1,453 @@
+"""
+Office Math Markup Language (OMML)
+
+Adapted from https://github.com/xiilei/dwml/blob/master/dwml/omml.py
+On 23/01/2025
+"""
+
+import lxml.etree as ET
+from pylatexenc.latexencode import UnicodeToLatexEncoder
+
+from docling.backend.docx.latex.latex_dict import (
+    ALN,
+    ARR,
+    BACKSLASH,
+    BLANK,
+    BRK,
+    CHARS,
+    CHR,
+    CHR_BO,
+    CHR_DEFAULT,
+    D_DEFAULT,
+    F_DEFAULT,
+    FUNC,
+    FUNC_PLACE,
+    LIM_FUNC,
+    LIM_TO,
+    LIM_UPP,
+    POS,
+    POS_DEFAULT,
+    RAD,
+    RAD_DEFAULT,
+    SUB,
+    SUP,
+    D,
+    F,
+    M,
+    T,
+)
+
+OMML_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/math}"
+
+
+def load(stream):
+    tree = ET.parse(stream)
+    for omath in tree.findall(OMML_NS + "oMath"):
+        yield oMath2Latex(omath)
+
+
+def load_string(string):
+    root = ET.fromstring(string)
+    for omath in root.findall(OMML_NS + "oMath"):
+        yield oMath2Latex(omath)
+
+
+def escape_latex(strs):
+    last = None
+    new_chr = []
+    strs = strs.replace(r"\\", "\\")
+    for c in strs:
+        if (c in CHARS) and (last != BACKSLASH):
+            new_chr.append(BACKSLASH + c)
+        else:
+            new_chr.append(c)
+        last = c
+    return BLANK.join(new_chr)
+
+
+def get_val(key, default=None, store=CHR):
+    if key is not None:
+        return key if not store else store.get(key, key)
+    else:
+        return default
+
+
+class Tag2Method(object):
+
+    def call_method(self, elm, stag=None):
+        getmethod = self.tag2meth.get
+        if stag is None:
+            stag = elm.tag.replace(OMML_NS, "")
+        method = getmethod(stag)
+        if method:
+            return method(self, elm)
+        else:
+            return None
+
+    def process_children_list(self, elm, include=None):
+        """
+        process children of the elm,return iterable
+        """
+        for _e in list(elm):
+            if OMML_NS not in _e.tag:
+                continue
+            stag = _e.tag.replace(OMML_NS, "")
+            if include and (stag not in include):
+                continue
+            t = self.call_method(_e, stag=stag)
+            if t is None:
+                t = self.process_unknow(_e, stag)
+                if t is None:
+                    continue
+            yield (stag, t, _e)
+
+    def process_children_dict(self, elm, include=None):
+        """
+        process children of the elm,return dict
+        """
+        latex_chars = dict()
+        for stag, t, e in self.process_children_list(elm, include):
+            latex_chars[stag] = t
+        return latex_chars
+
+    def process_children(self, elm, include=None):
+        """
+        process children of the elm,return string
+        """
+        return BLANK.join(
+            (
+                t if not isinstance(t, Tag2Method) else str(t)
+                for stag, t, e in self.process_children_list(elm, include)
+            )
+        )
+
+    def process_unknow(self, elm, stag):
+        return None
+
+
+class Pr(Tag2Method):
+
+    text = ""
+
+    __val_tags = ("chr", "pos", "begChr", "endChr", "type")
+
+    __innerdict = None  # can't use the __dict__
+
+    """ common properties of element"""
+
+    def __init__(self, elm):
+        self.__innerdict = {}
+        self.text = self.process_children(elm)
+
+    def __str__(self):
+        return self.text
+
+    def __unicode__(self):
+        return self.__str__()
+
+    def __getattr__(self, name):
+        return self.__innerdict.get(name, None)
+
+    def do_brk(self, elm):
+        self.__innerdict["brk"] = BRK
+        return BRK
+
+    def do_common(self, elm):
+        stag = elm.tag.replace(OMML_NS, "")
+        if stag in self.__val_tags:
+            t = elm.get("{0}val".format(OMML_NS))
+            self.__innerdict[stag] = t
+        return None
+
+    tag2meth = {
+        "brk": do_brk,
+        "chr": do_common,
+        "pos": do_common,
+        "begChr": do_common,
+        "endChr": do_common,
+        "type": do_common,
+    }
+
+
+class oMath2Latex(Tag2Method):
+    """
+    Convert oMath element of omml to latex
+    """
+
+    _t_dict = T
+
+    __direct_tags = ("box", "sSub", "sSup", "sSubSup", "num", "den", "deg", "e")
+
+    u = UnicodeToLatexEncoder(
+        replacement_latex_protection="braces-all",
+        unknown_char_policy="keep",
+        unknown_char_warning=False,
+    )
+
+    def __init__(self, element):
+        self._latex = self.process_children(element)
+
+    def __str__(self):
+        return self.latex.replace("  ", " ")
+
+    def __unicode__(self):
+        return self.__str__()
+
+    def process_unknow(self, elm, stag):
+        if stag in self.__direct_tags:
+            return self.process_children(elm)
+        elif stag[-2:] == "Pr":
+            return Pr(elm)
+        else:
+            return None
+
+    @property
+    def latex(self):
+        return self._latex
+
+    def do_acc(self, elm):
+        """
+        the accent function
+        """
+        c_dict = self.process_children_dict(elm)
+        latex_s = get_val(
+            c_dict["accPr"].chr, default=CHR_DEFAULT.get("ACC_VAL"), store=CHR
+        )
+        return latex_s.format(c_dict["e"])
+
+    def do_bar(self, elm):
+        """
+        the bar function
+        """
+        c_dict = self.process_children_dict(elm)
+        pr = c_dict["barPr"]
+        latex_s = get_val(pr.pos, default=POS_DEFAULT.get("BAR_VAL"), store=POS)
+        return pr.text + latex_s.format(c_dict["e"])
+
+    def do_d(self, elm):
+        """
+        the delimiter object
+        """
+        c_dict = self.process_children_dict(elm)
+        pr = c_dict["dPr"]
+        null = D_DEFAULT.get("null")
+
+        s_val = get_val(pr.begChr, default=D_DEFAULT.get("left"), store=T)
+        e_val = get_val(pr.endChr, default=D_DEFAULT.get("right"), store=T)
+        delim = pr.text + D.format(
+            left=null if not s_val else escape_latex(s_val),
+            text=c_dict["e"],
+            right=null if not e_val else escape_latex(e_val),
+        )
+        return delim
+
+    def do_spre(self, elm):
+        """
+        the Pre-Sub-Superscript object -- Not support yet
+        """
+        pass
+
+    def do_sub(self, elm):
+        text = self.process_children(elm)
+        return SUB.format(text)
+
+    def do_sup(self, elm):
+        text = self.process_children(elm)
+        return SUP.format(text)
+
+    def do_f(self, elm):
+        """
+        the fraction object
+        """
+        c_dict = self.process_children_dict(elm)
+        pr = c_dict["fPr"]
+        latex_s = get_val(pr.type, default=F_DEFAULT, store=F)
+        return pr.text + latex_s.format(num=c_dict.get("num"), den=c_dict.get("den"))
+
+    def do_func(self, elm):
+        """
+        the Function-Apply object (Examples:sin cos)
+        """
+        c_dict = self.process_children_dict(elm)
+        func_name = c_dict.get("fName")
+        return func_name.replace(FUNC_PLACE, c_dict.get("e"))
+
+    def do_fname(self, elm):
+        """
+        the func name
+        """
+        latex_chars = []
+        for stag, t, e in self.process_children_list(elm):
+            if stag == "r":
+                if FUNC.get(t):
+                    latex_chars.append(FUNC[t])
+                else:
+                    raise NotImplementedError("Not support func %s" % t)
+            else:
+                latex_chars.append(t)
+        t = BLANK.join(latex_chars)
+        return t if FUNC_PLACE in t else t + FUNC_PLACE  # do_func will replace this
+
+    def do_groupchr(self, elm):
+        """
+        the Group-Character object
+        """
+        c_dict = self.process_children_dict(elm)
+        pr = c_dict["groupChrPr"]
+        latex_s = get_val(pr.chr)
+        return pr.text + latex_s.format(c_dict["e"])
+
+    def do_rad(self, elm):
+        """
+        the radical object
+        """
+        c_dict = self.process_children_dict(elm)
+        text = c_dict.get("e")
+        deg_text = c_dict.get("deg")
+        if deg_text:
+            return RAD.format(deg=deg_text, text=text)
+        else:
+            return RAD_DEFAULT.format(text=text)
+
+    def do_eqarr(self, elm):
+        """
+        the Array object
+        """
+        return ARR.format(
+            text=BRK.join(
+                [t for stag, t, e in self.process_children_list(elm, include=("e",))]
+            )
+        )
+
+    def do_limlow(self, elm):
+        """
+        the Lower-Limit object
+        """
+        t_dict = self.process_children_dict(elm, include=("e", "lim"))
+        latex_s = LIM_FUNC.get(t_dict["e"])
+        if not latex_s:
+            raise NotImplementedError("Not support lim %s" % t_dict["e"])
+        else:
+            return latex_s.format(lim=t_dict.get("lim"))
+
+    def do_limupp(self, elm):
+        """
+        the Upper-Limit object
+        """
+        t_dict = self.process_children_dict(elm, include=("e", "lim"))
+        return LIM_UPP.format(lim=t_dict.get("lim"), text=t_dict.get("e"))
+
+    def do_lim(self, elm):
+        """
+        the lower limit of the limLow object and the upper limit of the limUpp function
+        """
+        return self.process_children(elm).replace(LIM_TO[0], LIM_TO[1])
+
+    def do_m(self, elm):
+        """
+        the Matrix object
+        """
+        rows = []
+        for stag, t, e in self.process_children_list(elm):
+            if stag == "mPr":
+                pass
+            elif stag == "mr":
+                rows.append(t)
+        return M.format(text=BRK.join(rows))
+
+    def do_mr(self, elm):
+        """
+        a single row of the matrix m
+        """
+        return ALN.join(
+            [t for stag, t, e in self.process_children_list(elm, include=("e",))]
+        )
+
+    def do_nary(self, elm):
+        """
+        the n-ary object
+        """
+        res = []
+        bo = ""
+        for stag, t, e in self.process_children_list(elm):
+            if stag == "naryPr":
+                bo = get_val(t.chr, store=CHR_BO)
+            else:
+                res.append(t)
+        return bo + BLANK.join(res)
+
+    def process_unicode(self, s):
+        # s = s if isinstance(s,unicode) else unicode(s,'utf-8')
+        # print(s, self._t_dict.get(s, s), unicode_to_latex(s))
+        # _str.append( self._t_dict.get(s, s) )
+
+        out_latex_str = self.u.unicode_to_latex(s)
+
+        # print(s, out_latex_str)
+
+        if (
+            s.startswith("{") is False
+            and out_latex_str.startswith("{")
+            and s.endswith("}") is False
+            and out_latex_str.endswith("}")
+        ):
+            out_latex_str = f" {out_latex_str[1:-1]} "
+
+        # print(s, out_latex_str)
+
+        if "ensuremath" in out_latex_str:
+            out_latex_str = out_latex_str.replace("\\ensuremath{", " ")
+            out_latex_str = out_latex_str.replace("}", " ")
+
+        # print(s, out_latex_str)
+
+        if out_latex_str.strip().startswith("\\text"):
+            out_latex_str = f" \\text{{{out_latex_str}}} "
+
+        # print(s, out_latex_str)
+
+        return out_latex_str
+
+    def do_r(self, elm):
+        """
+        Get text from 'r' element,And try convert them to latex symbols
+        @todo text style support , (sty)
+        @todo \text (latex pure text support)
+        """
+        _str = []
+        _base_str = []
+        for s in elm.findtext("./{0}t".format(OMML_NS)):
+            out_latex_str = self.process_unicode(s)
+            _str.append(out_latex_str)
+            _base_str.append(s)
+
+        proc_str = escape_latex(BLANK.join(_str))
+        base_proc_str = BLANK.join(_base_str)
+
+        if "{" not in base_proc_str and "\\{" in proc_str:
+            proc_str = proc_str.replace("\\{", "{")
+
+        if "}" not in base_proc_str and "\\}" in proc_str:
+            proc_str = proc_str.replace("\\}", "}")
+
+        return proc_str
+
+    tag2meth = {
+        "acc": do_acc,
+        "r": do_r,
+        "bar": do_bar,
+        "sub": do_sub,
+        "sup": do_sup,
+        "f": do_f,
+        "func": do_func,
+        "fName": do_fname,
+        "groupChr": do_groupchr,
+        "d": do_d,
+        "rad": do_rad,
+        "eqArr": do_eqarr,
+        "limLow": do_limlow,
+        "limUpp": do_limupp,
+        "lim": do_lim,
+        "m": do_m,
+        "mr": do_mr,
+        "nary": do_nary,
+    }
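Note on `omml.py`: the converter strips the OMML namespace from each element tag, looks up a handler in a class-level `tag2meth` dict, and falls back to processing the children when no handler matches. A minimal, self-contained sketch of that dispatch pattern follows. It uses the standard-library `xml.etree.ElementTree` instead of `lxml`, and `MiniConverter` is a hypothetical class that covers only runs, fractions and superscripts; it is not the class added in this commit.

```python
import xml.etree.ElementTree as ET

OMML_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/math}"


class MiniConverter:
    """Illustrative only: handles runs, fractions and superscripts."""

    def convert(self, elm):
        # Strip the namespace, then look up a handler for the bare tag name.
        stag = elm.tag.replace(OMML_NS, "")
        handler = self.tag2meth.get(stag)
        if handler is not None:
            return handler(self, elm)
        # Tags without a handler fall back to concatenating their children.
        return self.children(elm)

    def children(self, elm):
        return "".join(self.convert(child) for child in elm)

    def do_r(self, elm):
        # A run carries its text in an m:t child.
        return elm.findtext(OMML_NS + "t") or ""

    def do_f(self, elm):
        num = self.children(elm.find(OMML_NS + "num"))
        den = self.children(elm.find(OMML_NS + "den"))
        return "\\frac{{{num}}}{{{den}}}".format(num=num, den=den)

    def do_sup(self, elm):
        return "^{{{0}}}".format(self.children(elm))

    # Plain functions stored in a dict, so handlers are called as f(self, elm).
    tag2meth = {"r": do_r, "f": do_f, "sup": do_sup}


M_URI = OMML_NS[1:-1]
fraction_xml = (
    f'<m:oMath xmlns:m="{M_URI}">'
    "<m:f><m:num><m:r><m:t>a</m:t></m:r></m:num>"
    "<m:den><m:r><m:t>b</m:t></m:r></m:den></m:f>"
    "</m:oMath>"
)
power_xml = (
    f'<m:oMath xmlns:m="{M_URI}">'
    "<m:sSup><m:e><m:r><m:t>x</m:t></m:r></m:e>"
    "<m:sup><m:r><m:t>2</m:t></m:r></m:sup></m:sSup>"
    "</m:oMath>"
)
fraction_latex = MiniConverter().convert(ET.fromstring(fraction_xml))
power_latex = MiniConverter().convert(ET.fromstring(power_xml))
```

Because the dict is built inside the class body, it must come after the handler definitions, which is why `tag2meth` sits at the end of each class in the diff.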
@@ -139,7 +139,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
         self.analyze_tag(cast(Tag, element), doc)
     except Exception as exc_child:
         _log.error(
-            f"Error processing child from tag{tag.name}: {exc_child}"
+            f"Error processing child from tag {tag.name}: {repr(exc_child)}"
         )
         raise exc_child
 elif isinstance(element, NavigableString) and not isinstance(
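Note on the logging change: formatting an exception with `repr()` keeps the exception class name, while the default `str()` formatting shows only the message, which can even be empty. A small sketch of the difference, with a made-up failing lookup in place of the real tag handling:

```python
import logging

_log = logging.getLogger(__name__)

try:
    {}["missing"]  # made-up failure standing in for a tag-processing error
except KeyError as exc:
    as_str = f"Error processing child: {exc}"
    as_repr = f"Error processing child: {repr(exc)}"
    _log.error(as_repr)

# An exception raised without a message formats to an empty string via str().
empty = ValueError()
empty_str = f"{empty}"
empty_repr = f"{repr(empty)}"
```

With `repr()`, the log line still identifies the failure type when the message alone is ambiguous or blank.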
@@ -391,11 +391,11 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
 content_layer=self.content_layer,
 )
 self.level += 1
-
-self.walk(element, doc)
-
-self.parents[self.level + 1] = None
-self.level -= 1
+self.walk(element, doc)
+self.parents[self.level + 1] = None
+self.level -= 1
+else:
+self.walk(element, doc)

 elif element.text.strip():
 text = element.text.strip()
@@ -501,7 +501,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
     end_row_offset_idx=row_idx + row_span,
     start_col_offset_idx=col_idx,
     end_col_offset_idx=col_idx + col_span,
-    col_header=col_header,
+    column_header=col_header,
     row_header=((not col_header) and html_cell.name == "th"),
 )
 data.table_cells.append(table_cell)
@@ -136,7 +136,7 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
     end_row_offset_idx=trow_ind + row_span,
     start_col_offset_idx=tcol_ind,
     end_col_offset_idx=tcol_ind + col_span,
-    col_header=False,
+    column_header=trow_ind == 0,
     row_header=False,
 )
 tcells.append(icell)
@@ -164,7 +164,7 @@ class MsExcelDocumentBackend(DeclarativeDocumentBackend):
     end_row_offset_idx=excel_cell.row + excel_cell.row_span,
     start_col_offset_idx=excel_cell.col,
     end_col_offset_idx=excel_cell.col + excel_cell.col_span,
-    col_header=False,
+    column_header=excel_cell.row == 0,
     row_header=False,
 )
 table_data.table_cells.append(cell)
@ -173,7 +173,7 @@ class MsExcelDocumentBackend(DeclarativeDocumentBackend):
|
||||
|
||||
return doc
|
||||
|
||||
def _find_data_tables(self, sheet: Worksheet):
|
||||
def _find_data_tables(self, sheet: Worksheet) -> List[ExcelTable]:
|
||||
"""
|
||||
Find all compact rectangular data tables in a sheet.
|
||||
"""
|
||||
@ -340,47 +340,4 @@ class MsExcelDocumentBackend(DeclarativeDocumentBackend):
|
||||
except:
|
||||
_log.error("could not extract the image from excel sheets")
|
||||
|
||||
"""
|
||||
for idx, chart in enumerate(sheet._charts): # type: ignore
|
||||
try:
|
||||
chart_path = f"chart_{idx + 1}.png"
|
||||
_log.info(
|
||||
f"Chart found, but dynamic rendering is required for: {chart_path}"
|
||||
)
|
||||
|
||||
_log.info(f"Chart {idx + 1}:")
|
||||
|
||||
# Chart type
|
||||
# _log.info(f"Type: {type(chart).__name__}")
|
||||
print(f"Type: {type(chart).__name__}")
|
||||
|
||||
# Extract series data
|
||||
for series_idx, series in enumerate(chart.series):
|
||||
#_log.info(f"Series {series_idx + 1}:")
|
||||
print(f"Series {series_idx + 1} type: {type(series).__name__}")
|
||||
#print(f"x-values: {series.xVal}")
|
||||
#print(f"y-values: {series.yVal}")
|
||||
|
||||
print(f"xval type: {type(series.xVal).__name__}")
|
||||
|
||||
xvals = []
|
||||
for _ in series.xVal.numLit.pt:
|
||||
print(f"xval type: {type(_).__name__}")
|
||||
if hasattr(_, 'v'):
|
||||
xvals.append(_.v)
|
||||
|
||||
print(f"x-values: {xvals}")
|
||||
|
||||
yvals = []
|
||||
for _ in series.yVal:
|
||||
if hasattr(_, 'v'):
|
||||
yvals.append(_.v)
|
||||
|
||||
print(f"y-values: {yvals}")
|
||||
|
||||
except Exception as exc:
|
||||
print(exc)
|
||||
continue
|
||||
"""
|
||||
|
||||
return doc
|
||||
|
@ -346,7 +346,7 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
|
||||
end_row_offset_idx=row_idx + row_span,
|
||||
start_col_offset_idx=col_idx,
|
||||
end_col_offset_idx=col_idx + col_span,
|
||||
col_header=False,
|
||||
column_header=row_idx == 0,
|
||||
row_header=False,
|
||||
)
|
||||
if len(cell.text.strip()) > 0:
|
||||
|
@ -26,6 +26,7 @@ from PIL import Image, UnidentifiedImageError
|
||||
from typing_extensions import override
|
||||
|
||||
from docling.backend.abstract_backend import DeclarativeDocumentBackend
|
||||
from docling.backend.docx.latex.omml import oMath2Latex
|
||||
from docling.datamodel.base_models import InputFormat
|
||||
from docling.datamodel.document import InputDocument
|
||||
|
||||
@ -260,6 +261,25 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
||||
else:
|
||||
return label, None
|
||||
|
||||
def handle_equations_in_text(self, element, text):
|
||||
only_texts = []
|
||||
only_equations = []
|
||||
texts_and_equations = []
|
||||
for subt in element.iter():
|
||||
tag_name = etree.QName(subt).localname
|
||||
if tag_name == "t" and "math" not in subt.tag:
|
||||
only_texts.append(subt.text)
|
||||
texts_and_equations.append(subt.text)
|
||||
elif "oMath" in subt.tag and "oMathPara" not in subt.tag:
|
||||
latex_equation = str(oMath2Latex(subt))
|
||||
only_equations.append(latex_equation)
|
||||
texts_and_equations.append(latex_equation)
|
||||
|
||||
if "".join(only_texts) != text:
|
||||
return text
|
||||
|
||||
return "".join(texts_and_equations), only_equations
|
||||
|
||||
def handle_text_elements(
|
||||
self,
|
||||
element: BaseOxmlElement,
|
||||
@ -268,9 +288,12 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
||||
) -> None:
|
||||
paragraph = Paragraph(element, docx_obj)
|
||||
|
||||
if paragraph.text is None:
|
||||
raw_text = paragraph.text
|
||||
text, equations = self.handle_equations_in_text(element=element, text=raw_text)
|
||||
|
||||
if text is None:
|
||||
return
|
||||
text = paragraph.text.strip()
|
||||
text = text.strip()
|
||||
|
||||
# Common styles for bullet and numbered lists.
|
||||
# "List Bullet", "List Number", "List Paragraph"
|
||||
@ -323,6 +346,45 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
||||
elif "Heading" in p_style_id:
|
||||
self.add_header(doc, p_level, text)
|
||||
|
||||
elif len(equations) > 0:
|
||||
if (raw_text is None or len(raw_text) == 0) and len(text) > 0:
|
||||
# Standalone equation
|
||||
level = self.get_level()
|
||||
doc.add_text(
|
||||
label=DocItemLabel.FORMULA,
|
||||
parent=self.parents[level - 1],
|
||||
text=text,
|
||||
)
|
||||
else:
|
||||
# Inline equation
|
||||
level = self.get_level()
|
||||
inline_equation = doc.add_group(
|
||||
label=GroupLabel.INLINE, parent=self.parents[level - 1]
|
||||
)
|
||||
text_tmp = text
|
||||
for eq in equations:
|
||||
if len(text_tmp) == 0:
|
||||
break
|
||||
pre_eq_text = text_tmp.split(eq, maxsplit=1)[0]
|
||||
text_tmp = text_tmp.split(eq, maxsplit=1)[1]
|
||||
if len(pre_eq_text) > 0:
|
||||
doc.add_text(
|
||||
label=DocItemLabel.PARAGRAPH,
|
||||
parent=inline_equation,
|
||||
text=pre_eq_text,
|
||||
)
|
||||
doc.add_text(
|
||||
label=DocItemLabel.FORMULA,
|
||||
parent=inline_equation,
|
||||
text=eq,
|
||||
)
|
||||
if len(text_tmp) > 0:
|
||||
doc.add_text(
|
||||
label=DocItemLabel.PARAGRAPH,
|
||||
parent=inline_equation,
|
||||
text=text_tmp,
|
||||
)
|
||||
|
||||
elif p_style_id in [
|
||||
"Paragraph",
|
||||
"Normal",
|
||||
@ -539,7 +601,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
||||
end_row_offset_idx=row.grid_cols_before + spanned_idx,
|
||||
start_col_offset_idx=col_idx,
|
||||
end_col_offset_idx=col_idx + cell.grid_span,
|
||||
col_header=False,
|
||||
column_header=row.grid_cols_before + row_idx == 0,
|
||||
row_header=False,
|
||||
)
|
||||
data.table_cells.append(table_cell)
|
||||
|
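Note on the inline-equation branch added to `handle_text_elements` in the hunks above: it walks the paragraph text and cuts it at each LaTeX string, in order. The following is a minimal standalone sketch of that splitting step, assuming plain strings in place of docling document items. The function name `split_inline_equations` and the `("text" | "formula", value)` tuples are hypothetical and not part of docling. Unlike the diff's `split(eq, maxsplit=1)[1]`, which assumes every equation occurs in the text, this sketch skips an equation it cannot find.

```python
from typing import List, Tuple


def split_inline_equations(text: str, equations: List[str]) -> List[Tuple[str, str]]:
    """Split `text` into ordered ("text", ...) and ("formula", ...) segments.

    Hypothetical helper: follows the loop in the diff, but returns tuples
    instead of calling doc.add_text().
    """
    segments: List[Tuple[str, str]] = []
    remaining = text
    for eq in equations:
        if not remaining:
            break
        # partition() cuts at the first occurrence of eq only
        before, found, after = remaining.partition(eq)
        if not found:
            # Assumption of this sketch: ignore an equation that is absent
            continue
        if before:
            segments.append(("text", before))
        segments.append(("formula", eq))
        remaining = after
    if remaining:
        # Trailing text after the last equation
        segments.append(("text", remaining))
    return segments


parts = split_inline_equations("Let x^{2} be given and y=1 hold.", ["x^{2}", "y=1"])
```

Each `"text"` segment would correspond to a `DocItemLabel.PARAGRAPH` child and each `"formula"` segment to a `DocItemLabel.FORMULA` child of the inline group.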
@ -999,7 +999,7 @@ class PatentUsptoGrantAps(PatentUspto):
|
||||
parent=self.parents[self.level],
|
||||
)
|
||||
|
||||
last_claim.text += f" {value}" if last_claim.text else value
|
||||
last_claim.text += f" {value.strip()}" if last_claim.text else value.strip()
|
||||
|
||||
elif field == self.Field.CAPTION.value and section in (
|
||||
self.Section.SUMMARY.value,
|
||||
|
@ -210,7 +210,7 @@ def convert(
|
||||
table_mode: Annotated[
|
||||
TableFormerMode,
|
||||
typer.Option(..., help="The mode to use in the table structure model."),
|
||||
] = TableFormerMode.FAST,
|
||||
] = TableFormerMode.ACCURATE,
|
||||
enrich_code: Annotated[
|
||||
bool,
|
||||
typer.Option(..., help="Enable the code enrichment model in the pipeline."),
|
||||
|
@ -121,7 +121,7 @@ def download(
|
||||
"Using the CLI:",
|
||||
f"`docling --artifacts-path={output_dir} FILE`",
|
||||
"\n",
|
||||
"Using Python: see the documentation at <https://ds4sd.github.io/docling/usage>.",
|
||||
"Using Python: see the documentation at <https://docling-project.github.io/docling/usage>.",
|
||||
)
|
||||
|
||||
|
||||
|
@ -99,7 +99,7 @@ class TableStructureOptions(BaseModel):
|
||||
# are merged across table columns.
|
||||
# False: Let table structure model define the text cells, ignore PDF cells.
|
||||
)
|
||||
mode: TableFormerMode = TableFormerMode.FAST
|
||||
mode: TableFormerMode = TableFormerMode.ACCURATE
|
||||
|
||||
|
||||
class OcrOptions(BaseModel):
|
||||
|
@ -1,4 +1,5 @@
|
||||
import re
|
||||
from collections import Counter
|
||||
from pathlib import Path
|
||||
from typing import Iterable, List, Literal, Optional, Tuple, Union
|
||||
|
||||
@ -11,7 +12,7 @@ from docling_core.types.doc import (
|
||||
TextItem,
|
||||
)
|
||||
from docling_core.types.doc.labels import CodeLanguageLabel
|
||||
from PIL import Image
|
||||
from PIL import Image, ImageOps
|
||||
from pydantic import BaseModel
|
||||
|
||||
from docling.datamodel.base_models import ItemAndImageEnrichmentElement
|
||||
@ -65,7 +66,7 @@ class CodeFormulaModel(BaseItemAndImageEnrichmentModel):
|
||||
_model_repo_folder = "ds4sd--CodeFormula"
|
||||
elements_batch_size = 5
|
||||
images_scale = 1.66 # = 120 dpi, aligned with training data resolution
|
||||
expansion_factor = 0.03
|
||||
expansion_factor = 0.18
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
@ -124,7 +125,7 @@ class CodeFormulaModel(BaseItemAndImageEnrichmentModel):
|
||||
repo_id="ds4sd/CodeFormula",
|
||||
force_download=force,
|
||||
local_dir=local_dir,
|
||||
revision="v1.0.1",
|
||||
revision="v1.0.2",
|
||||
)
|
||||
|
||||
return Path(download_path)
|
||||
@ -175,7 +176,7 @@ class CodeFormulaModel(BaseItemAndImageEnrichmentModel):
|
||||
- The second element is the extracted language if a match is found;
|
||||
otherwise, `None`.
|
||||
"""
|
||||
pattern = r"^<_([^>]+)_>\s*(.*)"
|
||||
pattern = r"^<_([^_>]+)_>\s(.*)"
|
||||
match = re.match(pattern, input_string, flags=re.DOTALL)
|
||||
if match:
|
||||
language = str(match.group(1)) # the captured programming language
|
||||
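Note on the pattern change in the hunk above: the new expression no longer allows underscores inside the language tag, and it consumes exactly one whitespace character after the tag instead of all of it, so leading indentation of the code survives. The following is a small sketch comparing the two patterns. The helper name `extract_language` and the sample string are illustrative assumptions, and the fallback of returning the input unchanged is assumed from the docstring fragment shown in the hunk.

```python
import re
from typing import Optional, Tuple

OLD_PATTERN = r"^<_([^>]+)_>\s*(.*)"   # pattern removed by this commit
NEW_PATTERN = r"^<_([^_>]+)_>\s(.*)"   # pattern added by this commit


def extract_language(pattern: str, input_string: str) -> Tuple[str, Optional[str]]:
    """Return (remainder, language), or (input_string, None) when no tag matches."""
    match = re.match(pattern, input_string, flags=re.DOTALL)
    if match:
        return match.group(2), match.group(1)
    return input_string, None


# Language tag, one separator space, then a line indented by four spaces
sample = "<_Python_>" + " " + "    return x"

old_result = extract_language(OLD_PATTERN, sample)  # indentation is swallowed by \s*
new_result = extract_language(NEW_PATTERN, sample)  # indentation is preserved
```

With the old pattern the remainder is `"return x"`, while the new pattern keeps the four leading spaces.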
@ -206,6 +207,82 @@ class CodeFormulaModel(BaseItemAndImageEnrichmentModel):
|
||||
except ValueError:
|
||||
return CodeLanguageLabel.UNKNOWN
|
||||
|
||||
def _get_most_frequent_edge_color(self, pil_img: Image.Image):
|
||||
"""
|
||||
Compute the most frequent color along the outer edges of a PIL image.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
pil_img : Image.Image
|
||||
A PIL Image in any mode (L, RGB, RGBA, etc.).
|
||||
|
||||
Returns
|
||||
-------
|
||||
(int) or (tuple): The most common edge color as a scalar (for grayscale) or
|
||||
tuple (for RGB/RGBA).
|
||||
"""
|
||||
# Convert to NumPy array for easy pixel access
|
||||
img_np = np.array(pil_img)
|
||||
|
||||
if img_np.ndim == 2:
|
||||
# Grayscale-like image: shape (H, W)
|
||||
# Extract edges: top row, bottom row, left col, right col
|
||||
top = img_np[0, :] # shape (W,)
|
||||
bottom = img_np[-1, :] # shape (W,)
|
||||
left = img_np[:, 0] # shape (H,)
|
||||
right = img_np[:, -1] # shape (H,)
|
||||
|
||||
# Concatenate all edges
|
||||
edges = np.concatenate([top, bottom, left, right])
|
||||
|
||||
# Count frequencies
|
||||
freq = Counter(edges.tolist())
|
||||
most_common_value, _ = freq.most_common(1)[0]
|
||||
return int(most_common_value) # single channel color
|
||||
|
||||
else:
|
||||
# Color image: shape (H, W, C)
|
||||
top = img_np[0, :, :] # shape (W, C)
|
||||
bottom = img_np[-1, :, :] # shape (W, C)
|
||||
left = img_np[:, 0, :] # shape (H, C)
|
||||
right = img_np[:, -1, :] # shape (H, C)
|
||||
|
||||
# Concatenate edges along first axis
|
||||
edges = np.concatenate([top, bottom, left, right], axis=0)
|
||||
|
||||
# Convert each color to a tuple for counting
|
||||
edges_as_tuples = [tuple(pixel) for pixel in edges]
|
||||
freq = Counter(edges_as_tuples)
|
||||
most_common_value, _ = freq.most_common(1)[0]
|
||||
return most_common_value # e.g. (R, G, B) or (R, G, B, A)
|
||||
|
||||
def _pad_with_most_frequent_edge_color(
|
||||
self, img: Union[Image.Image, np.ndarray], padding: Tuple[int, int, int, int]
|
||||
):
|
||||
"""
|
||||
Pads an image (PIL or NumPy array) using the most frequent edge color.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
img : Union[Image.Image, np.ndarray]
|
||||
The original image.
|
||||
padding : tuple
|
||||
Padding (left, top, right, bottom) in pixels.
|
||||
|
||||
Returns
|
||||
-------
|
||||
Image.Image: A new PIL image with the specified padding.
|
||||
"""
|
||||
if isinstance(img, np.ndarray):
|
||||
pil_img = Image.fromarray(img)
|
||||
else:
|
||||
pil_img = img
|
||||
|
||||
most_freq_color = self._get_most_frequent_edge_color(pil_img)
|
||||
|
||||
padded_img = ImageOps.expand(pil_img, border=padding, fill=most_freq_color)
|
||||
return padded_img
|
||||
|
||||
def __call__(
|
||||
self,
|
||||
doc: DoclingDocument,
|
||||
@ -238,7 +315,9 @@ class CodeFormulaModel(BaseItemAndImageEnrichmentModel):
|
||||
assert isinstance(el.item, TextItem)
|
||||
elements.append(el.item)
|
||||
labels.append(el.item.label)
|
||||
images.append(el.image)
|
||||
images.append(
|
||||
self._pad_with_most_frequent_edge_color(el.image, (20, 10, 20, 10))
|
||||
)
|
||||
|
||||
outputs = self.code_formula_model.predict(images, labels)
|
||||
|
||||
|
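Note on the padding helpers added to `CodeFormulaModel` above: the crop is surrounded with the colour that occurs most often on its outer edges before it is passed to the model. The following is a dependency-free sketch of the same idea for a grayscale image stored as nested lists. It stands in for the NumPy and `ImageOps.expand` calls used in the diff, and the names `most_frequent_edge_value` and `pad_with_edge_value` are hypothetical.

```python
from collections import Counter
from typing import List

Grid = List[List[int]]  # grayscale image as rows of pixel values


def most_frequent_edge_value(img: Grid) -> int:
    """Return the most common value on the four outer edges.

    As in the diff, corner pixels are counted twice
    (once in a row edge and once in a column edge).
    """
    top = img[0]
    bottom = img[-1]
    left = [row[0] for row in img]
    right = [row[-1] for row in img]
    counts = Counter(top + bottom + left + right)
    return counts.most_common(1)[0][0]


def pad_with_edge_value(img: Grid, left: int, top: int, right: int, bottom: int) -> Grid:
    """Pad `img` by (left, top, right, bottom) pixels with the dominant edge value."""
    fill = most_frequent_edge_value(img)
    width = len(img[0]) + left + right
    padded = [[fill] * width for _ in range(top)]
    for row in img:
        padded.append([fill] * left + list(row) + [fill] * right)
    padded.extend([fill] * width for _ in range(bottom))
    return padded


page = [
    [255, 255, 255],
    [255, 0, 255],
    [255, 255, 0],
]
padded = pad_with_edge_value(page, left=2, top=1, right=2, bottom=1)
```

The argument order (left, top, right, bottom) follows the `padding` tuple documented in the diff, which `__call__` sets to `(20, 10, 20, 10)`.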
@ -113,7 +113,7 @@ class DocumentPictureClassifier(BaseEnrichmentModel):
|
||||
repo_id="ds4sd/DocumentFigureClassifier",
|
||||
force_download=force,
|
||||
local_dir=local_dir,
|
||||
revision="v1.0.0",
|
||||
revision="v1.0.1",
|
||||
)
|
||||
|
||||
return Path(download_path)
|
||||
|
@ -26,7 +26,7 @@ class OcrMacModel(BaseOcrModel):
|
||||
"ocrmac is not correctly installed. "
|
||||
"Please install it via `pip install ocrmac` to use this OCR engine. "
|
||||
"Alternatively, Docling has support for other OCR engines. See the documentation: "
|
||||
"https://ds4sd.github.io/docling/installation/"
|
||||
"https://docling-project.github.io/docling/installation/"
|
||||
)
|
||||
try:
|
||||
from ocrmac import ocrmac
|
||||
|
@ -95,7 +95,7 @@ class TableStructureModel(BasePageModel):
|
||||
repo_id="ds4sd/docling-models",
|
||||
force_download=force,
|
||||
local_dir=local_dir,
|
||||
revision="v2.1.0",
|
||||
revision="v2.2.0",
|
||||
)
|
||||
|
||||
return Path(download_path)
|
||||
|
@ -31,14 +31,14 @@ class TesseractOcrModel(BaseOcrModel):
|
||||
"Note that tesserocr might have to be manually compiled for working with "
|
||||
"your Tesseract installation. The Docling documentation provides examples for it. "
|
||||
"Alternatively, Docling has support for other OCR engines. See the documentation: "
|
||||
"https://ds4sd.github.io/docling/installation/"
|
||||
"https://docling-project.github.io/docling/installation/"
|
||||
)
|
||||
missing_langs_errmsg = (
|
||||
"tesserocr is not correctly configured. No language models have been detected. "
|
||||
"Please ensure that the TESSDATA_PREFIX envvar points to tesseract languages dir. "
|
||||
"You can find more information how to setup other OCR engines in Docling "
|
||||
"documentation: "
|
||||
"https://ds4sd.github.io/docling/installation/"
|
||||
"https://docling-project.github.io/docling/installation/"
|
||||
)
|
||||
|
||||
try:
|
||||
|
@ -7,7 +7,7 @@ pydantic datatype, which can express several features common to documents, such
|
||||
* Layout information (i.e. bounding boxes) for all items, if available
|
||||
* Provenance information
|
||||
|
||||
The definition of the Pydantic types is implemented in the module `docling_core.types.doc`, more details in [source code definitions](https://github.com/DS4SD/docling-core/tree/main/docling_core/types/doc).
|
||||
The definition of the Pydantic types is implemented in the module `docling_core.types.doc`, more details in [source code definitions](https://github.com/docling-project/docling-core/tree/main/docling_core/types/doc).
|
||||
|
||||
It also brings a set of document construction APIs to build up a `DoclingDocument` from scratch.
|
||||
|
||||
|
@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/backend_xml_rag.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
||||
"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/backend_xml_rag.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -36,7 +36,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This is an example of using [Docling](https://ds4sd.github.io/docling/) for converting structured data (XML) into a unified document\n",
|
||||
"This is an example of using [Docling](https://docling-project.github.io/docling/) for converting structured data (XML) into a unified document\n",
|
||||
"representation format, `DoclingDocument`, and leverage its riched structured content for RAG applications.\n",
|
||||
"\n",
|
||||
"Data used in this example consist of patents from the [United States Patent and Trademark Office (USPTO)](https://www.uspto.gov/) and medical\n",
|
||||
|
@ -103,7 +103,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"> 👉 **NOTE**: As you see above, using the `HybridChunker` can sometimes lead to a warning from the transformers library, however this is a \"false alarm\" — for details check [here](https://ds4sd.github.io/docling/faq/#hybridchunker-triggers-warning-token-indices-sequence-length-is-longer-than-the-specified-maximum-sequence-length-for-this-model)."
|
||||
"> 👉 **NOTE**: As you see above, using the `HybridChunker` can sometimes lead to a warning from the transformers library, however this is a \"false alarm\" — for details check [here](https://docling-project.github.io/docling/faq/#hybridchunker-triggers-warning-token-indices-sequence-length-is-longer-than-the-specified-maximum-sequence-length-for-this-model)."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -321,7 +321,7 @@
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "docling-aMWN2FRM-py3.12",
|
||||
"display_name": "docling-hgXEfXco-py3.12",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
|
@ -36,7 +36,7 @@
|
||||
"## A recipe 🧑🍳 🐥 💚\n",
|
||||
"\n",
|
||||
"This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system using:\n",
|
||||
"- [Docling](https://ds4sd.github.io/docling/) for document parsing and chunking\n",
|
||||
"- [Docling](https://docling-project.github.io/docling/) for document parsing and chunking\n",
|
||||
"- [Azure AI Search](https://azure.microsoft.com/products/ai-services/ai-search/?msockid=0109678bea39665431e37323ebff6723) for vector indexing and retrieval\n",
|
||||
"- [Azure OpenAI](https://azure.microsoft.com/products/ai-services/openai-service?msockid=0109678bea39665431e37323ebff6723) for embeddings and chat completion\n",
|
||||
"\n",
|
||||
|
@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_haystack.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
||||
"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_haystack.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -247,7 +247,7 @@
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.12/site-packages/huggingface_hub/inference/_client.py:2232: FutureWarning: `stop_sequences` is a deprecated argument for `text_generation` task and will be removed in version '0.28.0'. Use `stop` instead.\n",
|
||||
"/Users/pva/work/github.com/docling-project/docling/.venv/lib/python3.12/site-packages/huggingface_hub/inference/_client.py:2232: FutureWarning: `stop_sequences` is a deprecated argument for `text_generation` task and will be removed in version '0.28.0'. Use `stop` instead.\n",
|
||||
" warnings.warn(\n"
|
||||
]
|
||||
}
|
||||
|
@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_langchain.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
||||
"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_langchain.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -168,7 +168,7 @@
|
||||
"source": [
|
||||
"> Note: a message saying `\"Token indices sequence length is longer than the specified\n",
|
||||
"maximum sequence length...\"` can be ignored in this case — details\n",
|
||||
"[here](https://github.com/DS4SD/docling-core/issues/119#issuecomment-2577418826)."
|
||||
"[here](https://github.com/docling-project/docling-core/issues/119#issuecomment-2577418826)."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_llamaindex.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
||||
"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_llamaindex.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"[](https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_weaviate.ipynb)"
|
||||
"[](https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_weaviate.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -29,7 +29,7 @@
|
||||
"\n",
|
||||
"## A recipe 🧑🍳 🐥 💚\n",
|
||||
"\n",
|
||||
"This is a code recipe that uses [Weaviate](https://weaviate.io/) to perform RAG over PDF documents parsed by [Docling](https://ds4sd.github.io/docling/).\n",
|
||||
"This is a code recipe that uses [Weaviate](https://weaviate.io/) to perform RAG over PDF documents parsed by [Docling](https://docling-project.github.io/docling/).\n",
|
||||
"\n",
|
||||
"In this notebook, we accomplish the following:\n",
|
||||
"* Parse the top machine learning papers on [arXiv](https://arxiv.org/) using Docling\n",
|
||||
|
@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/hybrid_rag_qdrant\n",
|
||||
"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/hybrid_rag_qdrant\n",
|
||||
".ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
||||
]
|
||||
},
|
||||
@ -109,7 +109,7 @@
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.12/site-packages/huggingface_hub/utils/tqdm.py:155: UserWarning: Cannot enable progress bars: environment variable `HF_HUB_DISABLE_PROGRESS_BARS=1` is set and has priority.\n",
|
||||
"/Users/pva/work/github.com/docling-project/docling/.venv/lib/python3.12/site-packages/huggingface_hub/utils/tqdm.py:155: UserWarning: Cannot enable progress bars: environment variable `HF_HUB_DISABLE_PROGRESS_BARS=1` is set and has priority.\n",
|
||||
" warnings.warn(\n"
|
||||
]
|
||||
}
|
||||
|
@ -1,6 +1,6 @@
|
||||
# FAQ
|
||||
|
||||
This is a collection of FAQ collected from the user questions on <https://github.com/DS4SD/docling/discussions>.
|
||||
This is a collection of FAQ collected from the user questions on <https://github.com/docling-project/docling/discussions>.
|
||||
|
||||
|
||||
??? question "Is Python 3.13 supported?"
|
||||
@ -41,7 +41,7 @@ This is a collection of FAQ collected from the user questions on <https://github
|
||||
]
|
||||
```
|
||||
|
||||
Source: Issue [#283](https://github.com/DS4SD/docling/issues/283#issuecomment-2465035868)
|
||||
Source: Issue [#283](https://github.com/docling-project/docling/issues/283#issuecomment-2465035868)
|
||||
|
||||
|
||||
??? question "Are text styles (bold, underline, etc) supported?"
|
||||
@ -74,7 +74,7 @@ This is a collection of FAQ collected from the user questions on <https://github
|
||||
)
|
||||
```
|
||||
|
||||
Source: Issue [#326](https://github.com/DS4SD/docling/issues/326)
|
||||
Source: Issue [#326](https://github.com/docling-project/docling/issues/326)
|
||||
|
||||
|
||||
??? question " Which model weights are needed to run Docling?"
|
||||
@ -84,7 +84,7 @@ This is a collection of FAQ collected from the user questions on <https://github
|
||||
|
||||
For processing PDF documents, Docling requires the model weights from <https://huggingface.co/ds4sd/docling-models>.
|
||||
|
||||
When OCR is enabled, some engines also require model artifacts. For example EasyOCR, for which Docling has [special pipeline options](https://github.com/DS4SD/docling/blob/main/docling/datamodel/pipeline_options.py#L68) to control the runtime behavior.
|
||||
When OCR is enabled, some engines also require model artifacts. For example EasyOCR, for which Docling has [special pipeline options](https://github.com/docling-project/docling/blob/main/docling/datamodel/pipeline_options.py#L68) to control the runtime behavior.
|
||||
|
||||
|
||||
??? question "SSL error downloading model weights"
|
||||
@ -174,6 +174,6 @@ This is a collection of FAQ collected from the user questions on <https://github
|
||||
print(f"Model max length: {tokenizer.model_max_length}")
|
||||
```
|
||||
|
||||
Also see [docling#725](https://github.com/DS4SD/docling/issues/725).
|
||||
Also see [docling#725](https://github.com/docling-project/docling/issues/725).
|
||||
|
||||
Source: Issue [docling-core#119](https://github.com/DS4SD/docling-core/issues/119)
|
||||
Source: Issue [docling-core#119](https://github.com/docling-project/docling-core/issues/119)
|
||||
|
@ -11,7 +11,7 @@
|
||||
[](https://pycqa.github.io/isort/)
|
||||
[](https://pydantic.dev)
|
||||
[](https://github.com/pre-commit/pre-commit)
|
||||
[](https://opensource.org/licenses/MIT)
|
||||
[](https://opensource.org/licenses/MIT)
|
||||
[](https://pepy.tech/projects/docling)
|
||||
|
||||
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
|
||||
|
@ -5,7 +5,7 @@ Docling is available as a converter in [Haystack](https://haystack.deepset.ai/):
|
||||
- 🧑🏽🍳 [Docling Haystack integration example][example]
|
||||
- 📦 [Docling Haystack integration PyPI][pypi]
|
||||
|
||||
[github]: https://github.com/DS4SD/docling-haystack
|
||||
[github]: https://github.com/docling-project/docling-haystack
|
||||
[docs]: https://haystack.deepset.ai/integrations/docling
|
||||
[pypi]: https://pypi.org/project/docling-haystack
|
||||
[example]: ../examples/rag_haystack.ipynb
|
||||
|
@ -8,7 +8,7 @@ To get started, check out the [step-by-step guide in LangChain][guide].
|
||||
- 📦 [LangChain Docling integration PyPI][pypi]
|
||||
|
||||
[docs]: https://python.langchain.com/docs/integrations/providers/docling/
|
||||
[github]: https://github.com/DS4SD/docling-langchain
|
||||
[github]: https://github.com/docling-project/docling-langchain
|
||||
[guide]: https://python.langchain.com/docs/integrations/document_loaders/docling/
|
||||
[example]: ../examples/rag_langchain.ipynb
|
||||
[pypi]: https://pypi.org/project/langchain-docling/
|
||||
|
@ -135,7 +135,7 @@ doc_converter = DocumentConverter(
|
||||
)
|
||||
```
|
||||
|
||||
Since docling 1.16.0: You can control which TableFormer mode you want to use. Choose between `TableFormerMode.FAST` (default) and `TableFormerMode.ACCURATE` (better, but slower) to receive better quality with difficult table structures.
|
||||
Since docling 1.16.0: You can control which TableFormer mode you want to use. Choose between `TableFormerMode.FAST` (faster but less accurate) and `TableFormerMode.ACCURATE` (default) to receive better quality with difficult table structures.
|
||||
|
||||
```python
|
||||
from docling.datamodel.base_models import InputFormat
|
||||
|
@ -1,7 +1,7 @@
|
||||
site_name: Docling
|
||||
site_url: https://ds4sd.github.io/docling/
|
||||
repo_name: DS4SD/docling
|
||||
repo_url: https://github.com/DS4SD/docling
|
||||
site_url: https://docling-project.github.io/docling/
|
||||
repo_name: docling-project/docling
|
||||
repo_url: https://github.com/docling-project/docling
|
||||
|
||||
theme:
|
||||
name: material
|
||||
|
150
poetry.lock
generated
@ -1,4 +1,4 @@
|
||||
# This file is automatically @generated by Poetry 1.8.4 and should not be changed by hand.
|
||||
# This file is automatically @generated by Poetry 1.8.5 and should not be changed by hand.
|
||||
|
||||
[[package]]
|
||||
name = "accelerate"
|
||||
@ -33,13 +33,13 @@ testing = ["bitsandbytes", "datasets", "diffusers", "evaluate", "parameterized",
|
||||
|
||||
[[package]]
|
||||
name = "aiohappyeyeballs"
|
||||
version = "2.4.6"
|
||||
version = "2.4.8"
|
||||
description = "Happy Eyeballs for asyncio"
|
||||
optional = false
|
||||
python-versions = ">=3.9"
|
||||
files = [
|
||||
{file = "aiohappyeyeballs-2.4.6-py3-none-any.whl", hash = "sha256:147ec992cf873d74f5062644332c539fcd42956dc69453fe5204195e560517e1"},
|
||||
{file = "aiohappyeyeballs-2.4.6.tar.gz", hash = "sha256:9b05052f9042985d32ecbe4b59a77ae19c006a78f1344d7fdad69d28ded3d0b0"},
|
||||
{file = "aiohappyeyeballs-2.4.8-py3-none-any.whl", hash = "sha256:6cac4f5dd6e34a9644e69cf9021ef679e4394f54e58a183056d12009e42ea9e3"},
|
||||
{file = "aiohappyeyeballs-2.4.8.tar.gz", hash = "sha256:19728772cb12263077982d2f55453babd8bec6a052a926cd5c0c42796da8bf62"},
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@ -218,8 +218,8 @@ files = [
|
||||
lazy-object-proxy = ">=1.4.0"
|
||||
typing-extensions = {version = ">=4.0.0", markers = "python_version < \"3.11\""}
|
||||
wrapt = [
|
||||
{version = ">=1.14,<2", markers = "python_version >= \"3.11\""},
|
||||
{version = ">=1.11,<2", markers = "python_version < \"3.11\""},
|
||||
{version = ">=1.14,<2", markers = "python_version >= \"3.11\""},
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@ -311,6 +311,24 @@ files = [
|
||||
docs = ["furo", "jaraco.packaging (>=9.3)", "rst.linker (>=1.9)", "sphinx (>=3.5)", "sphinx-lint"]
|
||||
testing = ["jaraco.test", "pytest (!=8.0.*)", "pytest (>=6,!=8.1.*)", "pytest-checkdocs (>=2.4)", "pytest-cov", "pytest-enabler (>=2.2)"]
|
||||
|
||||
[[package]]
|
||||
name = "backrefs"
|
||||
version = "5.8"
|
||||
description = "A wrapper around re and regex that adds additional back references."
|
||||
optional = false
|
||||
python-versions = ">=3.9"
|
||||
files = [
|
||||
{file = "backrefs-5.8-py310-none-any.whl", hash = "sha256:c67f6638a34a5b8730812f5101376f9d41dc38c43f1fdc35cb54700f6ed4465d"},
|
||||
{file = "backrefs-5.8-py311-none-any.whl", hash = "sha256:2e1c15e4af0e12e45c8701bd5da0902d326b2e200cafcd25e49d9f06d44bb61b"},
|
||||
{file = "backrefs-5.8-py312-none-any.whl", hash = "sha256:bbef7169a33811080d67cdf1538c8289f76f0942ff971222a16034da88a73486"},
|
||||
{file = "backrefs-5.8-py313-none-any.whl", hash = "sha256:e3a63b073867dbefd0536425f43db618578528e3896fb77be7141328642a1585"},
|
||||
{file = "backrefs-5.8-py39-none-any.whl", hash = "sha256:a66851e4533fb5b371aa0628e1fee1af05135616b86140c9d787a2ffdf4b8fdc"},
|
||||
{file = "backrefs-5.8.tar.gz", hash = "sha256:2cab642a205ce966af3dd4b38ee36009b31fa9502a35fd61d59ccc116e40a6bd"},
|
||||
]
|
||||
|
||||
[package.extras]
|
||||
extras = ["regex"]
|
||||
|
||||
[[package]]
|
||||
name = "beautifulsoup4"
|
||||
version = "4.13.3"
|
||||
@ -852,13 +870,13 @@ files = [
|
||||
|
||||
[[package]]
|
||||
name = "docling-core"
|
||||
version = "2.20.0"
|
||||
version = "2.22.0"
|
||||
description = "A python library to define and validate data types in Docling."
|
||||
optional = false
|
||||
python-versions = "<4.0,>=3.9"
|
||||
files = [
|
||||
{file = "docling_core-2.20.0-py3-none-any.whl", hash = "sha256:72f50fce277b7bb51f4134f443240c041582184305c3bcaabdea13fc5550f160"},
|
||||
{file = "docling_core-2.20.0.tar.gz", hash = "sha256:9733581c15f5a9b5e3a6cb74fa995cc4078ff16668007f86c5f75d1ea9180d7f"},
|
||||
{file = "docling_core-2.22.0-py3-none-any.whl", hash = "sha256:d74d351024d016f46a09f171fb9d2d78809b132e18e25176af517ac4203c858c"},
|
||||
{file = "docling_core-2.22.0.tar.gz", hash = "sha256:5e4bf15884560a5dc66482206f875d152701bb809f0ed52bbbe86133e0d559e2"},
|
||||
]
|
||||
|
||||
[package.dependencies]
|
||||
@ -880,13 +898,13 @@ chunking = ["semchunk (>=2.2.0,<3.0.0)", "transformers (>=4.34.0,<5.0.0)"]
|
||||
|
||||
[[package]]
|
||||
name = "docling-ibm-models"
|
||||
version = "3.4.0"
|
||||
version = "3.4.1"
|
||||
description = "This package contains the AI models used by the Docling PDF conversion package"
|
||||
optional = false
|
||||
python-versions = "<4.0,>=3.9"
|
||||
files = [
|
||||
{file = "docling_ibm_models-3.4.0-py3-none-any.whl", hash = "sha256:186517ff1f76e76113600fa1e5a699927325081a8013fdd5d0551121c2e34190"},
|
||||
{file = "docling_ibm_models-3.4.0.tar.gz", hash = "sha256:fb79beeb07d1bb9bc8acf9d0a44643cd7ce1910aa418cd685e2e477b13eeafee"},
|
||||
{file = "docling_ibm_models-3.4.1-py3-none-any.whl", hash = "sha256:c3582c99dddfa3f0eafcf80cf1267fd8efa39c4a74cc7a88f9dd49684fac2986"},
|
||||
{file = "docling_ibm_models-3.4.1.tar.gz", hash = "sha256:093b4dff2ea284a4953c3aa009e29945208b8d389b94fb14940a03a93f673e96"},
|
||||
]
|
||||
|
||||
[package.dependencies]
|
||||
@ -1331,13 +1349,13 @@ test = ["coverage[toml]", "ddt (>=1.1.1,!=1.4.3)", "mock", "mypy", "pre-commit",
|
||||
|
||||
[[package]]
|
||||
name = "griffe"
|
||||
version = "1.5.7"
|
||||
version = "1.6.0"
|
||||
description = "Signatures for entire Python programs. Extract the structure, the frame, the skeleton of your project, to generate API documentation or find breaking changes in your API."
|
||||
optional = false
|
||||
python-versions = ">=3.9"
|
||||
files = [
|
||||
{file = "griffe-1.5.7-py3-none-any.whl", hash = "sha256:4af8ec834b64de954d447c7b6672426bb145e71605c74a4e22d510cc79fe7d8b"},
|
||||
{file = "griffe-1.5.7.tar.gz", hash = "sha256:465238c86deaf1137761f700fb343edd8ffc846d72f6de43c3c345ccdfbebe92"},
|
||||
{file = "griffe-1.6.0-py3-none-any.whl", hash = "sha256:9f1dfe035d4715a244ed2050dfbceb05b1f470809ed4f6bb10ece5a7302f8dd1"},
|
||||
{file = "griffe-1.6.0.tar.gz", hash = "sha256:eb5758088b9c73ad61c7ac014f3cdfb4c57b5c2fcbfca69996584b702aefa354"},
|
||||
]
|
||||
|
||||
[package.dependencies]
|
||||
@ -1818,18 +1836,18 @@ testing = ["Django", "attrs", "colorama", "docopt", "pytest (<9.0.0)"]
|
||||
|
||||
[[package]]
|
||||
name = "jeepney"
|
||||
version = "0.8.0"
|
||||
version = "0.9.0"
|
||||
description = "Low-level, pure Python DBus protocol wrapper."
|
||||
optional = false
|
||||
python-versions = ">=3.7"
|
||||
files = [
|
||||
{file = "jeepney-0.8.0-py3-none-any.whl", hash = "sha256:c0a454ad016ca575060802ee4d590dd912e35c122fa04e70306de3d076cce755"},
|
||||
{file = "jeepney-0.8.0.tar.gz", hash = "sha256:5efe48d255973902f6badc3ce55e2aa6c5c3b3bc642059ef3a91247bcfcc5806"},
|
||||
{file = "jeepney-0.9.0-py3-none-any.whl", hash = "sha256:97e5714520c16fc0a45695e5365a2e11b81ea79bba796e26f9f1d178cb182683"},
|
||||
{file = "jeepney-0.9.0.tar.gz", hash = "sha256:cf0e9e845622b81e4a28df94c40345400256ec608d0e55bb8a3feaa9163f5732"},
|
||||
]
|
||||
|
||||
[package.extras]
|
||||
test = ["async-timeout", "pytest", "pytest-asyncio (>=0.17)", "pytest-trio", "testpath", "trio"]
|
||||
trio = ["async_generator", "trio"]
|
||||
trio = ["trio"]
|
||||
|
||||
[[package]]
|
||||
name = "jinja2"
|
||||
@ -2715,17 +2733,18 @@ pygments = ">2.12.0"
|
||||
|
||||
[[package]]
|
||||
name = "mkdocs-material"
|
||||
version = "9.6.5"
|
||||
version = "9.6.7"
|
||||
description = "Documentation that simply works"
|
||||
optional = false
|
||||
python-versions = ">=3.8"
|
||||
files = [
|
||||
{file = "mkdocs_material-9.6.5-py3-none-any.whl", hash = "sha256:aad3e6fb860c20870f75fb2a69ef901f1be727891e41adb60b753efcae19453b"},
|
||||
{file = "mkdocs_material-9.6.5.tar.gz", hash = "sha256:b714679a8c91b0ffe2188e11ed58c44d2523e9c2ae26a29cc652fa7478faa21f"},
|
||||
{file = "mkdocs_material-9.6.7-py3-none-any.whl", hash = "sha256:8a159e45e80fcaadd9fbeef62cbf928569b93df954d4dc5ba76d46820caf7b47"},
|
||||
{file = "mkdocs_material-9.6.7.tar.gz", hash = "sha256:3e2c1fceb9410056c2d91f334a00cdea3215c28750e00c691c1e46b2a33309b4"},
|
||||
]
|
||||
|
||||
[package.dependencies]
|
||||
babel = ">=2.10,<3.0"
|
||||
backrefs = ">=5.7.post1,<6.0"
|
||||
colorama = ">=0.4,<1.0"
|
||||
jinja2 = ">=3.0,<4.0"
|
||||
markdown = ">=3.2,<4.0"
|
||||
@ -2734,7 +2753,6 @@ mkdocs-material-extensions = ">=1.3,<2.0"
|
||||
paginate = ">=0.5,<1.0"
|
||||
pygments = ">=2.16,<3.0"
|
||||
pymdown-extensions = ">=10.2,<11.0"
|
||||
regex = ">=2022.4"
|
||||
requests = ">=2.26,<3.0"
|
||||
|
||||
[package.extras]
|
||||
@ -2822,8 +2840,8 @@ files = [
|
||||
|
||||
[package.dependencies]
|
||||
multiprocess = [
|
||||
{version = ">=0.70.15", optional = true, markers = "python_version >= \"3.11\" and extra == \"dill\""},
|
||||
{version = "*", optional = true, markers = "python_version < \"3.11\" and extra == \"dill\""},
|
||||
{version = ">=0.70.15", optional = true, markers = "python_version >= \"3.11\" and extra == \"dill\""},
|
||||
]
|
||||
pygments = ">=2.0"
|
||||
pywin32 = {version = ">=301", markers = "platform_system == \"Windows\""}
|
||||
@ -3832,10 +3850,10 @@ files = [
|
||||
|
||||
[package.dependencies]
|
||||
numpy = [
|
||||
{version = ">=1.26.0", markers = "python_version >= \"3.12\""},
|
||||
{version = ">=1.23.5", markers = "python_version >= \"3.11\" and python_version < \"3.12\""},
|
||||
{version = ">=1.21.4", markers = "python_version >= \"3.10\" and platform_system == \"Darwin\" and python_version < \"3.11\""},
|
||||
{version = ">=1.21.2", markers = "platform_system != \"Darwin\" and python_version >= \"3.10\" and python_version < \"3.11\""},
|
||||
{version = ">=1.23.5", markers = "python_version >= \"3.11\" and python_version < \"3.12\""},
|
||||
{version = ">=1.26.0", markers = "python_version >= \"3.12\""},
|
||||
{version = ">=1.21.0", markers = "python_version == \"3.9\" and platform_system == \"Darwin\" and platform_machine == \"arm64\""},
|
||||
{version = ">=1.19.3", markers = "platform_system == \"Linux\" and platform_machine == \"aarch64\" and python_version >= \"3.8\" and python_version < \"3.10\" or python_version > \"3.9\" and python_version < \"3.10\" or python_version >= \"3.9\" and platform_system != \"Darwin\" and python_version < \"3.10\" or python_version >= \"3.9\" and platform_machine != \"arm64\" and python_version < \"3.10\""},
|
||||
]
|
||||
@ -3858,10 +3876,10 @@ files = [
|
||||
|
||||
[package.dependencies]
|
||||
numpy = [
|
||||
{version = ">=1.26.0", markers = "python_version >= \"3.12\""},
|
||||
{version = ">=1.23.5", markers = "python_version >= \"3.11\" and python_version < \"3.12\""},
|
||||
{version = ">=1.21.4", markers = "python_version >= \"3.10\" and platform_system == \"Darwin\" and python_version < \"3.11\""},
|
||||
{version = ">=1.21.2", markers = "platform_system != \"Darwin\" and python_version >= \"3.10\" and python_version < \"3.11\""},
|
||||
{version = ">=1.23.5", markers = "python_version >= \"3.11\" and python_version < \"3.12\""},
|
||||
{version = ">=1.26.0", markers = "python_version >= \"3.12\""},
|
||||
{version = ">=1.21.0", markers = "python_version == \"3.9\" and platform_system == \"Darwin\" and platform_machine == \"arm64\""},
|
||||
{version = ">=1.19.3", markers = "platform_system == \"Linux\" and platform_machine == \"aarch64\" and python_version >= \"3.8\" and python_version < \"3.10\" or python_version > \"3.9\" and python_version < \"3.10\" or python_version >= \"3.9\" and platform_system != \"Darwin\" and python_version < \"3.10\" or python_version >= \"3.9\" and platform_machine != \"arm64\" and python_version < \"3.10\""},
|
||||
]
|
||||
@ -4047,9 +4065,9 @@ files = [
|
||||
|
||||
[package.dependencies]
|
||||
numpy = [
|
||||
{version = ">=1.26.0", markers = "python_version >= \"3.12\""},
|
||||
{version = ">=1.23.2", markers = "python_version == \"3.11\""},
|
||||
{version = ">=1.22.4", markers = "python_version < \"3.11\""},
|
||||
{version = ">=1.23.2", markers = "python_version == \"3.11\""},
|
||||
{version = ">=1.26.0", markers = "python_version >= \"3.12\""},
|
||||
]
|
||||
python-dateutil = ">=2.8.2"
|
||||
pytz = ">=2020.1"
|
||||
@ -4755,13 +4773,13 @@ typing-extensions = ">=4.6.0,<4.7.0 || >4.7.0"
|
||||
|
||||
[[package]]
|
||||
name = "pydantic-settings"
|
||||
version = "2.8.0"
|
||||
version = "2.8.1"
|
||||
description = "Settings management using Pydantic"
|
||||
optional = false
|
||||
python-versions = ">=3.8"
|
||||
files = [
|
||||
{file = "pydantic_settings-2.8.0-py3-none-any.whl", hash = "sha256:c782c7dc3fb40e97b238e713c25d26f64314aece2e91abcff592fcac15f71820"},
|
||||
{file = "pydantic_settings-2.8.0.tar.gz", hash = "sha256:88e2ca28f6e68ea102c99c3c401d6c9078e68a5df600e97b43891c34e089500a"},
|
||||
{file = "pydantic_settings-2.8.1-py3-none-any.whl", hash = "sha256:81942d5ac3d905f7f3ee1a70df5dfb62d5569c12f51a5a647defc1c3d9ee2e9c"},
|
||||
{file = "pydantic_settings-2.8.1.tar.gz", hash = "sha256:d5c663dfbe9db9d5e1c646b2e161da12f0d734d422ee56f567d0ea2cee4e8585"},
|
||||
]
|
||||
|
||||
[package.dependencies]
|
||||
@ -4798,6 +4816,16 @@ files = [
|
||||
[package.extras]
|
||||
windows-terminal = ["colorama (>=0.4.6)"]
|
||||
|
||||
[[package]]
|
||||
name = "pylatexenc"
|
||||
version = "2.10"
|
||||
description = "Simple LaTeX parser providing latex-to-unicode and unicode-to-latex conversion"
|
||||
optional = false
|
||||
python-versions = "*"
|
||||
files = [
|
||||
{file = "pylatexenc-2.10.tar.gz", hash = "sha256:3dd8fd84eb46dc30bee1e23eaab8d8fb5a7f507347b23e5f38ad9675c84f40d3"},
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "pylint"
|
||||
version = "2.17.7"
|
||||
@ -4813,8 +4841,8 @@ files = [
|
||||
astroid = ">=2.15.8,<=2.17.0-dev0"
|
||||
colorama = {version = ">=0.4.5", markers = "sys_platform == \"win32\""}
|
||||
dill = [
|
||||
{version = ">=0.3.6", markers = "python_version >= \"3.11\""},
|
||||
{version = ">=0.2", markers = "python_version < \"3.11\""},
|
||||
{version = ">=0.3.6", markers = "python_version >= \"3.11\""},
|
||||
]
|
||||
isort = ">=4.2.5,<6"
|
||||
mccabe = ">=0.6,<0.8"
|
||||
@ -5897,26 +5925,26 @@ files = [
|
||||
|
||||
[[package]]
|
||||
name = "safetensors"
|
||||
version = "0.5.2"
|
||||
version = "0.5.3"
|
||||
description = ""
|
||||
optional = false
|
||||
python-versions = ">=3.7"
|
||||
files = [
|
||||
{file = "safetensors-0.5.2-cp38-abi3-macosx_10_12_x86_64.whl", hash = "sha256:45b6092997ceb8aa3801693781a71a99909ab9cc776fbc3fa9322d29b1d3bef2"},
|
||||
{file = "safetensors-0.5.2-cp38-abi3-macosx_11_0_arm64.whl", hash = "sha256:6d0d6a8ee2215a440e1296b843edf44fd377b055ba350eaba74655a2fe2c4bae"},
|
||||
{file = "safetensors-0.5.2-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:86016d40bcaa3bcc9a56cd74d97e654b5f4f4abe42b038c71e4f00a089c4526c"},
|
||||
{file = "safetensors-0.5.2-cp38-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:990833f70a5f9c7d3fc82c94507f03179930ff7d00941c287f73b6fcbf67f19e"},
|
||||
{file = "safetensors-0.5.2-cp38-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:3dfa7c2f3fe55db34eba90c29df94bcdac4821043fc391cb5d082d9922013869"},
|
||||
{file = "safetensors-0.5.2-cp38-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:46ff2116150ae70a4e9c490d2ab6b6e1b1b93f25e520e540abe1b81b48560c3a"},
|
||||
{file = "safetensors-0.5.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3ab696dfdc060caffb61dbe4066b86419107a24c804a4e373ba59be699ebd8d5"},
|
||||
{file = "safetensors-0.5.2-cp38-abi3-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:03c937100f38c9ff4c1507abea9928a6a9b02c9c1c9c3609ed4fb2bf413d4975"},
|
||||
{file = "safetensors-0.5.2-cp38-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:a00e737948791b94dad83cf0eafc09a02c4d8c2171a239e8c8572fe04e25960e"},
|
||||
{file = "safetensors-0.5.2-cp38-abi3-musllinux_1_2_armv7l.whl", hash = "sha256:d3a06fae62418ec8e5c635b61a8086032c9e281f16c63c3af46a6efbab33156f"},
|
||||
{file = "safetensors-0.5.2-cp38-abi3-musllinux_1_2_i686.whl", hash = "sha256:1506e4c2eda1431099cebe9abf6c76853e95d0b7a95addceaa74c6019c65d8cf"},
|
||||
{file = "safetensors-0.5.2-cp38-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:5c5b5d9da594f638a259fca766046f44c97244cc7ab8bef161b3e80d04becc76"},
|
||||
{file = "safetensors-0.5.2-cp38-abi3-win32.whl", hash = "sha256:fe55c039d97090d1f85277d402954dd6ad27f63034fa81985a9cc59655ac3ee2"},
|
||||
{file = "safetensors-0.5.2-cp38-abi3-win_amd64.whl", hash = "sha256:78abdddd03a406646107f973c7843276e7b64e5e32623529dc17f3d94a20f589"},
|
||||
{file = "safetensors-0.5.2.tar.gz", hash = "sha256:cb4a8d98ba12fa016f4241932b1fc5e702e5143f5374bba0bbcf7ddc1c4cf2b8"},
|
||||
{file = "safetensors-0.5.3-cp38-abi3-macosx_10_12_x86_64.whl", hash = "sha256:bd20eb133db8ed15b40110b7c00c6df51655a2998132193de2f75f72d99c7073"},
|
||||
{file = "safetensors-0.5.3-cp38-abi3-macosx_11_0_arm64.whl", hash = "sha256:21d01c14ff6c415c485616b8b0bf961c46b3b343ca59110d38d744e577f9cce7"},
|
||||
{file = "safetensors-0.5.3-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:11bce6164887cd491ca75c2326a113ba934be596e22b28b1742ce27b1d076467"},
|
||||
{file = "safetensors-0.5.3-cp38-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:4a243be3590bc3301c821da7a18d87224ef35cbd3e5f5727e4e0728b8172411e"},
|
||||
{file = "safetensors-0.5.3-cp38-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:8bd84b12b1670a6f8e50f01e28156422a2bc07fb16fc4e98bded13039d688a0d"},
|
||||
{file = "safetensors-0.5.3-cp38-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:391ac8cab7c829452175f871fcaf414aa1e292b5448bd02620f675a7f3e7abb9"},
|
||||
{file = "safetensors-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:cead1fa41fc54b1e61089fa57452e8834f798cb1dc7a09ba3524f1eb08e0317a"},
|
||||
{file = "safetensors-0.5.3-cp38-abi3-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:1077f3e94182d72618357b04b5ced540ceb71c8a813d3319f1aba448e68a770d"},
|
||||
{file = "safetensors-0.5.3-cp38-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:799021e78287bac619c7b3f3606730a22da4cda27759ddf55d37c8db7511c74b"},
|
||||
{file = "safetensors-0.5.3-cp38-abi3-musllinux_1_2_armv7l.whl", hash = "sha256:df26da01aaac504334644e1b7642fa000bfec820e7cef83aeac4e355e03195ff"},
|
||||
{file = "safetensors-0.5.3-cp38-abi3-musllinux_1_2_i686.whl", hash = "sha256:32c3ef2d7af8b9f52ff685ed0bc43913cdcde135089ae322ee576de93eae5135"},
|
||||
{file = "safetensors-0.5.3-cp38-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:37f1521be045e56fc2b54c606d4455573e717b2d887c579ee1dbba5f868ece04"},
|
||||
{file = "safetensors-0.5.3-cp38-abi3-win32.whl", hash = "sha256:cfc0ec0846dcf6763b0ed3d1846ff36008c6e7290683b61616c4b040f6a54ace"},
|
||||
{file = "safetensors-0.5.3-cp38-abi3-win_amd64.whl", hash = "sha256:836cbbc320b47e80acd40e44c8682db0e8ad7123209f69b093def21ec7cafd11"},
|
||||
{file = "safetensors-0.5.3.tar.gz", hash = "sha256:b6b0d6ecacec39a4fdd99cc19f4576f5219ce858e6fd8dbe7609df0b8dc56965"},
|
||||
]
|
||||
|
||||
[package.dependencies]
|
||||
@ -6213,13 +6241,13 @@ train = ["accelerate (>=0.20.3)", "datasets"]
|
||||
|
||||
[[package]]
|
||||
name = "setuptools"
|
||||
version = "75.8.1"
|
||||
version = "75.8.2"
|
||||
description = "Easily download, build, install, upgrade, and uninstall Python packages"
|
||||
optional = false
|
||||
python-versions = ">=3.9"
|
||||
files = [
|
||||
{file = "setuptools-75.8.1-py3-none-any.whl", hash = "sha256:3bc32c0b84c643299ca94e77f834730f126efd621de0cc1de64119e0e17dab1f"},
|
||||
{file = "setuptools-75.8.1.tar.gz", hash = "sha256:65fb779a8f28895242923582eadca2337285f0891c2c9e160754df917c3d2530"},
|
||||
{file = "setuptools-75.8.2-py3-none-any.whl", hash = "sha256:558e47c15f1811c1fa7adbd0096669bf76c1d3f433f58324df69f3f5ecac4e8f"},
|
||||
{file = "setuptools-75.8.2.tar.gz", hash = "sha256:4880473a969e5f23f2a2be3646b2dfd84af9028716d398e46192f84bc36900d2"},
|
||||
]
|
||||
|
||||
[package.extras]
|
||||
@ -7217,13 +7245,13 @@ files = [
|
||||
|
||||
[[package]]
|
||||
name = "types-requests"
|
||||
version = "2.32.0.20241016"
|
||||
version = "2.32.0.20250301"
|
||||
description = "Typing stubs for requests"
|
||||
optional = false
|
||||
python-versions = ">=3.8"
|
||||
python-versions = ">=3.9"
|
||||
files = [
|
||||
{file = "types-requests-2.32.0.20241016.tar.gz", hash = "sha256:0d9cad2f27515d0e3e3da7134a1b6f28fb97129d86b867f24d9c726452634d95"},
|
||||
{file = "types_requests-2.32.0.20241016-py3-none-any.whl", hash = "sha256:4195d62d6d3e043a4eaaf08ff8a62184584d2e8684e9d2aa178c7915a7da3747"},
|
||||
{file = "types_requests-2.32.0.20250301-py3-none-any.whl", hash = "sha256:0003e0124e2cbefefb88222ff822b48616af40c74df83350f599a650c8de483b"},
|
||||
{file = "types_requests-2.32.0.20250301.tar.gz", hash = "sha256:3d909dc4eaab159c0d964ebe8bfa326a7afb4578d8706408d417e17d61b0c500"},
|
||||
]
|
||||
|
||||
[package.dependencies]
|
||||
@ -7231,13 +7259,13 @@ urllib3 = ">=2"
|
||||
|
||||
[[package]]
|
||||
name = "types-tqdm"
|
||||
version = "4.67.0.20241221"
|
||||
version = "4.67.0.20250301"
|
||||
description = "Typing stubs for tqdm"
|
||||
optional = false
|
||||
python-versions = ">=3.8"
|
||||
python-versions = ">=3.9"
|
||||
files = [
|
||||
{file = "types_tqdm-4.67.0.20241221-py3-none-any.whl", hash = "sha256:a1f1c9cda5c2d8482d2c73957a5398bfdedda10f6bc7b3b4e812d5c910486d29"},
|
||||
{file = "types_tqdm-4.67.0.20241221.tar.gz", hash = "sha256:e56046631056922385abe89aeb18af5611f471eadd7918a0ad7f34d84cd4c8cc"},
|
||||
{file = "types_tqdm-4.67.0.20250301-py3-none-any.whl", hash = "sha256:8af97deb8e6874af833555dc1fe0fcd456b1a789470bf6cd8813d4e7ee4f6c5b"},
|
||||
{file = "types_tqdm-4.67.0.20250301.tar.gz", hash = "sha256:5e89a38ad89b867823368eb97d9f90d2fc69806bb055dde62716a05da62b5e0d"},
|
||||
]
|
||||
|
||||
[package.dependencies]
|
||||
@ -7833,4 +7861,4 @@ vlm = ["accelerate", "transformers", "transformers"]
|
||||
[metadata]
|
||||
lock-version = "2.0"
|
||||
python-versions = "^3.9"
|
||||
content-hash = "1d4718b694098b0676f1ad1606d769887e51fc29f604e5f4c83dd5e1c90557e7"
|
||||
content-hash = "c37ae7d39cb2af7031248c2f0308c91160facafd948e982899245e5d8369bbbb"
|
||||
|
108
pyproject.toml
@ -1,24 +1,44 @@
|
||||
[tool.poetry]
|
||||
name = "docling"
|
||||
version = "2.25.2" # DO NOT EDIT, updated automatically
|
||||
version = "2.26.0" # DO NOT EDIT, updated automatically
|
||||
description = "SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications."
|
||||
authors = ["Christoph Auer <cau@zurich.ibm.com>", "Michele Dolfi <dol@zurich.ibm.com>", "Maxim Lysak <mly@zurich.ibm.com>", "Nikos Livathinos <nli@zurich.ibm.com>", "Ahmed Nassar <ahn@zurich.ibm.com>", "Panos Vagenas <pva@zurich.ibm.com>", "Peter Staar <taa@zurich.ibm.com>"]
|
||||
authors = [
|
||||
"Christoph Auer <cau@zurich.ibm.com>",
|
||||
"Michele Dolfi <dol@zurich.ibm.com>",
|
||||
"Maxim Lysak <mly@zurich.ibm.com>",
|
||||
"Nikos Livathinos <nli@zurich.ibm.com>",
|
||||
"Ahmed Nassar <ahn@zurich.ibm.com>",
|
||||
"Panos Vagenas <pva@zurich.ibm.com>",
|
||||
"Peter Staar <taa@zurich.ibm.com>",
|
||||
]
|
||||
license = "MIT"
|
||||
readme = "README.md"
|
||||
repository = "https://github.com/DS4SD/docling"
|
||||
homepage = "https://github.com/DS4SD/docling"
|
||||
keywords= ["docling", "convert", "document", "pdf", "docx", "html", "markdown", "layout model", "segmentation", "table structure", "table former"]
|
||||
classifiers = [
|
||||
"License :: OSI Approved :: MIT License",
|
||||
"Operating System :: MacOS :: MacOS X",
|
||||
"Operating System :: POSIX :: Linux",
|
||||
"Development Status :: 5 - Production/Stable",
|
||||
"Intended Audience :: Developers",
|
||||
"Intended Audience :: Science/Research",
|
||||
"Topic :: Scientific/Engineering :: Artificial Intelligence",
|
||||
"Programming Language :: Python :: 3"
|
||||
]
|
||||
packages = [{include = "docling"}]
|
||||
repository = "https://github.com/docling-project/docling"
|
||||
homepage = "https://github.com/docling-project/docling"
|
||||
keywords = [
|
||||
"docling",
|
||||
"convert",
|
||||
"document",
|
||||
"pdf",
|
||||
"docx",
|
||||
"html",
|
||||
"markdown",
|
||||
"layout model",
|
||||
"segmentation",
|
||||
"table structure",
|
||||
"table former",
|
||||
]
|
||||
classifiers = [
|
||||
"License :: OSI Approved :: MIT License",
|
||||
"Operating System :: MacOS :: MacOS X",
|
||||
"Operating System :: POSIX :: Linux",
|
||||
"Development Status :: 5 - Production/Stable",
|
||||
"Intended Audience :: Developers",
|
||||
"Intended Audience :: Science/Research",
|
||||
"Topic :: Scientific/Engineering :: Artificial Intelligence",
|
||||
"Programming Language :: Python :: 3",
|
||||
]
|
||||
packages = [{ include = "docling" }]
|
||||
|
||||
[tool.poetry.dependencies]
|
||||
######################
|
||||
@ -26,7 +46,7 @@ packages = [{include = "docling"}]
|
||||
######################
|
||||
python = "^3.9"
|
||||
pydantic = "^2.0.0"
|
||||
docling-core = {extras = ["chunking"], version = "^2.19.0"}
|
||||
docling-core = {extras = ["chunking"], version = "^2.22.0"}
|
||||
docling-ibm-models = "^3.4.0"
|
||||
docling-parse = "^3.3.0"
|
||||
filetype = "^1.2.0"
|
||||
@ -40,7 +60,7 @@ certifi = ">=2024.7.4"
|
||||
rtree = "^1.3.0"
|
||||
scipy = [
|
||||
{ version = "^1.6.0", markers = "python_version >= '3.10'" },
|
||||
{ version = ">=1.6.0,<1.14.0", markers = "python_version < '3.10'" }
|
||||
{ version = ">=1.6.0,<1.14.0", markers = "python_version < '3.10'" },
|
||||
]
|
||||
typer = "^0.12.5"
|
||||
python-docx = "^1.1.2"
|
||||
@ -56,21 +76,22 @@ onnxruntime = [
|
||||
# 1.19.2 is the last version with python3.9 support,
|
||||
# see https://github.com/microsoft/onnxruntime/releases/tag/v1.20.0
|
||||
{ version = ">=1.7.0,<1.20.0", optional = true, markers = "python_version < '3.10'" },
|
||||
{ version = "^1.7.0", optional = true, markers = "python_version >= '3.10'" }
|
||||
{ version = "^1.7.0", optional = true, markers = "python_version >= '3.10'" },
|
||||
]
|
||||
|
||||
transformers = [
|
||||
{markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^4.46.0", optional = true },
|
||||
{markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~4.42.0", optional = true }
|
||||
{ markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^4.46.0", optional = true },
|
||||
{ markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~4.42.0", optional = true },
|
||||
]
|
||||
accelerate = [
|
||||
{markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^1.2.1", optional = true },
|
||||
{ markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^1.2.1", optional = true },
|
||||
]
|
||||
pillow = ">=10.0.0,<12.0.0"
|
||||
tqdm = "^4.65.0"
|
||||
pylatexenc = "^2.10"
|
||||
|
||||
[tool.poetry.group.dev.dependencies]
|
||||
black = {extras = ["jupyter"], version = "^24.4.2"}
|
||||
black = { extras = ["jupyter"], version = "^24.4.2" }
|
||||
pytest = "^7.2.2"
|
||||
pre-commit = "^3.7.1"
|
||||
mypy = "^1.10.1"
|
||||
@ -93,7 +114,7 @@ types-tqdm = "^4.67.0.20241221"
|
||||
mkdocs-material = "^9.5.40"
|
||||
mkdocs-jupyter = "^0.25.0"
|
||||
mkdocs-click = "^0.8.1"
|
||||
mkdocstrings = {extras = ["python"], version = "^0.27.0"}
|
||||
mkdocstrings = { extras = ["python"], version = "^0.27.0" }
|
||||
griffe-pydantic = "^1.1.0"
|
||||
|
||||
[tool.poetry.group.examples.dependencies]
|
||||
@ -108,8 +129,8 @@ optional = true
|
||||
|
||||
[tool.poetry.group.constraints.dependencies]
|
||||
numpy = [
|
||||
{ version = ">=1.24.4,<3.0.0", markers = 'python_version >= "3.10"' },
|
||||
{ version = ">=1.24.4,<2.1.0", markers = 'python_version < "3.10"' },
|
||||
{ version = ">=1.24.4,<3.0.0", markers = 'python_version >= "3.10"' },
|
||||
{ version = ">=1.24.4,<2.1.0", markers = 'python_version < "3.10"' },
|
||||
]
|
||||
|
||||
[tool.poetry.group.mac_intel]
|
||||
@ -117,12 +138,12 @@ optional = true
|
||||
|
||||
[tool.poetry.group.mac_intel.dependencies]
|
||||
torch = [
|
||||
{markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^2.2.2"},
|
||||
{markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~2.2.2"}
|
||||
{ markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^2.2.2" },
|
||||
{ markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~2.2.2" },
|
||||
]
|
||||
torchvision = [
|
||||
{markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^0"},
|
||||
{markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~0.17.2"}
|
||||
{ markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^0" },
|
||||
{ markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~0.17.2" },
|
||||
]
|
||||
|
||||
[tool.poetry.extras]
|
||||
@ -147,7 +168,7 @@ include = '\.pyi?$'
|
||||
[tool.isort]
|
||||
profile = "black"
|
||||
line_length = 88
|
||||
py_version=39
|
||||
py_version = 39
|
||||
|
||||
[tool.mypy]
|
||||
pretty = true
|
||||
@ -158,18 +179,19 @@ python_version = "3.10"
|
||||
|
||||
[[tool.mypy.overrides]]
|
||||
module = [
|
||||
"docling_parse.*",
|
||||
"pypdfium2.*",
|
||||
"networkx.*",
|
||||
"scipy.*",
|
||||
"filetype.*",
|
||||
"tesserocr.*",
|
||||
"docling_ibm_models.*",
|
||||
"easyocr.*",
|
||||
"ocrmac.*",
|
||||
"lxml.*",
|
||||
"huggingface_hub.*",
|
||||
"transformers.*",
|
||||
"docling_parse.*",
|
||||
"pypdfium2.*",
|
||||
"networkx.*",
|
||||
"scipy.*",
|
||||
"filetype.*",
|
||||
"tesserocr.*",
|
||||
"docling_ibm_models.*",
|
||||
"easyocr.*",
|
||||
"ocrmac.*",
|
||||
"lxml.*",
|
||||
"huggingface_hub.*",
|
||||
"transformers.*",
|
||||
"pylatexenc.*",
|
||||
]
|
||||
ignore_missing_imports = true
|
||||
|
||||
|
BIN
tests/data/docx/equations.docx
Normal file
Binary file not shown.
@ -12,7 +12,7 @@
|
||||
</figure>
|
||||
<table>
|
||||
<location><page_1><loc_52><loc_62><loc_88><loc_71></location>
|
||||
<row_0><col_0><col_header>3</col_0><col_1><col_header>1</col_1></row_0>
|
||||
<row_0><col_0><col_header>1</col_0></row_0>
|
||||
</table>
|
||||
<paragraph><location><page_1><loc_52><loc_58><loc_79><loc_60></location>- b. Red-annotation of bounding boxes, Blue-predictions by TableFormer</paragraph>
|
||||
<paragraph><location><page_1><loc_52><loc_46><loc_80><loc_47></location>- c. Structure predicted by TableFormer:</paragraph>
|
||||
@ -25,11 +25,11 @@
|
||||
</figure>
|
||||
<table>
|
||||
<location><page_1><loc_52><loc_37><loc_88><loc_45></location>
|
||||
<row_0><col_0><col_header>0</col_0><col_1><col_header>1</col_1><col_2><col_header>1</col_2><col_3><col_header>2 1</col_3><col_4><col_header>2 1</col_4><col_5><body></col_5></row_0>
|
||||
<row_1><col_0><body>3</col_0><col_1><body>4</col_1><col_2><body>5 3</col_2><col_3><body>6</col_3><col_4><body>7</col_4><col_5><body></col_5></row_1>
|
||||
<row_2><col_0><body>8</col_0><col_1><body>9</col_1><col_2><body>10</col_2><col_3><body>11</col_3><col_4><body>12</col_4><col_5><body>2</col_5></row_2>
|
||||
<row_3><col_0><body></col_0><col_1><body>13</col_1><col_2><body>14</col_2><col_3><body>15</col_3><col_4><body>16</col_4><col_5><body>2</col_5></row_3>
|
||||
<row_4><col_0><body></col_0><col_1><body>17</col_1><col_2><body>18</col_2><col_3><body>19</col_3><col_4><body>20</col_4><col_5><body>2</col_5></row_4>
|
||||
<row_0><col_0><body>0</col_0><col_1><body>1 2 1</col_1><col_2><body>1 2 1</col_2><col_3><body>1 2 1</col_3><col_4><body>1 2 1</col_4></row_0>
|
||||
<row_1><col_0><body>3</col_0><col_1><body>4 3</col_1><col_2><body>5</col_2><col_3><body>6</col_3><col_4><body>7</col_4></row_1>
|
||||
<row_2><col_0><body>8 2</col_0><col_1><body>9</col_1><col_2><body>10</col_2><col_3><body>11</col_3><col_4><body>12</col_4></row_2>
|
||||
<row_3><col_0><body>13</col_0><col_1><body></col_1><col_2><body>14</col_2><col_3><body>15</col_3><col_4><body>16</col_4></row_3>
|
||||
<row_4><col_0><body>17</col_0><col_1><body>18</col_1><col_2><body></col_2><col_3><body>19</col_3><col_4><body>20</col_4></row_4>
|
||||
</table>
|
||||
<paragraph><location><page_1><loc_50><loc_16><loc_89><loc_26></location>Recently, significant progress has been made with vision based approaches to extract tables in documents. For the sake of completeness, the issue of table extraction from documents is typically decomposed into two separate challenges, i.e. (1) finding the location of the table(s) on a document-page and (2) finding the structure of a given table in the document.</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_10><loc_89><loc_16></location>The first problem is called table-location and has been previously addressed [30, 38, 19, 21, 23, 26, 8] with stateof-the-art object-detection networks (e.g. YOLO and later on Mask-RCNN [9]). For all practical purposes, it can be</paragraph>
|
||||
@ -138,9 +138,9 @@
|
||||
<location><page_7><loc_50><loc_62><loc_87><loc_69></location>
|
||||
<caption>Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Post-processing.</caption>
|
||||
<row_0><col_0><col_header>Model</col_0><col_1><col_header>Dataset</col_1><col_2><col_header>mAP</col_2><col_3><col_header>mAP (PP)</col_3></row_0>
|
||||
<row_1><col_0><body>EDD+BBox</col_0><col_1><body>PubTabNet</col_1><col_2><body>79.2</col_2><col_3><body>82.7</col_3></row_1>
|
||||
<row_2><col_0><body>TableFormer</col_0><col_1><body>PubTabNet</col_1><col_2><body>82.1</col_2><col_3><body>86.8</col_3></row_2>
|
||||
<row_3><col_0><body>TableFormer</col_0><col_1><body>SynthTabNet</col_1><col_2><body>87.7</col_2><col_3><body>-</col_3></row_3>
|
||||
<row_1><col_0><row_header>EDD+BBox</col_0><col_1><body>PubTabNet</col_1><col_2><body>79.2</col_2><col_3><body>82.7</col_3></row_1>
|
||||
<row_2><col_0><row_header>TableFormer</col_0><col_1><body>PubTabNet</col_1><col_2><body>82.1</col_2><col_3><body>86.8</col_3></row_2>
|
||||
<row_3><col_0><row_header>TableFormer</col_0><col_1><body>SynthTabNet</col_1><col_2><body>87.7</col_2><col_3><body>-</col_3></row_3>
|
||||
</table>
|
||||
<caption><location><page_7><loc_50><loc_57><loc_89><loc_60></location>Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Post-processing.</caption>
|
||||
<paragraph><location><page_7><loc_50><loc_34><loc_89><loc_54></location>Cell Content. In this section, we evaluate the entire pipeline of recovering a table with content. Here we put our approach to test by capitalizing on extracting content from the PDF cells rather than decoding from images. Tab. 4 shows the TEDs score of HTML code representing the structure of the table along with the content inserted in the data cell and compared with the ground-truth. Our method achieved a 5.3% increase over the state-of-the-art, and commercial solutions. We believe our scores would be higher if the HTML ground-truth matched the extracted PDF cell content. Unfortunately, there are small discrepancies such as spacings around words or special characters with various unicode representations.</paragraph>
|
||||
@ -179,7 +179,7 @@
|
||||
<row_6><col_0><row_header>第 17 回人工知能学会全国大会 (2003)</col_0><col_1><body>208</col_1><col_2><body>5</col_2><col_3><body>203</col_3><col_4><body>152</col_4><col_5><body>244</col_5></row_6>
|
||||
<row_7><col_0><row_header>自然言語処理研究会第 146 〜 155 回</col_0><col_1><body>98</col_1><col_2><body>2</col_2><col_3><body>96</col_3><col_4><body>150</col_4><col_5><body>232</col_5></row_7>
|
||||
<row_8><col_0><row_header>WWW から収集した論文</col_0><col_1><body>107</col_1><col_2><body>73</col_2><col_3><body>34</col_3><col_4><body>147</col_4><col_5><body>96</col_5></row_8>
|
||||
<row_9><col_0><body></col_0><col_1><body>945</col_1><col_2><body>294</col_2><col_3><body>651</col_3><col_4><body>1122</col_4><col_5><body>955</col_5></row_9>
|
||||
<row_9><col_0><row_header>計</col_0><col_1><body>945</col_1><col_2><body>294</col_2><col_3><body>651</col_3><col_4><body>1122</col_4><col_5><body>955</col_5></row_9>
|
||||
</table>
|
||||
<caption><location><page_8><loc_62><loc_62><loc_90><loc_63></location>Text is aligned to match original for ease of viewing</caption>
|
||||
<table>
File diff suppressed because one or more lines are too long
@ -25,12 +25,12 @@ The occurrence of tables in documents is ubiquitous. They often summarise quanti
|
||||
Figure 1: Picture of a table with subtle, complex features such as (1) multi-column headers, (2) cell with multi-row text and (3) cells with no content. Image from PubTabNet evaluation set, filename: 'PMC2944238 004 02'.
|
||||
<!-- image -->
|
||||
|
||||
| 0 | 1 | 1 | 2 1 | 2 1 | |
|
||||
|-----|-----|-----|-------|-------|----|
|
||||
| 3 | 4 | 5 3 | 6 | 7 | |
|
||||
| 8 | 9 | 10 | 11 | 12 | 2 |
|
||||
| | 13 | 14 | 15 | 16 | 2 |
|
||||
| | 17 | 18 | 19 | 20 | 2 |
|
||||
| 0 | 1 2 1 | 1 2 1 | 1 2 1 | 1 2 1 |
|
||||
|-----|---------|---------|---------|---------|
|
||||
| 3 | 4 3 | 5 | 6 | 7 |
|
||||
| 8 2 | 9 | 10 | 11 | 12 |
|
||||
| 13 | | 14 | 15 | 16 |
|
||||
| 17 | 18 | | 19 | 20 |
|
||||
|
||||
Recently, significant progress has been made with vision based approaches to extract tables in documents. For the sake of completeness, the issue of table extraction from documents is typically decomposed into two separate challenges, i.e. (1) finding the location of the table(s) on a document-page and (2) finding the structure of a given table in the document.
|
||||
|
||||
@ -241,7 +241,7 @@ Text is aligned to match original for ease of viewing
|
||||
| 第 17 回人工知能学会全国大会 (2003) | 208 | 5 | 203 | 152 | 244 |
|
||||
| 自然言語処理研究会第 146 〜 155 回 | 98 | 2 | 96 | 150 | 232 |
|
||||
| WWW から収集した論文 | 107 | 73 | 34 | 147 | 96 |
|
||||
| | 945 | 294 | 651 | 1122 | 955 |
|
||||
| 計 | 945 | 294 | 651 | 1122 | 955 |
|
||||
|
||||
| | Shares (in millions) | Shares (in millions) | Weighted Average Grant Date Fair Value | Weighted Average Grant Date Fair Value |
|
||||
|--------------------------|------------------------|------------------------|------------------------------------------|------------------------------------------|
File diff suppressed because one or more lines are too long
@ -56,7 +56,7 @@
|
||||
<table>
|
||||
<location><page_4><loc_16><loc_63><loc_84><loc_83></location>
|
||||
<caption>Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row "Total") in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.</caption>
|
||||
<row_0><col_0><body></col_0><col_1><body></col_1><col_2><col_header>% of Total</col_2><col_3><col_header>% of Total</col_3><col_4><col_header>% of Total</col_4><col_5><col_header>% of Total</col_5><col_6><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_6><col_7><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_7><col_8><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_8><col_9><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_9><col_10><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_10><col_11><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_11></row_0>
|
||||
<row_0><col_0><body></col_0><col_1><body></col_1><col_2><col_header>% of Total</col_2><col_3><col_header>% of Total</col_3><col_4><col_header>% of Total</col_4><col_5><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_5><col_6><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_6><col_7><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_7><col_8><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_8><col_9><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_9><col_10><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_10><col_11><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_11></row_0>
|
||||
<row_1><col_0><col_header>class label</col_0><col_1><col_header>Count</col_1><col_2><col_header>Train</col_2><col_3><col_header>Test</col_3><col_4><col_header>Val</col_4><col_5><col_header>All</col_5><col_6><col_header>Fin</col_6><col_7><col_header>Man</col_7><col_8><col_header>Sci</col_8><col_9><col_header>Law</col_9><col_10><col_header>Pat</col_10><col_11><col_header>Ten</col_11></row_1>
|
||||
<row_2><col_0><row_header>Caption</col_0><col_1><body>22524</col_1><col_2><body>2.04</col_2><col_3><body>1.77</col_3><col_4><body>2.32</col_4><col_5><body>84-89</col_5><col_6><body>40-61</col_6><col_7><body>86-92</col_7><col_8><body>94-99</col_8><col_9><body>95-99</col_9><col_10><body>69-78</col_10><col_11><body>n/a</col_11></row_2>
|
||||
<row_3><col_0><row_header>Footnote</col_0><col_1><body>6318</col_1><col_2><body>0.60</col_2><col_3><body>0.31</col_3><col_4><body>0.58</col_4><col_5><body>83-91</col_5><col_6><body>n/a</col_6><col_7><body>100</col_7><col_8><body>62-88</col_8><col_9><body>85-94</col_9><col_10><body>n/a</col_10><col_11><body>82-97</col_11></row_3>
|
||||
@ -102,7 +102,7 @@
|
||||
<table>
|
||||
<location><page_6><loc_10><loc_56><loc_47><loc_75></location>
|
||||
<row_0><col_0><body></col_0><col_1><col_header>human</col_1><col_2><col_header>MRCNN</col_2><col_3><col_header>MRCNN</col_3><col_4><col_header>FRCNN</col_4><col_5><col_header>YOLO</col_5></row_0>
|
||||
<row_1><col_0><body></col_0><col_1><col_header>human</col_1><col_2><col_header>R50</col_2><col_3><col_header>R101</col_3><col_4><col_header>R101</col_4><col_5><col_header>v5x6</col_5></row_1>
|
||||
<row_1><col_0><body></col_0><col_1><body></col_1><col_2><col_header>R50</col_2><col_3><col_header>R101</col_3><col_4><col_header>R101</col_4><col_5><col_header>v5x6</col_5></row_1>
|
||||
<row_2><col_0><row_header>Caption</col_0><col_1><body>84-89</col_1><col_2><body>68.4</col_2><col_3><body>71.5</col_3><col_4><body>70.1</col_4><col_5><body>77.7</col_5></row_2>
|
||||
<row_3><col_0><row_header>Footnote</col_0><col_1><body>83-91</col_1><col_2><body>70.9</col_2><col_3><body>71.8</col_3><col_4><body>73.7</col_4><col_5><body>77.2</col_5></row_3>
|
||||
<row_4><col_0><row_header>Formula</col_0><col_1><body>83-85</col_1><col_2><body>60.1</col_2><col_3><body>63.4</col_3><col_4><body>63.5</col_4><col_5><body>66.2</col_5></row_4>
|
||||
@ -130,7 +130,7 @@
|
||||
<paragraph><location><page_7><loc_9><loc_84><loc_48><loc_89></location>Table 3: Performance of a Mask R-CNN R50 network in mAP@0.5-0.95 scores trained on DocLayNet with different class label sets. The reduced label sets were obtained by either down-mapping or dropping labels.</paragraph>
|
||||
<table>
|
||||
<location><page_7><loc_13><loc_63><loc_44><loc_81></location>
|
||||
<row_0><col_0><col_header>Class-count</col_0><col_1><col_header>11</col_1><col_2><col_header>6</col_2><col_3><col_header>5</col_3><col_4><col_header>4</col_4></row_0>
|
||||
<row_0><col_0><body>Class-count</col_0><col_1><col_header>11</col_1><col_2><col_header>6</col_2><col_3><col_header>5</col_3><col_4><col_header>4</col_4></row_0>
|
||||
<row_1><col_0><row_header>Caption</col_0><col_1><body>68</col_1><col_2><body>Text</col_2><col_3><body>Text</col_3><col_4><body>Text</col_4></row_1>
|
||||
<row_2><col_0><row_header>Footnote</col_0><col_1><body>71</col_1><col_2><body>Text</col_2><col_3><body>Text</col_3><col_4><body>Text</col_4></row_2>
|
||||
<row_3><col_0><row_header>Formula</col_0><col_1><body>60</col_1><col_2><body>Text</col_2><col_3><body>Text</col_3><col_4><body>Text</col_4></row_3>
|
||||
@ -178,17 +178,17 @@
|
||||
<row_1><col_0><col_header>Training on</col_0><col_1><col_header>labels</col_1><col_2><col_header>PLN</col_2><col_3><col_header>DB</col_3><col_4><col_header>DLN</col_4></row_1>
|
||||
<row_2><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>Figure</col_1><col_2><body>96</col_2><col_3><body>43</col_3><col_4><body>23</col_4></row_2>
|
||||
<row_3><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>Sec-header</col_1><col_2><body>87</col_2><col_3><body>-</col_3><col_4><body>32</col_4></row_3>
|
||||
<row_4><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>Table</col_1><col_2><body>95</col_2><col_3><body>24</col_3><col_4><body>49</col_4></row_4>
|
||||
<row_5><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>Text</col_1><col_2><body>96</col_2><col_3><body>-</col_3><col_4><body>42</col_4></row_5>
|
||||
<row_6><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>total</col_1><col_2><body>93</col_2><col_3><body>34</col_3><col_4><body>30</col_4></row_6>
|
||||
<row_4><col_0><body></col_0><col_1><row_header>Table</col_1><col_2><body>95</col_2><col_3><body>24</col_3><col_4><body>49</col_4></row_4>
|
||||
<row_5><col_0><body></col_0><col_1><row_header>Text</col_1><col_2><body>96</col_2><col_3><body>-</col_3><col_4><body>42</col_4></row_5>
|
||||
<row_6><col_0><body></col_0><col_1><row_header>total</col_1><col_2><body>93</col_2><col_3><body>34</col_3><col_4><body>30</col_4></row_6>
|
||||
<row_7><col_0><row_header>DocBank (DB)</col_0><col_1><row_header>Figure</col_1><col_2><body>77</col_2><col_3><body>71</col_3><col_4><body>31</col_4></row_7>
|
||||
<row_8><col_0><row_header>DocBank (DB)</col_0><col_1><row_header>Table</col_1><col_2><body>19</col_2><col_3><body>65</col_3><col_4><body>22</col_4></row_8>
|
||||
<row_9><col_0><row_header>DocBank (DB)</col_0><col_1><row_header>total</col_1><col_2><body>48</col_2><col_3><body>68</col_3><col_4><body>27</col_4></row_9>
|
||||
<row_10><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>Figure</col_1><col_2><body>67</col_2><col_3><body>51</col_3><col_4><body>72</col_4></row_10>
|
||||
<row_11><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>Sec-header</col_1><col_2><body>53</col_2><col_3><body>-</col_3><col_4><body>68</col_4></row_11>
|
||||
<row_12><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>Table</col_1><col_2><body>87</col_2><col_3><body>43</col_3><col_4><body>82</col_4></row_12>
|
||||
<row_13><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>Text</col_1><col_2><body>77</col_2><col_3><body>-</col_3><col_4><body>84</col_4></row_13>
|
||||
<row_14><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>total</col_1><col_2><body>59</col_2><col_3><body>47</col_3><col_4><body>78</col_4></row_14>
|
||||
<row_12><col_0><body></col_0><col_1><row_header>Table</col_1><col_2><body>87</col_2><col_3><body>43</col_3><col_4><body>82</col_4></row_12>
|
||||
<row_13><col_0><body></col_0><col_1><row_header>Text</col_1><col_2><body>77</col_2><col_3><body>-</col_3><col_4><body>84</col_4></row_13>
|
||||
<row_14><col_0><body></col_0><col_1><row_header>total</col_1><col_2><body>59</col_2><col_3><body>47</col_3><col_4><body>78</col_4></row_14>
|
||||
</table>
|
||||
<paragraph><location><page_8><loc_9><loc_44><loc_48><loc_51></location>Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .</paragraph>
|
||||
<paragraph><location><page_8><loc_9><loc_26><loc_48><loc_44></location>For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. We had to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.</paragraph>
File diff suppressed because one or more lines are too long
@ -98,21 +98,21 @@ The annotation campaign was carried out in four phases. In phase one, we identif
|
||||
|
||||
Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row "Total") in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.
|
||||
|
||||
| | | % of Total | % of Total | % of Total | % of Total | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) |
|
||||
|----------------|---------|--------------|--------------|--------------|--------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
|
||||
| class label | Count | Train | Test | Val | All | Fin | Man | Sci | Law | Pat | Ten |
|
||||
| Caption | 22524 | 2.04 | 1.77 | 2.32 | 84-89 | 40-61 | 86-92 | 94-99 | 95-99 | 69-78 | n/a |
|
||||
| Footnote | 6318 | 0.60 | 0.31 | 0.58 | 83-91 | n/a | 100 | 62-88 | 85-94 | n/a | 82-97 |
|
||||
| Formula | 25027 | 2.25 | 1.90 | 2.96 | 83-85 | n/a | n/a | 84-87 | 86-96 | n/a | n/a |
|
||||
| List-item | 185660 | 17.19 | 13.34 | 15.82 | 87-88 | 74-83 | 90-92 | 97-97 | 81-85 | 75-88 | 93-95 |
|
||||
| Page-footer | 70878 | 6.51 | 5.58 | 6.00 | 93-94 | 88-90 | 95-96 | 100 | 92-97 | 100 | 96-98 |
|
||||
| Page-header | 58022 | 5.10 | 6.70 | 5.06 | 85-89 | 66-76 | 90-94 | 98-100 | 91-92 | 97-99 | 81-86 |
|
||||
| Picture | 45976 | 4.21 | 2.78 | 5.31 | 69-71 | 56-59 | 82-86 | 69-82 | 80-95 | 66-71 | 59-76 |
|
||||
| Section-header | 142884 | 12.60 | 15.77 | 12.85 | 83-84 | 76-81 | 90-92 | 94-95 | 87-94 | 69-73 | 78-86 |
|
||||
| Table | 34733 | 3.20 | 2.27 | 3.60 | 77-81 | 75-80 | 83-86 | 98-99 | 58-80 | 79-84 | 70-85 |
|
||||
| Text | 510377 | 45.82 | 49.28 | 45.00 | 84-86 | 81-86 | 88-93 | 89-93 | 87-92 | 71-79 | 87-95 |
|
||||
| Title | 5071 | 0.47 | 0.30 | 0.50 | 60-72 | 24-63 | 50-63 | 94-100 | 82-96 | 68-79 | 24-56 |
|
||||
| Total | 1107470 | 941123 | 99816 | 66531 | 82-83 | 71-74 | 79-81 | 89-94 | 86-91 | 71-76 | 68-85 |
|
||||
| | | % of Total | % of Total | % of Total | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) |
|
||||
|----------------|---------|--------------|--------------|--------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
|
||||
| class label | Count | Train | Test | Val | All | Fin | Man | Sci | Law | Pat | Ten |
|
||||
| Caption | 22524 | 2.04 | 1.77 | 2.32 | 84-89 | 40-61 | 86-92 | 94-99 | 95-99 | 69-78 | n/a |
|
||||
| Footnote | 6318 | 0.60 | 0.31 | 0.58 | 83-91 | n/a | 100 | 62-88 | 85-94 | n/a | 82-97 |
|
||||
| Formula | 25027 | 2.25 | 1.90 | 2.96 | 83-85 | n/a | n/a | 84-87 | 86-96 | n/a | n/a |
|
||||
| List-item | 185660 | 17.19 | 13.34 | 15.82 | 87-88 | 74-83 | 90-92 | 97-97 | 81-85 | 75-88 | 93-95 |
|
||||
| Page-footer | 70878 | 6.51 | 5.58 | 6.00 | 93-94 | 88-90 | 95-96 | 100 | 92-97 | 100 | 96-98 |
|
||||
| Page-header | 58022 | 5.10 | 6.70 | 5.06 | 85-89 | 66-76 | 90-94 | 98-100 | 91-92 | 97-99 | 81-86 |
|
||||
| Picture | 45976 | 4.21 | 2.78 | 5.31 | 69-71 | 56-59 | 82-86 | 69-82 | 80-95 | 66-71 | 59-76 |
|
||||
| Section-header | 142884 | 12.60 | 15.77 | 12.85 | 83-84 | 76-81 | 90-92 | 94-95 | 87-94 | 69-73 | 78-86 |
|
||||
| Table | 34733 | 3.20 | 2.27 | 3.60 | 77-81 | 75-80 | 83-86 | 98-99 | 58-80 | 79-84 | 70-85 |
|
||||
| Text | 510377 | 45.82 | 49.28 | 45.00 | 84-86 | 81-86 | 88-93 | 89-93 | 87-92 | 71-79 | 87-95 |
|
||||
| Title | 5071 | 0.47 | 0.30 | 0.50 | 60-72 | 24-63 | 50-63 | 94-100 | 82-96 | 68-79 | 24-56 |
|
||||
| Total | 1107470 | 941123 | 99816 | 66531 | 82-83 | 71-74 | 79-81 | 89-94 | 86-91 | 71-76 | 68-85 |
|
||||
|
||||
Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.
|
||||
<!-- image -->
|
||||
@ -161,7 +161,7 @@ Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on D
|
||||
|
||||
| | human | MRCNN | MRCNN | FRCNN | YOLO |
|
||||
|----------------|---------|---------|---------|---------|--------|
|
||||
| | human | R50 | R101 | R101 | v5x6 |
|
||||
| | | R50 | R101 | R101 | v5x6 |
|
||||
| Caption | 84-89 | 68.4 | 71.5 | 70.1 | 77.7 |
|
||||
| Footnote | 83-91 | 70.9 | 71.8 | 73.7 | 77.2 |
|
||||
| Formula | 83-85 | 60.1 | 63.4 | 63.5 | 66.2 |
|
||||
@ -252,17 +252,17 @@ Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network acros
|
||||
| Training on | labels | PLN | DB | DLN |
|
||||
| PubLayNet (PLN) | Figure | 96 | 43 | 23 |
|
||||
| PubLayNet (PLN) | Sec-header | 87 | - | 32 |
|
||||
| PubLayNet (PLN) | Table | 95 | 24 | 49 |
|
||||
| PubLayNet (PLN) | Text | 96 | - | 42 |
|
||||
| PubLayNet (PLN) | total | 93 | 34 | 30 |
|
||||
| | Table | 95 | 24 | 49 |
|
||||
| | Text | 96 | - | 42 |
|
||||
| | total | 93 | 34 | 30 |
|
||||
| DocBank (DB) | Figure | 77 | 71 | 31 |
|
||||
| DocBank (DB) | Table | 19 | 65 | 22 |
|
||||
| DocBank (DB) | total | 48 | 68 | 27 |
|
||||
| DocLayNet (DLN) | Figure | 67 | 51 | 72 |
|
||||
| DocLayNet (DLN) | Sec-header | 53 | - | 68 |
|
||||
| DocLayNet (DLN) | Table | 87 | 43 | 82 |
|
||||
| DocLayNet (DLN) | Text | 77 | - | 84 |
|
||||
| DocLayNet (DLN) | total | 59 | 47 | 78 |
|
||||
| | Table | 87 | 43 | 82 |
|
||||
| | Text | 77 | - | 84 |
|
||||
| | total | 59 | 47 | 78 |
|
||||
|
||||
Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .
File diff suppressed because one or more lines are too long
@ -5,13 +5,12 @@
|
||||
<table>
|
||||
<location><page_1><loc_23><loc_41><loc_78><loc_57></location>
|
||||
<caption>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption>
|
||||
<row_0><col_0><col_header>#</col_0><col_1><col_header>#</col_1><col_2><col_header>Language</col_2><col_3><col_header>TEDs</col_3><col_4><col_header>TEDs</col_4><col_5><col_header>TEDs</col_5><col_6><col_header>mAP</col_6><col_7><col_header>Inference</col_7></row_0>
|
||||
<row_1><col_0><col_header>enc-layers</col_0><col_1><col_header>dec-layers</col_1><col_2><col_header>Language</col_2><col_3><col_header>simple</col_3><col_4><col_header>complex</col_4><col_5><col_header>all</col_5><col_6><col_header>(0.75)</col_6><col_7><col_header>time (secs)</col_7></row_1>
|
||||
<row_0><col_0><col_header># enc-layers</col_0><col_1><col_header># dec-layers</col_1><col_2><col_header>Language</col_2><col_3><col_header>TEDs</col_3><col_4><col_header>TEDs</col_4><col_5><col_header>TEDs</col_5><col_6><col_header>mAP</col_6><col_7><col_header>Inference</col_7></row_0>
|
||||
<row_1><col_0><col_header># enc-layers</col_0><col_1><col_header># dec-layers</col_1><col_2><col_header>Language</col_2><col_3><col_header>simple</col_3><col_4><col_header>complex</col_4><col_5><col_header>all</col_5><col_6><col_header>(0.75)</col_6><col_7><col_header>time (secs)</col_7></row_1>
|
||||
<row_2><col_0><body>6</col_0><col_1><body>6</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.965 0.969</col_3><col_4><body>0.934 0.927</col_4><col_5><body>0.955 0.955</col_5><col_6><body>0.88 0.857</col_6><col_7><body>2.73 5.39</col_7></row_2>
|
||||
<row_3><col_0><body>4</col_0><col_1><body>4</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.938</col_3><col_4><body>0.904</col_4><col_5><body>0.927</col_5><col_6><body>0.853</col_6><col_7><body>1.97</col_7></row_3>
|
||||
<row_4><col_0><body></col_0><col_1><body></col_1><col_2><body>OTSL</col_2><col_3><body>0.952 0.923</col_3><col_4><body>0.909</col_4><col_5><body>0.938</col_5><col_6><body>0.843</col_6><col_7><body>3.77</col_7></row_4>
|
||||
<row_5><col_0><body>2</col_0><col_1><body>4</col_1><col_2><body>HTML</col_2><col_3><body>0.945</col_3><col_4><body>0.897 0.901</col_4><col_5><body>0.915 0.931</col_5><col_6><body>0.859 0.834</col_6><col_7><body>1.91 3.81</col_7></row_5>
|
||||
<row_6><col_0><body>4</col_0><col_1><body>2</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.952 0.944</col_3><col_4><body>0.92 0.903</col_4><col_5><body>0.942 0.931</col_5><col_6><body>0.857 0.824</col_6><col_7><body>1.22 2</col_7></row_6>
|
||||
<row_3><col_0><body>4</col_0><col_1><body>4</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.938 0.952</col_3><col_4><body>0.904 0.909</col_4><col_5><body>0.927 0.938</col_5><col_6><body>0.853 0.843</col_6><col_7><body>1.97 3.77</col_7></row_3>
|
||||
<row_4><col_0><body>2</col_0><col_1><body>4</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.923 0.945</col_3><col_4><body>0.897 0.901</col_4><col_5><body>0.915 0.931</col_5><col_6><body>0.859 0.834</col_6><col_7><body>1.91 3.81</col_7></row_4>
|
||||
<row_5><col_0><body>4</col_0><col_1><body>2</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.952 0.944</col_3><col_4><body>0.92 0.903</col_4><col_5><body>0.942 0.931</col_5><col_6><body>0.857 0.824</col_6><col_7><body>1.22 2</col_7></row_5>
|
||||
</table>
|
||||
<caption><location><page_1><loc_22><loc_59><loc_79><loc_66></location>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption>
|
||||
<subtitle-level-1><location><page_1><loc_22><loc_35><loc_43><loc_36></location>5.2 Quantitative Results</subtitle-level-1>
File diff suppressed because one or more lines are too long
@ -6,14 +6,13 @@ We have chosen the PubTabNet data set to perform HPO, since it includes a highly
|
||||
|
||||
Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.
|
||||
|
||||
| # | # | Language | TEDs | TEDs | TEDs | mAP | Inference |
|
||||
|------------|------------|------------|-------------|-------------|-------------|-------------|-------------|
|
||||
| enc-layers | dec-layers | Language | simple | complex | all | (0.75) | time (secs) |
|
||||
| 6 | 6 | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
|
||||
| 4 | 4 | OTSL HTML | 0.938 | 0.904 | 0.927 | 0.853 | 1.97 |
|
||||
| | | OTSL | 0.952 0.923 | 0.909 | 0.938 | 0.843 | 3.77 |
|
||||
| 2 | 4 | HTML | 0.945 | 0.897 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81 |
|
||||
| 4 | 2 | OTSL HTML | 0.952 0.944 | 0.92 0.903 | 0.942 0.931 | 0.857 0.824 | 1.22 2 |
|
||||
| # enc-layers | # dec-layers | Language | TEDs | TEDs | TEDs | mAP | Inference |
|
||||
|----------------|----------------|------------|-------------|-------------|-------------|-------------|-------------|
|
||||
| # enc-layers | # dec-layers | Language | simple | complex | all | (0.75) | time (secs) |
|
||||
| 6 | 6 | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
|
||||
| 4 | 4 | OTSL HTML | 0.938 0.952 | 0.904 0.909 | 0.927 0.938 | 0.853 0.843 | 1.97 3.77 |
|
||||
| 2 | 4 | OTSL HTML | 0.923 0.945 | 0.897 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81 |
|
||||
| 4 | 2 | OTSL HTML | 0.952 0.944 | 0.92 0.903 | 0.942 0.931 | 0.857 0.824 | 1.22 2 |
|
||||
|
||||
## 5.2 Quantitative Results
File diff suppressed because one or more lines are too long
@ -77,13 +77,12 @@
|
||||
<table>
|
||||
<location><page_9><loc_23><loc_41><loc_78><loc_57></location>
|
||||
<caption>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption>
|
||||
<row_0><col_0><col_header>#</col_0><col_1><col_header>#</col_1><col_2><col_header>Language</col_2><col_3><col_header>TEDs</col_3><col_4><col_header>TEDs</col_4><col_5><col_header>TEDs</col_5><col_6><col_header>mAP</col_6><col_7><col_header>Inference</col_7></row_0>
|
||||
<row_1><col_0><col_header>enc-layers</col_0><col_1><col_header>dec-layers</col_1><col_2><col_header>Language</col_2><col_3><col_header>simple</col_3><col_4><col_header>complex</col_4><col_5><col_header>all</col_5><col_6><col_header>(0.75)</col_6><col_7><col_header>time (secs)</col_7></row_1>
|
||||
<row_0><col_0><col_header># enc-layers</col_0><col_1><col_header># dec-layers</col_1><col_2><col_header>Language</col_2><col_3><col_header>TEDs</col_3><col_4><col_header>TEDs</col_4><col_5><col_header>TEDs</col_5><col_6><col_header>mAP</col_6><col_7><col_header>Inference</col_7></row_0>
|
||||
<row_1><col_0><col_header># enc-layers</col_0><col_1><col_header># dec-layers</col_1><col_2><col_header>Language</col_2><col_3><col_header>simple</col_3><col_4><col_header>complex</col_4><col_5><col_header>all</col_5><col_6><col_header>(0.75)</col_6><col_7><col_header>time (secs)</col_7></row_1>
|
||||
<row_2><col_0><body>6</col_0><col_1><body>6</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.965 0.969</col_3><col_4><body>0.934 0.927</col_4><col_5><body>0.955 0.955</col_5><col_6><body>0.88 0.857</col_6><col_7><body>2.73 5.39</col_7></row_2>
|
||||
<row_3><col_0><body>4</col_0><col_1><body>4</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.938 0.952</col_3><col_4><body>0.904</col_4><col_5><body>0.927</col_5><col_6><body>0.853</col_6><col_7><body>1.97</col_7></row_3>
|
||||
<row_4><col_0><body>2</col_0><col_1><body>4</col_1><col_2><body>OTSL</col_2><col_3><body>0.923 0.945</col_3><col_4><body>0.909 0.897</col_4><col_5><body>0.938</col_5><col_6><body>0.843</col_6><col_7><body>3.77</col_7></row_4>
|
||||
<row_5><col_0><body></col_0><col_1><body></col_1><col_2><body>HTML</col_2><col_3><body></col_3><col_4><body>0.901</col_4><col_5><body>0.915 0.931</col_5><col_6><body>0.859 0.834</col_6><col_7><body>1.91 3.81</col_7></row_5>
|
||||
<row_6><col_0><body>4</col_0><col_1><body>2</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.952 0.944</col_3><col_4><body>0.92 0.903</col_4><col_5><body>0.942 0.931</col_5><col_6><body>0.857 0.824</col_6><col_7><body>1.22 2</col_7></row_6>
|
||||
<row_3><col_0><body>4</col_0><col_1><body>4</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.938 0.952</col_3><col_4><body>0.904 0.909</col_4><col_5><body>0.927 0.938</col_5><col_6><body>0.853 0.843</col_6><col_7><body>1.97 3.77</col_7></row_3>
|
||||
<row_4><col_0><body>2</col_0><col_1><body>4</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.923 0.945</col_3><col_4><body>0.897 0.901</col_4><col_5><body>0.915 0.931</col_5><col_6><body>0.859 0.834</col_6><col_7><body>1.91 3.81</col_7></row_4>
|
||||
<row_5><col_0><body>4</col_0><col_1><body>2</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.952 0.944</col_3><col_4><body>0.92 0.903</col_4><col_5><body>0.942 0.931</col_5><col_6><body>0.857 0.824</col_6><col_7><body>1.22 2</col_7></row_5>
|
||||
</table>
|
||||
<caption><location><page_9><loc_22><loc_59><loc_79><loc_65></location>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption>
|
||||
<subtitle-level-1><location><page_9><loc_22><loc_35><loc_43><loc_36></location>5.2 Quantitative Results</subtitle-level-1>
|
||||
@ -92,14 +91,11 @@
|
||||
<table>
|
||||
<location><page_10><loc_23><loc_67><loc_77><loc_80></location>
|
||||
<caption>Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using TableFormer [9] (with enc=6, dec=6, heads=8).</caption>
|
||||
<row_0><col_0><body></col_0><col_1><col_header>Language</col_1><col_2><col_header>TEDs</col_2><col_3><col_header>TEDs</col_3><col_4><col_header>TEDs</col_4><col_5><col_header>mAP(0.75)</col_5><col_6><col_header>Inference time (secs)</col_6></row_0>
|
||||
<row_1><col_0><body></col_0><col_1><col_header>Language</col_1><col_2><col_header>simple</col_2><col_3><col_header>complex</col_3><col_4><col_header>all</col_4><col_5><col_header>mAP(0.75)</col_5><col_6><col_header>Inference time (secs)</col_6></row_1>
|
||||
<row_2><col_0><row_header>PubTabNet</col_0><col_1><row_header>OTSL</col_1><col_2><body>0.965</col_2><col_3><body>0.934</col_3><col_4><body>0.955</col_4><col_5><body>0.88</col_5><col_6><body>2.73</col_6></row_2>
|
||||
<row_3><col_0><row_header>PubTabNet</col_0><col_1><row_header>HTML</col_1><col_2><body>0.969</col_2><col_3><body>0.927</col_3><col_4><body>0.955</col_4><col_5><body>0.857</col_5><col_6><body>5.39</col_6></row_3>
|
||||
<row_4><col_0><row_header>FinTabNet</col_0><col_1><row_header>OTSL</col_1><col_2><body>0.955</col_2><col_3><body>0.961</col_3><col_4><body>0.959</col_4><col_5><body>0.862</col_5><col_6><body>1.85</col_6></row_4>
|
||||
<row_5><col_0><row_header>FinTabNet</col_0><col_1><row_header>HTML</col_1><col_2><body>0.917</col_2><col_3><body>0.922</col_3><col_4><body>0.92</col_4><col_5><body>0.722</col_5><col_6><body>3.26</col_6></row_5>
|
||||
<row_6><col_0><row_header>PubTables-1M</col_0><col_1><row_header>OTSL</col_1><col_2><body>0.987</col_2><col_3><body>0.964</col_3><col_4><body>0.977</col_4><col_5><body>0.896</col_5><col_6><body>1.79</col_6></row_6>
|
||||
<row_7><col_0><row_header>PubTables-1M</col_0><col_1><row_header>HTML</col_1><col_2><body>0.983</col_2><col_3><body>0.944</col_3><col_4><body>0.966</col_4><col_5><body>0.889</col_5><col_6><body>3.26</col_6></row_7>
|
||||
<row_0><col_0><col_header>Data set</col_0><col_1><col_header>Language</col_1><col_2><col_header>TEDs</col_2><col_3><col_header>TEDs</col_3><col_4><col_header>TEDs</col_4><col_5><col_header>mAP(0.75)</col_5><col_6><col_header>Inference time (secs)</col_6></row_0>
|
||||
<row_1><col_0><col_header>Data set</col_0><col_1><col_header>Language</col_1><col_2><col_header>simple</col_2><col_3><col_header>complex</col_3><col_4><col_header>all</col_4><col_5><col_header>mAP(0.75)</col_5><col_6><col_header>Inference time (secs)</col_6></row_1>
|
||||
<row_2><col_0><body>PubTabNet</col_0><col_1><body>OTSL HTML</col_1><col_2><body>0.965 0.969</col_2><col_3><body>0.934 0.927</col_3><col_4><body>0.955 0.955</col_4><col_5><body>0.88 0.857</col_5><col_6><body>2.73 5.39</col_6></row_2>
|
||||
<row_3><col_0><body>FinTabNet</col_0><col_1><body>OTSL HTML</col_1><col_2><body>0.955 0.917</col_2><col_3><body>0.961 0.922</col_3><col_4><body>0.959 0.92</col_4><col_5><body>0.862 0.722</col_5><col_6><body>1.85 3.26</col_6></row_3>
|
||||
<row_4><col_0><body>PubTables-1M</col_0><col_1><body>OTSL HTML</col_1><col_2><body>0.987 0.983</col_2><col_3><body>0.964 0.944</col_3><col_4><body>0.977 0.966</col_4><col_5><body>0.896 0.889</col_5><col_6><body>1.79 3.26</col_6></row_4>
|
||||
</table>
|
||||
<caption><location><page_10><loc_22><loc_82><loc_79><loc_85></location>Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using TableFormer [9] (with enc=6, dec=6, heads=8).</caption>
|
||||
<subtitle-level-1><location><page_10><loc_22><loc_62><loc_42><loc_64></location>5.3 Qualitative Results</subtitle-level-1>
|
||||
|
File diff suppressed because one or more lines are too long
@ -130,14 +130,13 @@ We have chosen the PubTabNet data set to perform HPO, since it includes a highly
|
||||
|
||||
Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.
|
||||
|
||||
| # | # | Language | TEDs | TEDs | TEDs | mAP | Inference |
|
||||
|------------|------------|------------|-------------|-------------|-------------|-------------|-------------|
|
||||
| enc-layers | dec-layers | Language | simple | complex | all | (0.75) | time (secs) |
|
||||
| 6 | 6 | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
|
||||
| 4 | 4 | OTSL HTML | 0.938 0.952 | 0.904 | 0.927 | 0.853 | 1.97 |
|
||||
| 2 | 4 | OTSL | 0.923 0.945 | 0.909 0.897 | 0.938 | 0.843 | 3.77 |
|
||||
| | | HTML | | 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81 |
|
||||
| 4 | 2 | OTSL HTML | 0.952 0.944 | 0.92 0.903 | 0.942 0.931 | 0.857 0.824 | 1.22 2 |
|
||||
| # enc-layers | # dec-layers | Language | TEDs | TEDs | TEDs | mAP | Inference |
|
||||
|----------------|----------------|------------|-------------|-------------|-------------|-------------|-------------|
|
||||
| # enc-layers | # dec-layers | Language | simple | complex | all | (0.75) | time (secs) |
|
||||
| 6 | 6 | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
|
||||
| 4 | 4 | OTSL HTML | 0.938 0.952 | 0.904 0.909 | 0.927 0.938 | 0.853 0.843 | 1.97 3.77 |
|
||||
| 2 | 4 | OTSL HTML | 0.923 0.945 | 0.897 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81 |
|
||||
| 4 | 2 | OTSL HTML | 0.952 0.944 | 0.92 0.903 | 0.942 0.931 | 0.857 0.824 | 1.22 2 |
|
||||
|
||||
## 5.2 Quantitative Results
|
||||
|
||||
@ -147,15 +146,12 @@ Additionally, the results show that OTSL has an advantage over HTML when applied
|
||||
|
||||
Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using TableFormer [9] (with enc=6, dec=6, heads=8).
|
||||
|
||||
| | Language | TEDs | TEDs | TEDs | mAP(0.75) | Inference time (secs) |
|
||||
|--------------|------------|--------|---------|--------|-------------|-------------------------|
|
||||
| | Language | simple | complex | all | mAP(0.75) | Inference time (secs) |
|
||||
| PubTabNet | OTSL | 0.965 | 0.934 | 0.955 | 0.88 | 2.73 |
|
||||
| PubTabNet | HTML | 0.969 | 0.927 | 0.955 | 0.857 | 5.39 |
|
||||
| FinTabNet | OTSL | 0.955 | 0.961 | 0.959 | 0.862 | 1.85 |
|
||||
| FinTabNet | HTML | 0.917 | 0.922 | 0.92 | 0.722 | 3.26 |
|
||||
| PubTables-1M | OTSL | 0.987 | 0.964 | 0.977 | 0.896 | 1.79 |
|
||||
| PubTables-1M | HTML | 0.983 | 0.944 | 0.966 | 0.889 | 3.26 |
|
||||
| Data set | Language | TEDs | TEDs | TEDs | mAP(0.75) | Inference time (secs) |
|
||||
|--------------|------------|-------------|-------------|-------------|-------------|-------------------------|
|
||||
| Data set | Language | simple | complex | all | mAP(0.75) | Inference time (secs) |
|
||||
| PubTabNet | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
|
||||
| FinTabNet | OTSL HTML | 0.955 0.917 | 0.961 0.922 | 0.959 0.92 | 0.862 0.722 | 1.85 3.26 |
|
||||
| PubTables-1M | OTSL HTML | 0.987 0.983 | 0.964 0.944 | 0.977 0.966 | 0.896 0.889 | 1.79 3.26 |
|
||||
|
||||
## 5.3 Qualitative Results
|
||||
|
||||
|
File diff suppressed because one or more lines are too long
@ -130,7 +130,7 @@
|
||||
<table>
|
||||
<location><page_9><loc_11><loc_9><loc_89><loc_50></location>
|
||||
<caption>Table 2-2 Comparison of the different function usage IDs and *JOBCTL authority</caption>
|
||||
<row_0><col_0><row_header>User action</col_0><col_1><body>*JOBCTL</col_1><col_2><body>QIBM_DB_SECADM</col_2><col_3><body>QIBM_DB_SQLADM</col_3><col_4><body>QIBM_DB_SYSMON</col_4><col_5><body>No Authority</col_5></row_0>
|
||||
<row_0><col_0><body>User action</col_0><col_1><col_header>*JOBCTL</col_1><col_2><col_header>QIBM_DB_SECADM</col_2><col_3><col_header>QIBM_DB_SQLADM</col_3><col_4><col_header>QIBM_DB_SYSMON</col_4><col_5><col_header>No Authority</col_5></row_0>
|
||||
<row_1><col_0><row_header>SET CURRENT DEGREE (SQL statement)</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_1>
|
||||
<row_2><col_0><row_header>CHGQRYA command targeting a different user’s job</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_2>
|
||||
<row_3><col_0><row_header>STRDBMON or ENDDBMON commands targeting a different user’s job</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_3>
|
||||
|
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
@ -8,13 +8,13 @@
|
||||
<section_header_level_1><loc_41><loc_341><loc_104><loc_348>1. Introduction</section_header_level_1>
|
||||
<text><loc_41><loc_354><loc_234><loc_450>The occurrence of tables in documents is ubiquitous. They often summarise quantitative or factual data, which is cumbersome to describe in verbose text but nevertheless extremely valuable. Unfortunately, this compact representation is often not easy to parse by machines. There are many implicit conventions used to obtain a compact table representation. For example, tables often have complex columnand row-headers in order to reduce duplicated cell content. Lines of different shapes and sizes are leveraged to separate content or indicate a tree structure. Additionally, tables can also have empty/missing table-entries or multi-row textual table-entries. Fig. 1 shows a table which presents all these issues.</text>
|
||||
<picture><loc_258><loc_144><loc_439><loc_191></picture>
|
||||
<otsl><loc_258><loc_144><loc_439><loc_191><ched>3<ched>1<nl></otsl>
|
||||
<otsl><loc_258><loc_144><loc_439><loc_191><ched>1<nl></otsl>
|
||||
<unordered_list><list_item><loc_258><loc_198><loc_397><loc_210>b. Red-annotation of bounding boxes, Blue-predictions by TableFormer</list_item>
|
||||
<list_item><loc_258><loc_265><loc_401><loc_271>c. Structure predicted by TableFormer:</list_item>
|
||||
</unordered_list>
|
||||
<picture><loc_257><loc_213><loc_441><loc_259></picture>
|
||||
<picture><loc_258><loc_274><loc_439><loc_313><caption><loc_252><loc_325><loc_445><loc_353>Figure 1: Picture of a table with subtle, complex features such as (1) multi-column headers, (2) cell with multi-row text and (3) cells with no content. Image from PubTabNet evaluation set, filename: 'PMC2944238 004 02'.</caption></picture>
|
||||
<otsl><loc_258><loc_274><loc_439><loc_313><ched>0<ched>1<lcel><ched>2 1<lcel><ecel><nl><fcel>3<fcel>4<fcel>5 3<fcel>6<fcel>7<ecel><nl><fcel>8<fcel>9<fcel>10<fcel>11<fcel>12<fcel>2<nl><ecel><fcel>13<fcel>14<fcel>15<fcel>16<ucel><nl><ecel><fcel>17<fcel>18<fcel>19<fcel>20<ucel><nl></otsl>
|
||||
<otsl><loc_258><loc_274><loc_439><loc_313><fcel>0<fcel>1 2 1<lcel><lcel><lcel><nl><fcel>3<fcel>4 3<fcel>5<fcel>6<fcel>7<nl><fcel>8 2<fcel>9<fcel>10<fcel>11<fcel>12<nl><fcel>13<ecel><fcel>14<fcel>15<fcel>16<nl><fcel>17<fcel>18<ecel><fcel>19<fcel>20<nl></otsl>
|
||||
<text><loc_252><loc_369><loc_445><loc_420>Recently, significant progress has been made with vision based approaches to extract tables in documents. For the sake of completeness, the issue of table extraction from documents is typically decomposed into two separate challenges, i.e. (1) finding the location of the table(s) on a document-page and (2) finding the structure of a given table in the document.</text>
|
||||
<text><loc_252><loc_422><loc_445><loc_450>The first problem is called table-location and has been previously addressed [30, 38, 19, 21, 23, 26, 8] with stateof-the-art object-detection networks (e.g. YOLO and later on Mask-RCNN [9]). For all practical purposes, it can be</text>
|
||||
<page_footer><loc_241><loc_463><loc_245><loc_469>1</page_footer>
|
||||
@ -102,7 +102,7 @@
|
||||
<text><loc_41><loc_374><loc_234><loc_387>Table 2: Structure results on PubTabNet (PTN), FinTabNet (FTN), TableBank (TB) and SynthTabNet (STN).</text>
|
||||
<text><loc_41><loc_389><loc_214><loc_395>FT: Model was trained on PubTabNet then finetuned.</text>
|
||||
<text><loc_41><loc_407><loc_234><loc_450><loc_41><loc_407><loc_234><loc_450>Cell Detection. Like any object detector, our Cell BBox Detector provides bounding boxes that can be improved with post-processing during inference. We make use of the grid-like structure of tables to refine the predictions. A detailed explanation on the post-processing is available in the supplementary material. As shown in Tab. 3, we evaluate our Cell BBox Decoder accuracy for cells with a class label of 'content' only using the PASCAL VOC mAP metric for pre-processing and post-processing. Note that we do not have post-processing results for SynthTabNet as images are only provided. To compare the performance of our proposed approach, we've integrated TableFormer's Cell BBox Decoder into EDD architecture. As mentioned previously, the Structure Decoder provides the Cell BBox Decoder with the features needed to predict the bounding box predictions. Therefore, the accuracy of the Structure Decoder directly influences the accuracy of the Cell BBox Decoder . If the Structure Decoder predicts an extra column, this will result in an extra column of predicted bounding boxes.</text>
|
||||
<otsl><loc_252><loc_156><loc_436><loc_192><ched>Model<ched>Dataset<ched>mAP<ched>mAP (PP)<nl><fcel>EDD+BBox<fcel>PubTabNet<fcel>79.2<fcel>82.7<nl><fcel>TableFormer<fcel>PubTabNet<fcel>82.1<fcel>86.8<nl><fcel>TableFormer<fcel>SynthTabNet<fcel>87.7<fcel>-<nl><caption><loc_252><loc_200><loc_445><loc_213>Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Post-processing.</caption></otsl>
|
||||
<otsl><loc_252><loc_156><loc_436><loc_192><ched>Model<ched>Dataset<ched>mAP<ched>mAP (PP)<nl><rhed>EDD+BBox<fcel>PubTabNet<fcel>79.2<fcel>82.7<nl><rhed>TableFormer<fcel>PubTabNet<fcel>82.1<fcel>86.8<nl><rhed>TableFormer<fcel>SynthTabNet<fcel>87.7<fcel>-<nl><caption><loc_252><loc_200><loc_445><loc_213>Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Post-processing.</caption></otsl>
|
||||
<text><loc_252><loc_232><loc_445><loc_328>Cell Content. In this section, we evaluate the entire pipeline of recovering a table with content. Here we put our approach to test by capitalizing on extracting content from the PDF cells rather than decoding from images. Tab. 4 shows the TEDs score of HTML code representing the structure of the table along with the content inserted in the data cell and compared with the ground-truth. Our method achieved a 5.3% increase over the state-of-the-art, and commercial solutions. We believe our scores would be higher if the HTML ground-truth matched the extracted PDF cell content. Unfortunately, there are small discrepancies such as spacings around words or special characters with various unicode representations.</text>
|
||||
<otsl><loc_272><loc_341><loc_426><loc_406><fcel>Model<ched>Simple<ched>TEDS Complex<ched>All<nl><rhed>Tabula<fcel>78.0<fcel>57.8<fcel>67.9<nl><rhed>Traprange<fcel>60.8<fcel>49.9<fcel>55.4<nl><rhed>Camelot<fcel>80.0<fcel>66.0<fcel>73.0<nl><rhed>Acrobat Pro<fcel>68.9<fcel>61.8<fcel>65.3<nl><rhed>EDD<fcel>91.2<fcel>85.4<fcel>88.3<nl><rhed>TableFormer<fcel>95.4<fcel>90.1<fcel>93.6<nl><caption><loc_252><loc_415><loc_445><loc_435>Table 4: Results of structure with content retrieved using cell detection on PubTabNet. In all cases the input is PDF documents with cropped tables.</caption></otsl>
|
||||
<page_footer><loc_241><loc_463><loc_245><loc_469>7</page_footer>
|
||||
@ -114,7 +114,7 @@
|
||||
<section_header_level_1><loc_249><loc_60><loc_352><loc_64>Example table from FinTabNet:</section_header_level_1>
|
||||
<picture><loc_41><loc_65><loc_246><loc_118></picture>
|
||||
<picture><loc_250><loc_62><loc_453><loc_114><caption><loc_44><loc_131><loc_315><loc_136>b. Structure predicted by TableFormer, with superimposed matched PDF cell text:</caption></picture>
|
||||
<otsl><loc_44><loc_138><loc_244><loc_185><ecel><ecel><ched>論文ファイル<lcel><ched>参考文献<lcel><nl><ched>出典<ched>ファイル 数<ched>英語<ched>日本語<ched>英語<ched>日本語<nl><rhed>Association for Computational Linguistics(ACL2003)<fcel>65<fcel>65<fcel>0<fcel>150<fcel>0<nl><rhed>Computational Linguistics(COLING2002)<fcel>140<fcel>140<fcel>0<fcel>150<fcel>0<nl><rhed>電気情報通信学会 2003 年総合大会<fcel>150<fcel>8<fcel>142<fcel>223<fcel>147<nl><rhed>情報処理学会第 65 回全国大会 (2003)<fcel>177<fcel>1<fcel>176<fcel>150<fcel>236<nl><rhed>第 17 回人工知能学会全国大会 (2003)<fcel>208<fcel>5<fcel>203<fcel>152<fcel>244<nl><rhed>自然言語処理研究会第 146 〜 155 回<fcel>98<fcel>2<fcel>96<fcel>150<fcel>232<nl><rhed>WWW から収集した論文<fcel>107<fcel>73<fcel>34<fcel>147<fcel>96<nl><ecel><fcel>945<fcel>294<fcel>651<fcel>1122<fcel>955<nl><caption><loc_311><loc_185><loc_449><loc_189>Text is aligned to match original for ease of viewing</caption></otsl>
|
||||
<otsl><loc_44><loc_138><loc_244><loc_185><ecel><ecel><ched>論文ファイル<lcel><ched>参考文献<lcel><nl><ched>出典<ched>ファイル 数<ched>英語<ched>日本語<ched>英語<ched>日本語<nl><rhed>Association for Computational Linguistics(ACL2003)<fcel>65<fcel>65<fcel>0<fcel>150<fcel>0<nl><rhed>Computational Linguistics(COLING2002)<fcel>140<fcel>140<fcel>0<fcel>150<fcel>0<nl><rhed>電気情報通信学会 2003 年総合大会<fcel>150<fcel>8<fcel>142<fcel>223<fcel>147<nl><rhed>情報処理学会第 65 回全国大会 (2003)<fcel>177<fcel>1<fcel>176<fcel>150<fcel>236<nl><rhed>第 17 回人工知能学会全国大会 (2003)<fcel>208<fcel>5<fcel>203<fcel>152<fcel>244<nl><rhed>自然言語処理研究会第 146 〜 155 回<fcel>98<fcel>2<fcel>96<fcel>150<fcel>232<nl><rhed>WWW から収集した論文<fcel>107<fcel>73<fcel>34<fcel>147<fcel>96<nl><rhed>計<fcel>945<fcel>294<fcel>651<fcel>1122<fcel>955<nl><caption><loc_311><loc_185><loc_449><loc_189>Text is aligned to match original for ease of viewing</caption></otsl>
|
||||
<otsl><loc_249><loc_138><loc_450><loc_182><ecel><ched>Shares (in millions)<lcel><ched>Weighted Average Grant Date Fair Value<lcel><nl><ecel><ched>RS U s<ched>PSUs<ched>RSUs<ched>PSUs<nl><rhed>Nonvested on Janua ry 1<fcel>1. 1<fcel>0.3<fcel>90.10 $<fcel>$ 91.19<nl><rhed>Granted<fcel>0. 5<fcel>0.1<fcel>117.44<fcel>122.41<nl><rhed>Vested<fcel>(0. 5 )<fcel>(0.1)<fcel>87.08<fcel>81.14<nl><rhed>Canceled or forfeited<fcel>(0. 1 )<fcel>-<fcel>102.01<fcel>92.18<nl><rhed>Nonvested on December 31<fcel>1.0<fcel>0.3<fcel>104.85 $<fcel>$ 104.51<nl></otsl>
|
||||
<picture><loc_42><loc_240><loc_173><loc_280><caption><loc_51><loc_290><loc_435><loc_295>Figure 6: An example of TableFormer predictions (bounding boxes and structure) from generated SynthTabNet table.</caption></picture>
|
||||
<picture><loc_177><loc_240><loc_307><loc_280><caption><loc_41><loc_203><loc_445><loc_231>Figure 5: One of the benefits of TableFormer is that it is language agnostic, as an example, the left part of the illustration demonstrates TableFormer predictions on previously unseen language (Japanese). Additionally, we see that TableFormer is robust to variability in style and content, right side of the illustration shows the example of the TableFormer prediction from the FinTabNet dataset.</caption></picture>
|
||||
|
File diff suppressed because one or more lines are too long
@ -25,12 +25,12 @@ Figure 1: Picture of a table with subtle, complex features such as (1) multi-col
|
||||
|
||||
<!-- image -->
|
||||
|
||||
| 0 | 1 | 1 | 2 1 | 2 1 | |
|
||||
|-----|-----|-----|-------|-------|----|
|
||||
| 3 | 4 | 5 3 | 6 | 7 | |
|
||||
| 8 | 9 | 10 | 11 | 12 | 2 |
|
||||
| | 13 | 14 | 15 | 16 | 2 |
|
||||
| | 17 | 18 | 19 | 20 | 2 |
|
||||
| 0 | 1 2 1 | 1 2 1 | 1 2 1 | 1 2 1 |
|
||||
|-----|---------|---------|---------|---------|
|
||||
| 3 | 4 3 | 5 | 6 | 7 |
|
||||
| 8 2 | 9 | 10 | 11 | 12 |
|
||||
| 13 | | 14 | 15 | 16 |
|
||||
| 17 | 18 | | 19 | 20 |
|
||||
|
||||
Recently, significant progress has been made with vision based approaches to extract tables in documents. For the sake of completeness, the issue of table extraction from documents is typically decomposed into two separate challenges, i.e. (1) finding the location of the table(s) on a document-page and (2) finding the structure of a given table in the document.
|
||||
|
||||
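The changed table in the hunk above appears to be the Markdown rendering of the Figure 1 OTSL string from the DocTags fixture earlier in this commit. The following is a minimal sketch of how such a token string could be expanded into a dense grid; it is not Docling's own exporter. The function name `otsl_to_grid` is hypothetical, and the merge rules are an assumption inferred from the paired fixtures: `fcel`/`ched`/`rhed` carry text, `ecel` is empty, `lcel` repeats the cell to its left, `ucel` repeats the cell above, and `nl` ends a row.

```python
import re

# Token names are taken from the OTSL strings in this diff. The merge
# semantics are an assumption inferred from the paired Markdown tables.
CONTENT_TOKENS = {"fcel", "ched", "rhed"}  # cells that carry their own text
TOKEN_RE = re.compile(r"<(fcel|ched|rhed|ecel|lcel|ucel|nl)>([^<]*)")


def otsl_to_grid(otsl: str) -> list[list[str]]:
    """Expand an OTSL token string into a dense grid of cell texts."""
    grid: list[list[str]] = []
    row: list[str] = []
    for tag, text in TOKEN_RE.findall(otsl):
        if tag == "nl":
            # End of the current table row.
            grid.append(row)
            row = []
        elif tag in CONTENT_TOKENS:
            row.append(text.strip())
        elif tag == "ecel":
            row.append("")
        elif tag == "lcel":
            # Horizontal span: repeat the text of the cell to the left.
            row.append(row[-1] if row else "")
        elif tag == "ucel":
            # Vertical span: repeat the text of the cell directly above.
            col = len(row)
            above = grid[-1][col] if grid and col < len(grid[-1]) else ""
            row.append(above)
    if row:
        # Tolerate a final row that is not terminated by <nl>.
        grid.append(row)
    return grid


# First two rows of the updated Figure 1 structure from the fixture.
sample = (
    "<otsl><loc_258><loc_274><loc_439><loc_313>"
    "<fcel>0<fcel>1 2 1<lcel><lcel><lcel><nl>"
    "<fcel>3<fcel>4 3<fcel>5<fcel>6<fcel>7<nl></otsl>"
)
for cells in otsl_to_grid(sample):
    print("| " + " | ".join(cells) + " |")
```

Location tokens such as `<loc_258>` and the closing `</otsl>` are skipped because the pattern only matches the seven cell and row tokens. Repeating the spanned text in every covered column mirrors what the Markdown fixtures show (for example `1 2 1` appearing four times in the new header row). A consumer that needs real row and column spans would have to track them instead of copying text.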
@ -247,7 +247,7 @@ Text is aligned to match original for ease of viewing
|
||||
| 第 17 回人工知能学会全国大会 (2003) | 208 | 5 | 203 | 152 | 244 |
|
||||
| 自然言語処理研究会第 146 〜 155 回 | 98 | 2 | 96 | 150 | 232 |
|
||||
| WWW から収集した論文 | 107 | 73 | 34 | 147 | 96 |
|
||||
| | 945 | 294 | 651 | 1122 | 955 |
|
||||
| 計 | 945 | 294 | 651 | 1122 | 955 |
|
||||
|
||||
| | Shares (in millions) | Shares (in millions) | Weighted Average Grant Date Fair Value | Weighted Average Grant Date Fair Value |
|
||||
|--------------------------|------------------------|------------------------|------------------------------------------|------------------------------------------|
|
||||
|
File diff suppressed because one or more lines are too long
@ -58,7 +58,7 @@
|
||||
<text><loc_260><loc_399><loc_457><loc_446>The annotation campaign was carried out in four phases. In phase one, we identified and prepared the data sources for annotation. In phase two, we determined the class labels and how annotations should be done on the documents in order to obtain maximum consistency. The latter was guided by a detailed requirement analysis and exhaustive experiments. In phase three, we trained the annotation staff and performed exams for quality assurance. In phase four,</text>
|
||||
<page_break>
|
||||
<page_header><loc_44><loc_38><loc_456><loc_43>KDD '22, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar</page_header>
|
||||
<otsl><loc_81><loc_87><loc_419><loc_186><ecel><ecel><ched>% of Total<lcel><lcel><lcel><ched>triple inter-annotator mAP @ 0.5-0.95 (%)<lcel><lcel><lcel><lcel><lcel><nl><ched>class label<ched>Count<ched>Train<ched>Test<ched>Val<ched>All<ched>Fin<ched>Man<ched>Sci<ched>Law<ched>Pat<ched>Ten<nl><rhed>Caption<fcel>22524<fcel>2.04<fcel>1.77<fcel>2.32<fcel>84-89<fcel>40-61<fcel>86-92<fcel>94-99<fcel>95-99<fcel>69-78<fcel>n/a<nl><rhed>Footnote<fcel>6318<fcel>0.60<fcel>0.31<fcel>0.58<fcel>83-91<fcel>n/a<fcel>100<fcel>62-88<fcel>85-94<fcel>n/a<fcel>82-97<nl><rhed>Formula<fcel>25027<fcel>2.25<fcel>1.90<fcel>2.96<fcel>83-85<fcel>n/a<fcel>n/a<fcel>84-87<fcel>86-96<fcel>n/a<fcel>n/a<nl><rhed>List-item<fcel>185660<fcel>17.19<fcel>13.34<fcel>15.82<fcel>87-88<fcel>74-83<fcel>90-92<fcel>97-97<fcel>81-85<fcel>75-88<fcel>93-95<nl><rhed>Page-footer<fcel>70878<fcel>6.51<fcel>5.58<fcel>6.00<fcel>93-94<fcel>88-90<fcel>95-96<fcel>100<fcel>92-97<fcel>100<fcel>96-98<nl><rhed>Page-header<fcel>58022<fcel>5.10<fcel>6.70<fcel>5.06<fcel>85-89<fcel>66-76<fcel>90-94<fcel>98-100<fcel>91-92<fcel>97-99<fcel>81-86<nl><rhed>Picture<fcel>45976<fcel>4.21<fcel>2.78<fcel>5.31<fcel>69-71<fcel>56-59<fcel>82-86<fcel>69-82<fcel>80-95<fcel>66-71<fcel>59-76<nl><rhed>Section-header<fcel>142884<fcel>12.60<fcel>15.77<fcel>12.85<fcel>83-84<fcel>76-81<fcel>90-92<fcel>94-95<fcel>87-94<fcel>69-73<fcel>78-86<nl><rhed>Table<fcel>34733<fcel>3.20<fcel>2.27<fcel>3.60<fcel>77-81<fcel>75-80<fcel>83-86<fcel>98-99<fcel>58-80<fcel>79-84<fcel>70-85<nl><rhed>Text<fcel>510377<fcel>45.82<fcel>49.28<fcel>45.00<fcel>84-86<fcel>81-86<fcel>88-93<fcel>89-93<fcel>87-92<fcel>71-79<fcel>87-95<nl><rhed>Title<fcel>5071<fcel>0.47<fcel>0.30<fcel>0.50<fcel>60-72<fcel>24-63<fcel>50-63<fcel>94-100<fcel>82-96<fcel>68-79<fcel>24-56<nl><rhed>Total<fcel>1107470<fcel>941123<fcel>99816<fcel>66531<fcel>82-83<fcel>71-74<fcel>79-81<fcel>89-94<fcel>86-91<fcel>71-76<fcel>68-85<nl><caption><loc_44><loc_54><loc_456><loc_73>Table 1: DocLayNet dataset overview. 
Along with the frequency of each class label, we present the relative occurrence (as % of row "Total") in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.</caption></otsl>
|
||||
<otsl><loc_81><loc_87><loc_419><loc_186><ecel><ecel><ched>% of Total<lcel><lcel><ched>triple inter-annotator mAP @ 0.5-0.95 (%)<lcel><lcel><lcel><lcel><lcel><lcel><nl><ched>class label<ched>Count<ched>Train<ched>Test<ched>Val<ched>All<ched>Fin<ched>Man<ched>Sci<ched>Law<ched>Pat<ched>Ten<nl><rhed>Caption<fcel>22524<fcel>2.04<fcel>1.77<fcel>2.32<fcel>84-89<fcel>40-61<fcel>86-92<fcel>94-99<fcel>95-99<fcel>69-78<fcel>n/a<nl><rhed>Footnote<fcel>6318<fcel>0.60<fcel>0.31<fcel>0.58<fcel>83-91<fcel>n/a<fcel>100<fcel>62-88<fcel>85-94<fcel>n/a<fcel>82-97<nl><rhed>Formula<fcel>25027<fcel>2.25<fcel>1.90<fcel>2.96<fcel>83-85<fcel>n/a<fcel>n/a<fcel>84-87<fcel>86-96<fcel>n/a<fcel>n/a<nl><rhed>List-item<fcel>185660<fcel>17.19<fcel>13.34<fcel>15.82<fcel>87-88<fcel>74-83<fcel>90-92<fcel>97-97<fcel>81-85<fcel>75-88<fcel>93-95<nl><rhed>Page-footer<fcel>70878<fcel>6.51<fcel>5.58<fcel>6.00<fcel>93-94<fcel>88-90<fcel>95-96<fcel>100<fcel>92-97<fcel>100<fcel>96-98<nl><rhed>Page-header<fcel>58022<fcel>5.10<fcel>6.70<fcel>5.06<fcel>85-89<fcel>66-76<fcel>90-94<fcel>98-100<fcel>91-92<fcel>97-99<fcel>81-86<nl><rhed>Picture<fcel>45976<fcel>4.21<fcel>2.78<fcel>5.31<fcel>69-71<fcel>56-59<fcel>82-86<fcel>69-82<fcel>80-95<fcel>66-71<fcel>59-76<nl><rhed>Section-header<fcel>142884<fcel>12.60<fcel>15.77<fcel>12.85<fcel>83-84<fcel>76-81<fcel>90-92<fcel>94-95<fcel>87-94<fcel>69-73<fcel>78-86<nl><rhed>Table<fcel>34733<fcel>3.20<fcel>2.27<fcel>3.60<fcel>77-81<fcel>75-80<fcel>83-86<fcel>98-99<fcel>58-80<fcel>79-84<fcel>70-85<nl><rhed>Text<fcel>510377<fcel>45.82<fcel>49.28<fcel>45.00<fcel>84-86<fcel>81-86<fcel>88-93<fcel>89-93<fcel>87-92<fcel>71-79<fcel>87-95<nl><rhed>Title<fcel>5071<fcel>0.47<fcel>0.30<fcel>0.50<fcel>60-72<fcel>24-63<fcel>50-63<fcel>94-100<fcel>82-96<fcel>68-79<fcel>24-56<nl><rhed>Total<fcel>1107470<fcel>941123<fcel>99816<fcel>66531<fcel>82-83<fcel>71-74<fcel>79-81<fcel>89-94<fcel>86-91<fcel>71-76<fcel>68-85<nl><caption><loc_44><loc_54><loc_456><loc_73>Table 1: DocLayNet dataset overview. 
Along with the frequency of each class label, we present the relative occurrence (as % of row "Total") in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.</caption></otsl>
|
||||
<picture><loc_43><loc_196><loc_242><loc_341><caption><loc_44><loc_350><loc_242><loc_383>Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.</caption></picture>
|
||||
<text><loc_44><loc_400><loc_240><loc_426>we distributed the annotation workload and performed continuous quality controls. Phase one and two required a small team of experts only. For phases three and four, a group of 40 dedicated annotators were assembled and supervised.</text>
|
||||
<text><loc_44><loc_428><loc_241><loc_447><loc_44><loc_428><loc_241><loc_447>Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. A large effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv$^{3}$, government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.</text>
|
||||
@ -88,7 +88,7 @@
|
||||
<page_break>
|
||||
<page_header><loc_44><loc_38><loc_456><loc_43>KDD '22, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar</page_header>
|
||||
<text><loc_44><loc_55><loc_242><loc_116>Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on DocLayNet test set. The MRCNN (Mask R-CNN) and FRCNN (Faster R-CNN) models with ResNet-50 or ResNet-101 backbone were trained based on the network architectures from the detectron2 model zoo (Mask R-CNN R50, R101-FPN 3x, Faster R-CNN R101-FPN 3x), with default configurations. The YOLO implementation utilized was YOLOv5x6 [13]. All models were initialised using pre-trained weights from the COCO 2017 dataset.</text>
|
||||
<otsl><loc_51><loc_124><loc_233><loc_222><ecel><ched>human<ched>MRCNN<lcel><ched>FRCNN<ched>YOLO<nl><ecel><ucel><ched>R50<ched>R101<ched>R101<ched>v5x6<nl><rhed>Caption<fcel>84-89<fcel>68.4<fcel>71.5<fcel>70.1<fcel>77.7<nl><rhed>Footnote<fcel>83-91<fcel>70.9<fcel>71.8<fcel>73.7<fcel>77.2<nl><rhed>Formula<fcel>83-85<fcel>60.1<fcel>63.4<fcel>63.5<fcel>66.2<nl><rhed>List-item<fcel>87-88<fcel>81.2<fcel>80.8<fcel>81.0<fcel>86.2<nl><rhed>Page-footer<fcel>93-94<fcel>61.6<fcel>59.3<fcel>58.9<fcel>61.1<nl><rhed>Page-header<fcel>85-89<fcel>71.9<fcel>70.0<fcel>72.0<fcel>67.9<nl><rhed>Picture<fcel>69-71<fcel>71.7<fcel>72.7<fcel>72.0<fcel>77.1<nl><rhed>Section-header<fcel>83-84<fcel>67.6<fcel>69.3<fcel>68.4<fcel>74.6<nl><rhed>Table<fcel>77-81<fcel>82.2<fcel>82.9<fcel>82.2<fcel>86.3<nl><rhed>Text<fcel>84-86<fcel>84.6<fcel>85.8<fcel>85.4<fcel>88.1<nl><rhed>Title<fcel>60-72<fcel>76.7<fcel>80.4<fcel>79.9<fcel>82.7<nl><rhed>All<fcel>82-83<fcel>72.4<fcel>73.5<fcel>73.4<fcel>76.8<nl></otsl>
|
||||
<otsl><loc_51><loc_124><loc_233><loc_222><ecel><ched>human<ched>MRCNN<lcel><ched>FRCNN<ched>YOLO<nl><ecel><ecel><ched>R50<ched>R101<ched>R101<ched>v5x6<nl><rhed>Caption<fcel>84-89<fcel>68.4<fcel>71.5<fcel>70.1<fcel>77.7<nl><rhed>Footnote<fcel>83-91<fcel>70.9<fcel>71.8<fcel>73.7<fcel>77.2<nl><rhed>Formula<fcel>83-85<fcel>60.1<fcel>63.4<fcel>63.5<fcel>66.2<nl><rhed>List-item<fcel>87-88<fcel>81.2<fcel>80.8<fcel>81.0<fcel>86.2<nl><rhed>Page-footer<fcel>93-94<fcel>61.6<fcel>59.3<fcel>58.9<fcel>61.1<nl><rhed>Page-header<fcel>85-89<fcel>71.9<fcel>70.0<fcel>72.0<fcel>67.9<nl><rhed>Picture<fcel>69-71<fcel>71.7<fcel>72.7<fcel>72.0<fcel>77.1<nl><rhed>Section-header<fcel>83-84<fcel>67.6<fcel>69.3<fcel>68.4<fcel>74.6<nl><rhed>Table<fcel>77-81<fcel>82.2<fcel>82.9<fcel>82.2<fcel>86.3<nl><rhed>Text<fcel>84-86<fcel>84.6<fcel>85.8<fcel>85.4<fcel>88.1<nl><rhed>Title<fcel>60-72<fcel>76.7<fcel>80.4<fcel>79.9<fcel>82.7<nl><rhed>All<fcel>82-83<fcel>72.4<fcel>73.5<fcel>73.4<fcel>76.8<nl></otsl>
|
||||
<text><loc_44><loc_234><loc_241><loc_364>to avoid this at any cost in order to have clear, unbiased baseline numbers for human document-layout annotation. Third, we introduced the feature of snapping boxes around text segments to obtain a pixel-accurate annotation and again reduce time and effort. The CCS annotation tool automatically shrinks every user-drawn box to the minimum bounding-box around the enclosed text-cells for all purely text-based segments, which excludes only Table and Picture . For the latter, we instructed annotation staff to minimise inclusion of surrounding whitespace while including all graphical lines. A downside of snapping boxes to enclosed text cells is that some wrongly parsed PDF pages cannot be annotated correctly and need to be skipped. Fourth, we established a way to flag pages as rejected for cases where no valid annotation according to the label guidelines could be achieved. Example cases for this would be PDF pages that render incorrectly or contain layouts that are impossible to capture with non-overlapping rectangles. Such rejected pages are not contained in the final dataset. With all these measures in place, experienced annotation staff managed to annotate a single page in a typical timeframe of 20s to 60s, depending on its complexity.</text>
|
||||
<section_header_level_1><loc_44><loc_371><loc_120><loc_378>5 EXPERIMENTS</section_header_level_1>
|
||||
<text><loc_44><loc_387><loc_241><loc_448>The primary goal of DocLayNet is to obtain high-quality ML models capable of accurate document-layout analysis on a wide variety of challenging layouts. As discussed in Section 2, object detection models are currently the easiest to use, due to the standardisation of ground-truth data in COCO format [16] and the availability of general frameworks such as detectron2 [17]. Furthermore, baseline numbers in PubLayNet and DocBank were obtained using standard object detection models such as Mask R-CNN and Faster R-CNN. As such, we will relate to these object detection methods in this</text>
|
||||
@ -101,7 +101,7 @@
|
||||
<page_header><loc_44><loc_38><loc_284><loc_43>DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis</page_header>
|
||||
<page_header><loc_299><loc_38><loc_456><loc_43>KDD ’22, August 14-18, 2022, Washington, DC, USA</page_header>
|
||||
<text><loc_44><loc_55><loc_242><loc_81>Table 3: Performance of a Mask R-CNN R50 network in mAP@0.5-0.95 scores trained on DocLayNet with different class label sets. The reduced label sets were obtained by either down-mapping or dropping labels.</text>
|
||||
<otsl><loc_66><loc_95><loc_218><loc_187><ched>Class-count<ched>11<ched>6<ched>5<ched>4<nl><rhed>Caption<fcel>68<fcel>Text<fcel>Text<fcel>Text<nl><rhed>Footnote<fcel>71<fcel>Text<fcel>Text<fcel>Text<nl><rhed>Formula<fcel>60<fcel>Text<fcel>Text<fcel>Text<nl><rhed>List-item<fcel>81<fcel>Text<fcel>82<fcel>Text<nl><rhed>Page-footer<fcel>62<fcel>62<fcel>-<fcel>-<nl><rhed>Page-header<fcel>72<fcel>68<fcel>-<fcel>-<nl><rhed>Picture<fcel>72<fcel>72<fcel>72<fcel>72<nl><rhed>Section-header<fcel>68<fcel>67<fcel>69<fcel>68<nl><rhed>Table<fcel>82<fcel>83<fcel>82<fcel>82<nl><rhed>Text<fcel>85<fcel>84<fcel>84<fcel>84<nl><rhed>Title<fcel>77<fcel>Sec.-h.<fcel>Sec.-h.<fcel>Sec.-h.<nl><rhed>Overall<fcel>72<fcel>73<fcel>78<fcel>77<nl></otsl>
|
||||
<otsl><loc_66><loc_95><loc_218><loc_187><fcel>Class-count<ched>11<ched>6<ched>5<ched>4<nl><rhed>Caption<fcel>68<fcel>Text<fcel>Text<fcel>Text<nl><rhed>Footnote<fcel>71<fcel>Text<fcel>Text<fcel>Text<nl><rhed>Formula<fcel>60<fcel>Text<fcel>Text<fcel>Text<nl><rhed>List-item<fcel>81<fcel>Text<fcel>82<fcel>Text<nl><rhed>Page-footer<fcel>62<fcel>62<fcel>-<fcel>-<nl><rhed>Page-header<fcel>72<fcel>68<fcel>-<fcel>-<nl><rhed>Picture<fcel>72<fcel>72<fcel>72<fcel>72<nl><rhed>Section-header<fcel>68<fcel>67<fcel>69<fcel>68<nl><rhed>Table<fcel>82<fcel>83<fcel>82<fcel>82<nl><rhed>Text<fcel>85<fcel>84<fcel>84<fcel>84<nl><rhed>Title<fcel>77<fcel>Sec.-h.<fcel>Sec.-h.<fcel>Sec.-h.<nl><rhed>Overall<fcel>72<fcel>73<fcel>78<fcel>77<nl></otsl>
|
||||
<section_header_level_1><loc_44><loc_202><loc_107><loc_208>Learning Curve</section_header_level_1>
|
||||
<text><loc_43><loc_211><loc_241><loc_334>One of the fundamental questions related to any dataset is if it is "large enough". To answer this question for DocLayNet, we performed a data ablation study in which we evaluated a Mask R-CNN model trained on increasing fractions of the DocLayNet dataset. As can be seen in Figure 5, the mAP score rises sharply in the beginning and eventually levels out. To estimate the error-bar on the metrics, we ran the training five times on the entire data-set. This resulted in a 1% error-bar, depicted by the shaded area in Figure 5. In the inset of Figure 5, we show the exact same data-points, but with a logarithmic scale on the x-axis. As is expected, the mAP score increases linearly as a function of the data-size in the inset. The curve ultimately flattens out between the 80% and 100% mark, with the 80% mark falling within the error-bars of the 100% mark. This provides a good indication that the model would not improve significantly by yet increasing the data size. Rather, it would probably benefit more from improved data consistency (as discussed in Section 3), data augmentation methods [23], or the addition of more document categories and styles.</text>
|
||||
<section_header_level_1><loc_44><loc_342><loc_134><loc_349>Impact of Class Labels</section_header_level_1>
|
||||
@ -116,7 +116,7 @@
|
||||
<page_break>
|
||||
<page_header><loc_44><loc_38><loc_456><loc_43>KDD '22, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar</page_header>
|
||||
<text><loc_44><loc_55><loc_242><loc_95>Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network across the PubLayNet, DocBank & DocLayNet data-sets. By evaluating on common label classes of each dataset, we observe that the DocLayNet-trained model has much less pronounced variations in performance across all datasets.</text>
|
||||
<otsl><loc_59><loc_109><loc_225><loc_215><ecel><ecel><ched>Testing on<lcel><lcel><nl><ched>Training on<ched>labels<ched>PLN<ched>DB<ched>DLN<nl><rhed>PubLayNet (PLN)<rhed>Figure<fcel>96<fcel>43<fcel>23<nl><ucel><rhed>Sec-header<fcel>87<fcel>-<fcel>32<nl><ucel><rhed>Table<fcel>95<fcel>24<fcel>49<nl><ucel><rhed>Text<fcel>96<fcel>-<fcel>42<nl><ucel><rhed>total<fcel>93<fcel>34<fcel>30<nl><rhed>DocBank (DB)<rhed>Figure<fcel>77<fcel>71<fcel>31<nl><ucel><rhed>Table<fcel>19<fcel>65<fcel>22<nl><ucel><rhed>total<fcel>48<fcel>68<fcel>27<nl><rhed>DocLayNet (DLN)<rhed>Figure<fcel>67<fcel>51<fcel>72<nl><ucel><rhed>Sec-header<fcel>53<fcel>-<fcel>68<nl><ucel><rhed>Table<fcel>87<fcel>43<fcel>82<nl><ucel><rhed>Text<fcel>77<fcel>-<fcel>84<nl><ucel><rhed>total<fcel>59<fcel>47<fcel>78<nl></otsl>
|
||||
<otsl><loc_59><loc_109><loc_225><loc_215><ecel><ecel><ched>Testing on<lcel><lcel><nl><ched>Training on<ched>labels<ched>PLN<ched>DB<ched>DLN<nl><rhed>PubLayNet (PLN)<rhed>Figure<fcel>96<fcel>43<fcel>23<nl><ucel><rhed>Sec-header<fcel>87<fcel>-<fcel>32<nl><ecel><rhed>Table<fcel>95<fcel>24<fcel>49<nl><ecel><rhed>Text<fcel>96<fcel>-<fcel>42<nl><ecel><rhed>total<fcel>93<fcel>34<fcel>30<nl><rhed>DocBank (DB)<rhed>Figure<fcel>77<fcel>71<fcel>31<nl><ucel><rhed>Table<fcel>19<fcel>65<fcel>22<nl><ucel><rhed>total<fcel>48<fcel>68<fcel>27<nl><rhed>DocLayNet (DLN)<rhed>Figure<fcel>67<fcel>51<fcel>72<nl><ucel><rhed>Sec-header<fcel>53<fcel>-<fcel>68<nl><ecel><rhed>Table<fcel>87<fcel>43<fcel>82<nl><ecel><rhed>Text<fcel>77<fcel>-<fcel>84<nl><ecel><rhed>total<fcel>59<fcel>47<fcel>78<nl></otsl>
|
||||
<text><loc_44><loc_247><loc_240><loc_280>Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .</text>
|
||||
<text><loc_44><loc_281><loc_241><loc_370>For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. We had to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.</text>
|
||||
<section_header_level_1><loc_44><loc_382><loc_127><loc_388>Example Predictions</section_header_level_1>
|
||||
|
File diff suppressed because one or more lines are too long
@ -97,21 +97,21 @@ The annotation campaign was carried out in four phases. In phase one, we identif
|
||||
|
||||
Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row "Total") in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.
|
||||
|
||||
| | | % of Total | % of Total | % of Total | % of Total | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) |
|
||||
|----------------|---------|--------------|--------------|--------------|--------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
|
||||
| class label | Count | Train | Test | Val | All | Fin | Man | Sci | Law | Pat | Ten |
|
||||
| Caption | 22524 | 2.04 | 1.77 | 2.32 | 84-89 | 40-61 | 86-92 | 94-99 | 95-99 | 69-78 | n/a |
|
||||
| Footnote | 6318 | 0.60 | 0.31 | 0.58 | 83-91 | n/a | 100 | 62-88 | 85-94 | n/a | 82-97 |
|
||||
| Formula | 25027 | 2.25 | 1.90 | 2.96 | 83-85 | n/a | n/a | 84-87 | 86-96 | n/a | n/a |
|
||||
| List-item | 185660 | 17.19 | 13.34 | 15.82 | 87-88 | 74-83 | 90-92 | 97-97 | 81-85 | 75-88 | 93-95 |
|
||||
| Page-footer | 70878 | 6.51 | 5.58 | 6.00 | 93-94 | 88-90 | 95-96 | 100 | 92-97 | 100 | 96-98 |
|
||||
| Page-header | 58022 | 5.10 | 6.70 | 5.06 | 85-89 | 66-76 | 90-94 | 98-100 | 91-92 | 97-99 | 81-86 |
|
||||
| Picture | 45976 | 4.21 | 2.78 | 5.31 | 69-71 | 56-59 | 82-86 | 69-82 | 80-95 | 66-71 | 59-76 |
|
||||
| Section-header | 142884 | 12.60 | 15.77 | 12.85 | 83-84 | 76-81 | 90-92 | 94-95 | 87-94 | 69-73 | 78-86 |
|
||||
| Table | 34733 | 3.20 | 2.27 | 3.60 | 77-81 | 75-80 | 83-86 | 98-99 | 58-80 | 79-84 | 70-85 |
|
||||
| Text | 510377 | 45.82 | 49.28 | 45.00 | 84-86 | 81-86 | 88-93 | 89-93 | 87-92 | 71-79 | 87-95 |
|
||||
| Title | 5071 | 0.47 | 0.30 | 0.50 | 60-72 | 24-63 | 50-63 | 94-100 | 82-96 | 68-79 | 24-56 |
|
||||
| Total | 1107470 | 941123 | 99816 | 66531 | 82-83 | 71-74 | 79-81 | 89-94 | 86-91 | 71-76 | 68-85 |
|
||||
| | | % of Total | % of Total | % of Total | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) |
|
||||
|----------------|---------|--------------|--------------|--------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
|
||||
| class label | Count | Train | Test | Val | All | Fin | Man | Sci | Law | Pat | Ten |
|
||||
| Caption | 22524 | 2.04 | 1.77 | 2.32 | 84-89 | 40-61 | 86-92 | 94-99 | 95-99 | 69-78 | n/a |
|
||||
| Footnote | 6318 | 0.60 | 0.31 | 0.58 | 83-91 | n/a | 100 | 62-88 | 85-94 | n/a | 82-97 |
|
||||
| Formula | 25027 | 2.25 | 1.90 | 2.96 | 83-85 | n/a | n/a | 84-87 | 86-96 | n/a | n/a |
|
||||
| List-item | 185660 | 17.19 | 13.34 | 15.82 | 87-88 | 74-83 | 90-92 | 97-97 | 81-85 | 75-88 | 93-95 |
|
||||
| Page-footer | 70878 | 6.51 | 5.58 | 6.00 | 93-94 | 88-90 | 95-96 | 100 | 92-97 | 100 | 96-98 |
|
||||
| Page-header | 58022 | 5.10 | 6.70 | 5.06 | 85-89 | 66-76 | 90-94 | 98-100 | 91-92 | 97-99 | 81-86 |
|
||||
| Picture | 45976 | 4.21 | 2.78 | 5.31 | 69-71 | 56-59 | 82-86 | 69-82 | 80-95 | 66-71 | 59-76 |
|
||||
| Section-header | 142884 | 12.60 | 15.77 | 12.85 | 83-84 | 76-81 | 90-92 | 94-95 | 87-94 | 69-73 | 78-86 |
|
||||
| Table | 34733 | 3.20 | 2.27 | 3.60 | 77-81 | 75-80 | 83-86 | 98-99 | 58-80 | 79-84 | 70-85 |
|
||||
| Text | 510377 | 45.82 | 49.28 | 45.00 | 84-86 | 81-86 | 88-93 | 89-93 | 87-92 | 71-79 | 87-95 |
|
||||
| Title | 5071 | 0.47 | 0.30 | 0.50 | 60-72 | 24-63 | 50-63 | 94-100 | 82-96 | 68-79 | 24-56 |
|
||||
| Total | 1107470 | 941123 | 99816 | 66531 | 82-83 | 71-74 | 79-81 | 89-94 | 86-91 | 71-76 | 68-85 |
|
||||
|
||||
Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.
|
||||
|
||||
@ -154,7 +154,7 @@ Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on D
|
||||
|
||||
| | human | MRCNN | MRCNN | FRCNN | YOLO |
|
||||
|----------------|---------|---------|---------|---------|--------|
|
||||
| | human | R50 | R101 | R101 | v5x6 |
|
||||
| | | R50 | R101 | R101 | v5x6 |
|
||||
| Caption | 84-89 | 68.4 | 71.5 | 70.1 | 77.7 |
|
||||
| Footnote | 83-91 | 70.9 | 71.8 | 73.7 | 77.2 |
|
||||
| Formula | 83-85 | 60.1 | 63.4 | 63.5 | 66.2 |
|
||||
@ -246,17 +246,17 @@ Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network acros
|
||||
| Training on | labels | PLN | DB | DLN |
|
||||
| PubLayNet (PLN) | Figure | 96 | 43 | 23 |
|
||||
| PubLayNet (PLN) | Sec-header | 87 | - | 32 |
|
||||
| PubLayNet (PLN) | Table | 95 | 24 | 49 |
|
||||
| PubLayNet (PLN) | Text | 96 | - | 42 |
|
||||
| PubLayNet (PLN) | total | 93 | 34 | 30 |
|
||||
| | Table | 95 | 24 | 49 |
|
||||
| | Text | 96 | - | 42 |
|
||||
| | total | 93 | 34 | 30 |
|
||||
| DocBank (DB) | Figure | 77 | 71 | 31 |
|
||||
| DocBank (DB) | Table | 19 | 65 | 22 |
|
||||
| DocBank (DB) | total | 48 | 68 | 27 |
|
||||
| DocLayNet (DLN) | Figure | 67 | 51 | 72 |
|
||||
| DocLayNet (DLN) | Sec-header | 53 | - | 68 |
|
||||
| DocLayNet (DLN) | Table | 87 | 43 | 82 |
|
||||
| DocLayNet (DLN) | Text | 77 | - | 84 |
|
||||
| DocLayNet (DLN) | total | 59 | 47 | 78 |
|
||||
| | Table | 87 | 43 | 82 |
|
||||
| | Text | 77 | - | 84 |
|
||||
| | total | 59 | 47 | 78 |
|
||||
|
||||
Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .
|
||||
|
||||
|
File diff suppressed because one or more lines are too long
@ -3,7 +3,7 @@
|
||||
<text><loc_110><loc_74><loc_393><loc_97>order to compute the TED score. Inference timing results for all experiments were obtained from the same machine on a single core with AMD EPYC 7763 CPU @2.45 GHz.</text>
|
||||
<section_header_level_1><loc_110><loc_105><loc_260><loc_113>5.1 Hyper Parameter Optimization</section_header_level_1>
|
||||
<text><loc_110><loc_116><loc_393><loc_161>We have chosen the PubTabNet data set to perform HPO, since it includes a highly diverse set of tables. Also we report TED scores separately for simple and complex tables (tables with cell spans). Results are presented in Table. 1. It is evident that with OTSL, our model achieves the same TED score and slightly better mAP scores in comparison to HTML. However OTSL yields a 2x speed up in the inference runtime over HTML.</text>
|
||||
<otsl><loc_114><loc_213><loc_388><loc_296><ched>#<ched>#<ched>Language<ched>TEDs<lcel><lcel><ched>mAP<ched>Inference<nl><ched>enc-layers<ched>dec-layers<ucel><ched>simple<ched>complex<ched>all<ched>(0.75)<ched>time (secs)<nl><fcel>6<fcel>6<fcel>OTSL HTML<fcel>0.965 0.969<fcel>0.934 0.927<fcel>0.955 0.955<fcel>0.88 0.857<fcel>2.73 5.39<nl><fcel>4<fcel>4<fcel>OTSL HTML<fcel>0.938<fcel>0.904<fcel>0.927<fcel>0.853<fcel>1.97<nl><ecel><ecel><fcel>OTSL<fcel>0.952 0.923<fcel>0.909<fcel>0.938<fcel>0.843<fcel>3.77<nl><fcel>2<fcel>4<fcel>HTML<fcel>0.945<fcel>0.897 0.901<fcel>0.915 0.931<fcel>0.859 0.834<fcel>1.91 3.81<nl><fcel>4<fcel>2<fcel>OTSL HTML<fcel>0.952 0.944<fcel>0.92 0.903<fcel>0.942 0.931<fcel>0.857 0.824<fcel>1.22 2<nl><caption><loc_110><loc_172><loc_393><loc_207>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption></otsl>
|
||||
<otsl><loc_114><loc_213><loc_388><loc_296><ched># enc-layers<ched># dec-layers<ched>Language<ched>TEDs<lcel><lcel><ched>mAP<ched>Inference<nl><ucel><ucel><ucel><ched>simple<ched>complex<ched>all<ched>(0.75)<ched>time (secs)<nl><fcel>6<fcel>6<fcel>OTSL HTML<fcel>0.965 0.969<fcel>0.934 0.927<fcel>0.955 0.955<fcel>0.88 0.857<fcel>2.73 5.39<nl><fcel>4<fcel>4<fcel>OTSL HTML<fcel>0.938 0.952<fcel>0.904 0.909<fcel>0.927 0.938<fcel>0.853 0.843<fcel>1.97 3.77<nl><fcel>2<fcel>4<fcel>OTSL HTML<fcel>0.923 0.945<fcel>0.897 0.901<fcel>0.915 0.931<fcel>0.859 0.834<fcel>1.91 3.81<nl><fcel>4<fcel>2<fcel>OTSL HTML<fcel>0.952 0.944<fcel>0.92 0.903<fcel>0.942 0.931<fcel>0.857 0.824<fcel>1.22 2<nl><caption><loc_110><loc_172><loc_393><loc_207>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption></otsl>
|
||||
<section_header_level_1><loc_110><loc_319><loc_216><loc_327>5.2 Quantitative Results</section_header_level_1>
|
||||
<text><loc_110><loc_330><loc_393><loc_390>We picked the model parameter configuration that produced the best prediction quality (enc=6, dec=6, heads=8) with PubTabNet alone, then independently trained and evaluated it on three publicly available data sets: PubTabNet (395k samples), FinTabNet (113k samples) and PubTables-1M (about 1M samples). Performance results are presented in Table. 2. It is clearly evident that the model trained on OTSL outperforms HTML across the board, keeping high TEDs and mAP scores even on difficult financial tables (FinTabNet) that contain sparse and large tables.</text>
|
||||
<text><loc_110><loc_390><loc_393><loc_421>Additionally, the results show that OTSL has an advantage over HTML when applied on a bigger data set like PubTables-1M and achieves significantly improved scores. Finally, OTSL achieves faster inference due to fewer decoding steps which is a result of the reduced sequence representation.</text>
|
||||
|
File diff suppressed because one or more lines are too long
@ -6,14 +6,13 @@ We have chosen the PubTabNet data set to perform HPO, since it includes a highly
|
||||
|
||||
Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.
|
||||
|
||||
| # | # | Language | TEDs | TEDs | TEDs | mAP | Inference |
|
||||
|------------|------------|------------|-------------|-------------|-------------|-------------|-------------|
|
||||
| enc-layers | dec-layers | Language | simple | complex | all | (0.75) | time (secs) |
|
||||
| 6 | 6 | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
|
||||
| 4 | 4 | OTSL HTML | 0.938 | 0.904 | 0.927 | 0.853 | 1.97 |
|
||||
| | | OTSL | 0.952 0.923 | 0.909 | 0.938 | 0.843 | 3.77 |
|
||||
| 2 | 4 | HTML | 0.945 | 0.897 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81 |
|
||||
| 4 | 2 | OTSL HTML | 0.952 0.944 | 0.92 0.903 | 0.942 0.931 | 0.857 0.824 | 1.22 2 |
|
||||
| # enc-layers | # dec-layers | Language | TEDs | TEDs | TEDs | mAP | Inference |
|
||||
|----------------|----------------|------------|-------------|-------------|-------------|-------------|-------------|
|
||||
| # enc-layers | # dec-layers | Language | simple | complex | all | (0.75) | time (secs) |
|
||||
| 6 | 6 | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
|
||||
| 4 | 4 | OTSL HTML | 0.938 0.952 | 0.904 0.909 | 0.927 0.938 | 0.853 0.843 | 1.97 3.77 |
|
||||
| 2 | 4 | OTSL HTML | 0.923 0.945 | 0.897 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81 |
|
||||
| 4 | 2 | OTSL HTML | 0.952 0.944 | 0.92 0.903 | 0.942 0.931 | 0.857 0.824 | 1.22 2 |
|
||||
|
||||
## 5.2 Quantitative Results
|
||||
|
||||
|
File diff suppressed because one or more lines are too long
@ -89,14 +89,14 @@
|
||||
<text><loc_110><loc_75><loc_393><loc_96>order to compute the TED score. Inference timing results for all experiments were obtained from the same machine on a single core with AMD EPYC 7763 CPU @2.45 GHz.</text>
|
||||
<section_header_level_1><loc_110><loc_107><loc_260><loc_112>5.1 Hyper Parameter Optimization</section_header_level_1>
|
||||
<text><loc_110><loc_117><loc_393><loc_160>We have chosen the PubTabNet data set to perform HPO, since it includes a highly diverse set of tables. Also we report TED scores separately for simple and complex tables (tables with cell spans). Results are presented in Table. 1. It is evident that with OTSL, our model achieves the same TED score and slightly better mAP scores in comparison to HTML. However OTSL yields a 2x speed up in the inference runtime over HTML.</text>
|
||||
<otsl><loc_114><loc_213><loc_388><loc_296><ched>#<ched>#<ched>Language<ched>TEDs<lcel><lcel><ched>mAP<ched>Inference<nl><ched>enc-layers<ched>dec-layers<ucel><ched>simple<ched>complex<ched>all<ched>(0.75)<ched>time (secs)<nl><fcel>6<fcel>6<fcel>OTSL HTML<fcel>0.965 0.969<fcel>0.934 0.927<fcel>0.955 0.955<fcel>0.88 0.857<fcel>2.73 5.39<nl><fcel>4<fcel>4<fcel>OTSL HTML<fcel>0.938 0.952<fcel>0.904<fcel>0.927<fcel>0.853<fcel>1.97<nl><fcel>2<fcel>4<fcel>OTSL<fcel>0.923 0.945<fcel>0.909 0.897<fcel>0.938<fcel>0.843<fcel>3.77<nl><ecel><ecel><fcel>HTML<ecel><fcel>0.901<fcel>0.915 0.931<fcel>0.859 0.834<fcel>1.91 3.81<nl><fcel>4<fcel>2<fcel>OTSL HTML<fcel>0.952 0.944<fcel>0.92 0.903<fcel>0.942 0.931<fcel>0.857 0.824<fcel>1.22 2<nl><caption><loc_110><loc_174><loc_393><loc_206>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption></otsl>
|
||||
<otsl><loc_114><loc_213><loc_388><loc_296><ched># enc-layers<ched># dec-layers<ched>Language<ched>TEDs<lcel><lcel><ched>mAP<ched>Inference<nl><ucel><ucel><ucel><ched>simple<ched>complex<ched>all<ched>(0.75)<ched>time (secs)<nl><fcel>6<fcel>6<fcel>OTSL HTML<fcel>0.965 0.969<fcel>0.934 0.927<fcel>0.955 0.955<fcel>0.88 0.857<fcel>2.73 5.39<nl><fcel>4<fcel>4<fcel>OTSL HTML<fcel>0.938 0.952<fcel>0.904 0.909<fcel>0.927 0.938<fcel>0.853 0.843<fcel>1.97 3.77<nl><fcel>2<fcel>4<fcel>OTSL HTML<fcel>0.923 0.945<fcel>0.897 0.901<fcel>0.915 0.931<fcel>0.859 0.834<fcel>1.91 3.81<nl><fcel>4<fcel>2<fcel>OTSL HTML<fcel>0.952 0.944<fcel>0.92 0.903<fcel>0.942 0.931<fcel>0.857 0.824<fcel>1.22 2<nl><caption><loc_110><loc_174><loc_393><loc_206>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption></otsl>
|
||||
<section_header_level_1><loc_110><loc_321><loc_216><loc_326>5.2 Quantitative Results</section_header_level_1>
|
||||
<text><loc_110><loc_331><loc_393><loc_390>We picked the model parameter configuration that produced the best prediction quality (enc=6, dec=6, heads=8) with PubTabNet alone, then independently trained and evaluated it on three publicly available data sets: PubTabNet (395k samples), FinTabNet (113k samples) and PubTables-1M (about 1M samples). Performance results are presented in Table. 2. It is clearly evident that the model trained on OTSL outperforms HTML across the board, keeping high TEDs and mAP scores even on difficult financial tables (FinTabNet) that contain sparse and large tables.</text>
|
||||
<text><loc_110><loc_392><loc_393><loc_420>Additionally, the results show that OTSL has an advantage over HTML when applied on a bigger data set like PubTables-1M and achieves significantly improved scores. Finally, OTSL achieves faster inference due to fewer decoding steps which is a result of the reduced sequence representation.</text>
|
||||
<page_break>
|
||||
<page_header><loc_110><loc_59><loc_118><loc_64>10</page_header>
|
||||
<page_header><loc_137><loc_59><loc_189><loc_64>M. Lysak, et al.</page_header>
|
||||
<otsl><loc_117><loc_99><loc_385><loc_166><ecel><ched>Language<ched>TEDs<lcel><lcel><ched>mAP(0.75)<ched>Inference time (secs)<nl><ecel><ucel><ched>simple<ched>complex<ched>all<ucel><ucel><nl><rhed>PubTabNet<rhed>OTSL<fcel>0.965<fcel>0.934<fcel>0.955<fcel>0.88<fcel>2.73<nl><ucel><rhed>HTML<fcel>0.969<fcel>0.927<fcel>0.955<fcel>0.857<fcel>5.39<nl><rhed>FinTabNet<rhed>OTSL<fcel>0.955<fcel>0.961<fcel>0.959<fcel>0.862<fcel>1.85<nl><ucel><rhed>HTML<fcel>0.917<fcel>0.922<fcel>0.92<fcel>0.722<fcel>3.26<nl><rhed>PubTables-1M<rhed>OTSL<fcel>0.987<fcel>0.964<fcel>0.977<fcel>0.896<fcel>1.79<nl><ucel><rhed>HTML<fcel>0.983<fcel>0.944<fcel>0.966<fcel>0.889<fcel>3.26<nl><caption><loc_110><loc_73><loc_393><loc_92>Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using TableFormer [9] (with enc=6, dec=6, heads=8).</caption></otsl>
|
||||
<otsl><loc_117><loc_99><loc_385><loc_166><ched>Data set<ched>Language<ched>TEDs<lcel><lcel><ched>mAP(0.75)<ched>Inference time (secs)<nl><ucel><ucel><ched>simple<ched>complex<ched>all<ucel><ucel><nl><fcel>PubTabNet<fcel>OTSL HTML<fcel>0.965 0.969<fcel>0.934 0.927<fcel>0.955 0.955<fcel>0.88 0.857<fcel>2.73 5.39<nl><fcel>FinTabNet<fcel>OTSL HTML<fcel>0.955 0.917<fcel>0.961 0.922<fcel>0.959 0.92<fcel>0.862 0.722<fcel>1.85 3.26<nl><fcel>PubTables-1M<fcel>OTSL HTML<fcel>0.987 0.983<fcel>0.964 0.944<fcel>0.977 0.966<fcel>0.896 0.889<fcel>1.79 3.26<nl><caption><loc_110><loc_73><loc_393><loc_92>Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using TableFormer [9] (with enc=6, dec=6, heads=8).</caption></otsl>
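Note on the `<otsl>` fixture lines in this hunk: they encode a table as a flat token stream, and the change above differs only in which header and span tokens are emitted. The sketch below expands such a stream into a dense grid so the two variants can be compared cell by cell. It assumes the usual reading of the OTSL tokens (`<lcel>` continues the cell to its left, `<ucel>` continues the cell above, `<ecel>` is empty, `<nl>` ends a row); `<xcel>` and `<srow>` are included as assumptions, and `parse_otsl` and `fill_spans` are hypothetical helpers, not docling's own parser.

```python
import re

# Tags seen in the fixture lines, plus <xcel> and <srow> (assumed here).
CELL_TOKEN = re.compile(r"<(ched|rhed|srow|fcel|ecel|lcel|ucel|xcel|nl)>([^<]*)")


def parse_otsl(otsl: str) -> list:
    """Split an OTSL string into rows of (tag, text) pairs."""
    # Drop the caption, the location tokens and the wrapper element.
    body = re.sub(r"<caption>.*?</caption>", "", otsl, flags=re.S)
    body = re.sub(r"<loc_\d+>|</?otsl>", "", body)
    rows, current = [], []
    for tag, text in CELL_TOKEN.findall(body):
        if tag == "nl":
            rows.append(current)
            current = []
        else:
            current.append((tag, text.strip()))
    if current:  # tolerate a missing trailing <nl>
        rows.append(current)
    return rows


def fill_spans(rows: list) -> list:
    """Resolve span markers into a dense grid of cell strings."""
    grid = []
    for r, row in enumerate(rows):
        out = []
        for c, (tag, text) in enumerate(row):
            if tag == "lcel" and c > 0:
                out.append(out[c - 1])        # repeat the cell to the left
            elif tag in ("ucel", "xcel") and r > 0 and c < len(grid[r - 1]):
                out.append(grid[r - 1][c])    # repeat the cell above
            elif tag in ("ecel", "lcel", "ucel", "xcel"):
                out.append("")                # empty, or span with no source
            else:
                out.append(text)
        grid.append(out)
    return grid


# Header rows and first body row of the Table 2 fixture, shortened.
EXAMPLE = (
    "<otsl><loc_117><loc_99><loc_385><loc_166>"
    "<ched>Data set<ched>Language<ched>TEDs<lcel><lcel>"
    "<ched>mAP(0.75)<ched>Inference time (secs)<nl>"
    "<ucel><ucel><ched>simple<ched>complex<ched>all<ucel><ucel><nl>"
    "<fcel>PubTabNet<fcel>OTSL HTML<fcel>0.965 0.969<fcel>0.934 0.927"
    "<fcel>0.955 0.955<fcel>0.88 0.857<fcel>2.73 5.39<nl>"
    "<caption><loc_110><loc_73><loc_393><loc_92>Table 2.</caption></otsl>"
)

grid = fill_spans(parse_otsl(EXAMPLE))

if __name__ == "__main__":
    for row in grid:
        print(" | ".join(row))
```

Repeating the spanned text in every covered cell mirrors the repeated `TEDs` and `Data set` header cells in the Markdown rendering of the same table; a cell whose text contains a literal `<` would need a stricter tokenizer than this regular expression.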
|
||||
<section_header_level_1><loc_110><loc_182><loc_210><loc_188>5.3 Qualitative Results</section_header_level_1>
|
||||
<text><loc_110><loc_196><loc_393><loc_231>To illustrate the qualitative differences between OTSL and HTML, Figure 5 demonstrates less overlap and more accurate bounding boxes with OTSL. In Figure 6, OTSL proves to be more effective in handling tables with longer token sequences, resulting in even more precise structure prediction and bounding boxes.</text>
|
||||
<picture><loc_133><loc_281><loc_369><loc_419><caption><loc_110><loc_251><loc_393><loc_278>Fig. 5. The OTSL model produces more accurate bounding boxes with less overlap (E) than the HTML model (D), when predicting the structure of a sparse table (A), at twice the inference speed because of shorter sequence length (B),(C). "PMC2807444_006_00.png" PubTabNet. μ</caption></picture>
|
||||
|
File diff suppressed because one or more lines are too long
@ -126,14 +126,13 @@ We have chosen the PubTabNet data set to perform HPO, since it includes a highly
|
||||
|
||||
Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.
|
||||
|
||||
| # | # | Language | TEDs | TEDs | TEDs | mAP | Inference |
|
||||
|------------|------------|------------|-------------|-------------|-------------|-------------|-------------|
|
||||
| enc-layers | dec-layers | Language | simple | complex | all | (0.75) | time (secs) |
|
||||
| 6 | 6 | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
|
||||
| 4 | 4 | OTSL HTML | 0.938 0.952 | 0.904 | 0.927 | 0.853 | 1.97 |
|
||||
| 2 | 4 | OTSL | 0.923 0.945 | 0.909 0.897 | 0.938 | 0.843 | 3.77 |
|
||||
| | | HTML | | 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81 |
|
||||
| 4 | 2 | OTSL HTML | 0.952 0.944 | 0.92 0.903 | 0.942 0.931 | 0.857 0.824 | 1.22 2 |
|
||||
| # enc-layers | # dec-layers | Language | TEDs | TEDs | TEDs | mAP | Inference |
|
||||
|----------------|----------------|------------|-------------|-------------|-------------|-------------|-------------|
|
||||
| # enc-layers | # dec-layers | Language | simple | complex | all | (0.75) | time (secs) |
|
||||
| 6 | 6 | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
|
||||
| 4 | 4 | OTSL HTML | 0.938 0.952 | 0.904 0.909 | 0.927 0.938 | 0.853 0.843 | 1.97 3.77 |
|
||||
| 2 | 4 | OTSL HTML | 0.923 0.945 | 0.897 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81 |
|
||||
| 4 | 2 | OTSL HTML | 0.952 0.944 | 0.92 0.903 | 0.942 0.931 | 0.857 0.824 | 1.22 2 |
|
||||
|
||||
## 5.2 Quantitative Results
|
||||
|
||||
@ -143,15 +142,12 @@ Additionally, the results show that OTSL has an advantage over HTML when applied
|
||||
|
||||
Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using TableFormer [9] (with enc=6, dec=6, heads=8).
|
||||
|
||||
| | Language | TEDs | TEDs | TEDs | mAP(0.75) | Inference time (secs) |
|
||||
|--------------|------------|--------|---------|--------|-------------|-------------------------|
|
||||
| | Language | simple | complex | all | mAP(0.75) | Inference time (secs) |
|
||||
| PubTabNet | OTSL | 0.965 | 0.934 | 0.955 | 0.88 | 2.73 |
|
||||
| PubTabNet | HTML | 0.969 | 0.927 | 0.955 | 0.857 | 5.39 |
|
||||
| FinTabNet | OTSL | 0.955 | 0.961 | 0.959 | 0.862 | 1.85 |
|
||||
| FinTabNet | HTML | 0.917 | 0.922 | 0.92 | 0.722 | 3.26 |
|
||||
| PubTables-1M | OTSL | 0.987 | 0.964 | 0.977 | 0.896 | 1.79 |
|
||||
| PubTables-1M | HTML | 0.983 | 0.944 | 0.966 | 0.889 | 3.26 |
|
||||
| Data set | Language | TEDs | TEDs | TEDs | mAP(0.75) | Inference time (secs) |
|
||||
|--------------|------------|-------------|-------------|-------------|-------------|-------------------------|
|
||||
| Data set | Language | simple | complex | all | mAP(0.75) | Inference time (secs) |
|
||||
| PubTabNet | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
|
||||
| FinTabNet | OTSL HTML | 0.955 0.917 | 0.961 0.922 | 0.959 0.92 | 0.862 0.722 | 1.85 3.26 |
|
||||
| PubTables-1M | OTSL HTML | 0.987 0.983 | 0.964 0.944 | 0.977 0.966 | 0.896 0.889 | 1.79 3.26 |
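Note on the merged rows above: each value cell now carries the OTSL and HTML numbers as one space-separated pair, in the order given by the `Language` cell. The sketch below splits those pairs again and computes the HTML-to-OTSL inference-time ratio, which is one way to check the "2x speed up" statement in the quoted text against the table. The helper names `split_row` and `speedups` are hypothetical, and the column order (simple, complex, all, mAP, time) is taken from the header row.

```python
# The three merged data rows of Table 2, copied from the hunk above.
ROWS = """\
| PubTabNet    | OTSL HTML  | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857  | 2.73 5.39               |
| FinTabNet    | OTSL HTML  | 0.955 0.917 | 0.961 0.922 | 0.959 0.92  | 0.862 0.722 | 1.85 3.26               |
| PubTables-1M | OTSL HTML  | 0.987 0.983 | 0.964 0.944 | 0.977 0.966 | 0.896 0.889 | 1.79 3.26               |
"""


def split_row(line: str):
    """Return (data set, {language: [simple, complex, all, mAP, time]})."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    dataset, languages, *values = cells
    langs = languages.split()
    per_lang = {lang: [] for lang in langs}
    for cell in values:
        # The i-th number in a cell belongs to the i-th language.
        for lang, number in zip(langs, cell.split()):
            per_lang[lang].append(float(number))
    return dataset, per_lang


speedups = {}
for line in ROWS.splitlines():
    if not line.strip():
        continue
    dataset, per_lang = split_row(line)
    # Inference time is the last value column.
    speedups[dataset] = per_lang["HTML"][-1] / per_lang["OTSL"][-1]

if __name__ == "__main__":
    for dataset, ratio in speedups.items():
        print(f"{dataset}: HTML/OTSL inference time = {ratio:.2f}x")
```

On these three rows the ratio stays a little below 2 for every data set, so "2x" reads as a rounded figure.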
|
||||
|
||||
## 5.3 Qualitative Results
|
||||
|
||||
|
File diff suppressed because one or more lines are too long
@ -6,7 +6,7 @@ Empty unordered list:
|
||||
|
||||
Ordered list:
|
||||
|
||||
- bar
|
||||
1. bar
|
||||
|
||||
Empty ordered list:
|
||||
|
||||
|
@ -1,7 +1,7 @@
|
||||
<doctag><section_header_level_1><loc_109><loc_79><loc_258><loc_87>JavaScript Code Example</section_header_level_1>
|
||||
<text><loc_109><loc_94><loc_390><loc_183>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</text>
|
||||
<text><loc_109><loc_185><loc_390><loc_213>Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet,</text>
|
||||
<code<loc_110><loc_231><loc_215><loc_257><_unknown_>function add(a, b) { return a + b; } console.log(add(3, 5));</code
|
||||
<code><loc_110><loc_231><loc_215><loc_257><_unknown_>function add(a, b) { return a + b; } console.log(add(3, 5));</code>
|
||||
<caption><loc_182><loc_221><loc_317><loc_226>Listing 1: Simple JavaScript Program</caption>
|
||||
<text><loc_109><loc_265><loc_390><loc_353>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</text>
|
||||
<text><loc_109><loc_355><loc_390><loc_383>Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet,</text>
|
||||
|
@ -51,7 +51,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -63,7 +63,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 2,
|
||||
"text": "2",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -75,7 +75,7 @@
|
||||
"start_col_offset_idx": 2,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "3",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -87,7 +87,7 @@
|
||||
"start_col_offset_idx": 3,
|
||||
"end_col_offset_idx": 4,
|
||||
"text": "4",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -296,7 +296,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -308,7 +308,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 2,
|
||||
"text": "2",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -320,7 +320,7 @@
|
||||
"start_col_offset_idx": 2,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "3",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -332,7 +332,7 @@
|
||||
"start_col_offset_idx": 3,
|
||||
"end_col_offset_idx": 4,
|
||||
"text": "4",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
}
|
||||
|
@ -51,7 +51,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "Index",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -63,7 +63,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 2,
|
||||
"text": "Customer Id",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -75,7 +75,7 @@
|
||||
"start_col_offset_idx": 2,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "First Name",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -87,7 +87,7 @@
|
||||
"start_col_offset_idx": 3,
|
||||
"end_col_offset_idx": 4,
|
||||
"text": "Last Name",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -99,7 +99,7 @@
|
||||
"start_col_offset_idx": 4,
|
||||
"end_col_offset_idx": 5,
|
||||
"text": "Company",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -111,7 +111,7 @@
|
||||
"start_col_offset_idx": 5,
|
||||
"end_col_offset_idx": 6,
|
||||
"text": "City",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -123,7 +123,7 @@
|
||||
"start_col_offset_idx": 6,
|
||||
"end_col_offset_idx": 7,
|
||||
"text": "Country",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -135,7 +135,7 @@
|
||||
"start_col_offset_idx": 7,
|
||||
"end_col_offset_idx": 8,
|
||||
"text": "Phone 1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -147,7 +147,7 @@
|
||||
"start_col_offset_idx": 8,
|
||||
"end_col_offset_idx": 9,
|
||||
"text": "Phone 2",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -159,7 +159,7 @@
|
||||
"start_col_offset_idx": 9,
|
||||
"end_col_offset_idx": 10,
|
||||
"text": "Email",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -171,7 +171,7 @@
|
||||
"start_col_offset_idx": 10,
|
||||
"end_col_offset_idx": 11,
|
||||
"text": "Subscription Date",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -183,7 +183,7 @@
|
||||
"start_col_offset_idx": 11,
|
||||
"end_col_offset_idx": 12,
|
||||
"text": "Website",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -920,7 +920,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "Index",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -932,7 +932,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 2,
|
||||
"text": "Customer Id",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -944,7 +944,7 @@
|
||||
"start_col_offset_idx": 2,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "First Name",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -956,7 +956,7 @@
|
||||
"start_col_offset_idx": 3,
|
||||
"end_col_offset_idx": 4,
|
||||
"text": "Last Name",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -968,7 +968,7 @@
|
||||
"start_col_offset_idx": 4,
|
||||
"end_col_offset_idx": 5,
|
||||
"text": "Company",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -980,7 +980,7 @@
|
||||
"start_col_offset_idx": 5,
|
||||
"end_col_offset_idx": 6,
|
||||
"text": "City",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -992,7 +992,7 @@
|
||||
"start_col_offset_idx": 6,
|
||||
"end_col_offset_idx": 7,
|
||||
"text": "Country",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1004,7 +1004,7 @@
|
||||
"start_col_offset_idx": 7,
|
||||
"end_col_offset_idx": 8,
|
||||
"text": "Phone 1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1016,7 +1016,7 @@
|
||||
"start_col_offset_idx": 8,
|
||||
"end_col_offset_idx": 9,
|
||||
"text": "Phone 2",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1028,7 +1028,7 @@
|
||||
"start_col_offset_idx": 9,
|
||||
"end_col_offset_idx": 10,
|
||||
"text": "Email",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1040,7 +1040,7 @@
|
||||
"start_col_offset_idx": 10,
|
||||
"end_col_offset_idx": 11,
|
||||
"text": "Subscription Date",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1052,7 +1052,7 @@
|
||||
"start_col_offset_idx": 11,
|
||||
"end_col_offset_idx": 12,
|
||||
"text": "Website",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
}
|
||||
|
@ -51,7 +51,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -63,7 +63,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 2,
|
||||
"text": "2",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -75,7 +75,7 @@
|
||||
"start_col_offset_idx": 2,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "3",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -284,7 +284,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -296,7 +296,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 2,
|
||||
"text": "2",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -308,7 +308,7 @@
|
||||
"start_col_offset_idx": 2,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "3",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
|
@ -51,7 +51,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "Index",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -63,7 +63,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 2,
|
||||
"text": "Customer Id",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -75,7 +75,7 @@
|
||||
"start_col_offset_idx": 2,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "First Name",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -87,7 +87,7 @@
|
||||
"start_col_offset_idx": 3,
|
||||
"end_col_offset_idx": 4,
|
||||
"text": "Last Name",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -99,7 +99,7 @@
|
||||
"start_col_offset_idx": 4,
|
||||
"end_col_offset_idx": 5,
|
||||
"text": "Company",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -111,7 +111,7 @@
|
||||
"start_col_offset_idx": 5,
|
||||
"end_col_offset_idx": 6,
|
||||
"text": "City",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -123,7 +123,7 @@
|
||||
"start_col_offset_idx": 6,
|
||||
"end_col_offset_idx": 7,
|
||||
"text": "Country",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -135,7 +135,7 @@
|
||||
"start_col_offset_idx": 7,
|
||||
"end_col_offset_idx": 8,
|
||||
"text": "Phone 1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -147,7 +147,7 @@
|
||||
"start_col_offset_idx": 8,
|
||||
"end_col_offset_idx": 9,
|
||||
"text": "Phone 2",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -159,7 +159,7 @@
|
||||
"start_col_offset_idx": 9,
|
||||
"end_col_offset_idx": 10,
|
||||
"text": "Email",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -171,7 +171,7 @@
|
||||
"start_col_offset_idx": 10,
|
||||
"end_col_offset_idx": 11,
|
||||
"text": "Subscription Date",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -183,7 +183,7 @@
|
||||
"start_col_offset_idx": 11,
|
||||
"end_col_offset_idx": 12,
|
||||
"text": "Website",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -920,7 +920,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "Index",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -932,7 +932,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 2,
|
||||
"text": "Customer Id",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -944,7 +944,7 @@
|
||||
"start_col_offset_idx": 2,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "First Name",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -956,7 +956,7 @@
|
||||
"start_col_offset_idx": 3,
|
||||
"end_col_offset_idx": 4,
|
||||
"text": "Last Name",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -968,7 +968,7 @@
|
||||
"start_col_offset_idx": 4,
|
||||
"end_col_offset_idx": 5,
|
||||
"text": "Company",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -980,7 +980,7 @@
|
||||
"start_col_offset_idx": 5,
|
||||
"end_col_offset_idx": 6,
|
||||
"text": "City",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -992,7 +992,7 @@
|
||||
"start_col_offset_idx": 6,
|
||||
"end_col_offset_idx": 7,
|
||||
"text": "Country",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1004,7 +1004,7 @@
|
||||
"start_col_offset_idx": 7,
|
||||
"end_col_offset_idx": 8,
|
||||
"text": "Phone 1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1016,7 +1016,7 @@
|
||||
"start_col_offset_idx": 8,
|
||||
"end_col_offset_idx": 9,
|
||||
"text": "Phone 2",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1028,7 +1028,7 @@
|
||||
"start_col_offset_idx": 9,
|
||||
"end_col_offset_idx": 10,
|
||||
"text": "Email",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1040,7 +1040,7 @@
|
||||
"start_col_offset_idx": 10,
|
||||
"end_col_offset_idx": 11,
|
||||
"text": "Subscription Date",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1052,7 +1052,7 @@
|
||||
"start_col_offset_idx": 11,
|
||||
"end_col_offset_idx": 12,
|
||||
"text": "Website",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
}
|
||||
|
@ -51,7 +51,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "Index",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -63,7 +63,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 2,
|
||||
"text": "Customer Id",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -75,7 +75,7 @@
|
||||
"start_col_offset_idx": 2,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "First Name",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -87,7 +87,7 @@
|
||||
"start_col_offset_idx": 3,
|
||||
"end_col_offset_idx": 4,
|
||||
"text": "Last Name",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -99,7 +99,7 @@
|
||||
"start_col_offset_idx": 4,
|
||||
"end_col_offset_idx": 5,
|
||||
"text": "Company",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -111,7 +111,7 @@
|
||||
"start_col_offset_idx": 5,
|
||||
"end_col_offset_idx": 6,
|
||||
"text": "City",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -123,7 +123,7 @@
|
||||
"start_col_offset_idx": 6,
|
||||
"end_col_offset_idx": 7,
|
||||
"text": "Country",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -135,7 +135,7 @@
|
||||
"start_col_offset_idx": 7,
|
||||
"end_col_offset_idx": 8,
|
||||
"text": "Phone 1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -147,7 +147,7 @@
|
||||
"start_col_offset_idx": 8,
|
||||
"end_col_offset_idx": 9,
|
||||
"text": "Phone 2",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -159,7 +159,7 @@
|
||||
"start_col_offset_idx": 9,
|
||||
"end_col_offset_idx": 10,
|
||||
"text": "Email",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -171,7 +171,7 @@
|
||||
"start_col_offset_idx": 10,
|
||||
"end_col_offset_idx": 11,
|
||||
"text": "Subscription Date",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -183,7 +183,7 @@
|
||||
"start_col_offset_idx": 11,
|
||||
"end_col_offset_idx": 12,
|
||||
"text": "Website",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -920,7 +920,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "Index",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -932,7 +932,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 2,
|
||||
"text": "Customer Id",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -944,7 +944,7 @@
|
||||
"start_col_offset_idx": 2,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "First Name",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -956,7 +956,7 @@
|
||||
"start_col_offset_idx": 3,
|
||||
"end_col_offset_idx": 4,
|
||||
"text": "Last Name",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -968,7 +968,7 @@
|
||||
"start_col_offset_idx": 4,
|
||||
"end_col_offset_idx": 5,
|
||||
"text": "Company",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -980,7 +980,7 @@
|
||||
"start_col_offset_idx": 5,
|
||||
"end_col_offset_idx": 6,
|
||||
"text": "City",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -992,7 +992,7 @@
|
||||
"start_col_offset_idx": 6,
|
||||
"end_col_offset_idx": 7,
|
||||
"text": "Country",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1004,7 +1004,7 @@
|
||||
"start_col_offset_idx": 7,
|
||||
"end_col_offset_idx": 8,
|
||||
"text": "Phone 1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1016,7 +1016,7 @@
|
||||
"start_col_offset_idx": 8,
|
||||
"end_col_offset_idx": 9,
|
||||
"text": "Phone 2",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1028,7 +1028,7 @@
|
||||
"start_col_offset_idx": 9,
|
||||
"end_col_offset_idx": 10,
|
||||
"text": "Email",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1040,7 +1040,7 @@
|
||||
"start_col_offset_idx": 10,
|
||||
"end_col_offset_idx": 11,
|
||||
"text": "Subscription Date",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1052,7 +1052,7 @@
|
||||
"start_col_offset_idx": 11,
|
||||
"end_col_offset_idx": 12,
|
||||
"text": "Website",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
}
|
||||
|
@ -51,7 +51,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "Index",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -63,7 +63,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 2,
|
||||
"text": "Customer Id",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -75,7 +75,7 @@
|
||||
"start_col_offset_idx": 2,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "First Name",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -87,7 +87,7 @@
|
||||
"start_col_offset_idx": 3,
|
||||
"end_col_offset_idx": 4,
|
||||
"text": "Last Name",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -99,7 +99,7 @@
|
||||
"start_col_offset_idx": 4,
|
||||
"end_col_offset_idx": 5,
|
||||
"text": "Company",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -111,7 +111,7 @@
|
||||
"start_col_offset_idx": 5,
|
||||
"end_col_offset_idx": 6,
|
||||
"text": "City",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -123,7 +123,7 @@
|
||||
"start_col_offset_idx": 6,
|
||||
"end_col_offset_idx": 7,
|
||||
"text": "Country",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -135,7 +135,7 @@
|
||||
"start_col_offset_idx": 7,
|
||||
"end_col_offset_idx": 8,
|
||||
"text": "Phone 1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -147,7 +147,7 @@
|
||||
"start_col_offset_idx": 8,
|
||||
"end_col_offset_idx": 9,
|
||||
"text": "Phone 2",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -159,7 +159,7 @@
|
||||
"start_col_offset_idx": 9,
|
||||
"end_col_offset_idx": 10,
|
||||
"text": "Email",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -171,7 +171,7 @@
|
||||
"start_col_offset_idx": 10,
|
||||
"end_col_offset_idx": 11,
|
||||
"text": "Subscription Date",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -183,7 +183,7 @@
|
||||
"start_col_offset_idx": 11,
|
||||
"end_col_offset_idx": 12,
|
||||
"text": "Website",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -920,7 +920,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "Index",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -932,7 +932,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 2,
|
||||
"text": "Customer Id",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -944,7 +944,7 @@
|
||||
"start_col_offset_idx": 2,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "First Name",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -956,7 +956,7 @@
|
||||
"start_col_offset_idx": 3,
|
||||
"end_col_offset_idx": 4,
|
||||
"text": "Last Name",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -968,7 +968,7 @@
|
||||
"start_col_offset_idx": 4,
|
||||
"end_col_offset_idx": 5,
|
||||
"text": "Company",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -980,7 +980,7 @@
|
||||
"start_col_offset_idx": 5,
|
||||
"end_col_offset_idx": 6,
|
||||
"text": "City",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -992,7 +992,7 @@
|
||||
"start_col_offset_idx": 6,
|
||||
"end_col_offset_idx": 7,
|
||||
"text": "Country",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1004,7 +1004,7 @@
|
||||
"start_col_offset_idx": 7,
|
||||
"end_col_offset_idx": 8,
|
||||
"text": "Phone 1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1016,7 +1016,7 @@
|
||||
"start_col_offset_idx": 8,
|
||||
"end_col_offset_idx": 9,
|
||||
"text": "Phone 2",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1028,7 +1028,7 @@
|
||||
"start_col_offset_idx": 9,
|
||||
"end_col_offset_idx": 10,
|
||||
"text": "Email",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1040,7 +1040,7 @@
|
||||
"start_col_offset_idx": 10,
|
||||
"end_col_offset_idx": 11,
|
||||
"text": "Subscription Date",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -1052,7 +1052,7 @@
|
||||
"start_col_offset_idx": 11,
|
||||
"end_col_offset_idx": 12,
|
||||
"text": "Website",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
}
|
||||
|
@ -51,7 +51,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -63,7 +63,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 2,
|
||||
"text": "2",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -75,7 +75,7 @@
|
||||
"start_col_offset_idx": 2,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "3",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -87,7 +87,7 @@
|
||||
"start_col_offset_idx": 3,
|
||||
"end_col_offset_idx": 4,
|
||||
"text": "4",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -284,7 +284,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -296,7 +296,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 2,
|
||||
"text": "2",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -308,7 +308,7 @@
|
||||
"start_col_offset_idx": 2,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "3",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -320,7 +320,7 @@
|
||||
"start_col_offset_idx": 3,
|
||||
"end_col_offset_idx": 4,
|
||||
"text": "4",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
}
|
||||
|
@ -51,7 +51,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -63,7 +63,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 2,
|
||||
"text": "2",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -75,7 +75,7 @@
|
||||
"start_col_offset_idx": 2,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "3",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -87,7 +87,7 @@
|
||||
"start_col_offset_idx": 3,
|
||||
"end_col_offset_idx": 4,
|
||||
"text": "4",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -308,7 +308,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -320,7 +320,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 2,
|
||||
"text": "2",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -332,7 +332,7 @@
|
||||
"start_col_offset_idx": 2,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "3",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -344,7 +344,7 @@
|
||||
"start_col_offset_idx": 3,
|
||||
"end_col_offset_idx": 4,
|
||||
"text": "4",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
|
40
tests/data/groundtruth/docling_v2/equations.docx.itxt
Normal file
@ -0,0 +1,40 @@
|
||||
item-0 at level 0: unspecified: group _root_
|
||||
item-1 at level 1: inline: group group
|
||||
item-2 at level 2: paragraph: This is a word document and this is an inline equation:
|
||||
item-3 at level 2: formula: A= \pi r^{2}
|
||||
item-4 at level 2: paragraph: . If instead, I want an equation by line, I can do this:
|
||||
item-5 at level 1: paragraph:
|
||||
item-6 at level 1: formula: a^{2}+b^{2}=c^{2} \text{ \texttimes } 23
|
||||
item-7 at level 1: paragraph: And that is an equation by itself. Cheers!
|
||||
item-8 at level 1: paragraph:
|
||||
item-9 at level 1: paragraph: This is another equation:
|
||||
item-10 at level 1: formula: f\left(x\right)=a_{0}+\sum_{n=1} ... })+b_{n}\sin(\frac{n \pi x}{L})\right)
|
||||
item-11 at level 1: paragraph:
|
||||
item-12 at level 1: paragraph: This is text. This is text. This ... s is text. This is text. This is text.
|
||||
item-13 at level 1: paragraph:
|
||||
item-14 at level 1: paragraph:
|
||||
item-15 at level 1: inline: group group
|
||||
item-16 at level 2: paragraph: This is a word document and this is an inline equation:
|
||||
item-17 at level 2: formula: A= \pi r^{2}
|
||||
item-18 at level 2: paragraph: . If instead, I want an equation by line, I can do this:
|
||||
item-19 at level 1: paragraph:
|
||||
item-20 at level 1: formula: \left(x+a\right)^{n}=\sum_{k=0}^ ... ac{}{}{0pt}{}{n}{k}\right)x^{k}a^{n-k}
|
||||
item-21 at level 1: paragraph:
|
||||
item-22 at level 1: paragraph: And that is an equation by itself. Cheers!
|
||||
item-23 at level 1: paragraph:
|
||||
item-24 at level 1: paragraph: This is another equation:
|
||||
item-25 at level 1: paragraph:
|
||||
item-26 at level 1: formula: \left(1+x\right)^{n}=1+\frac{nx} ... ght)x^{2}}{2!}+ \text{ \textellipsis }
|
||||
item-27 at level 1: paragraph:
|
||||
item-28 at level 1: paragraph: This is text. This is text. This ... s is text. This is text. This is text.
|
||||
item-29 at level 1: paragraph:
|
||||
item-30 at level 1: paragraph:
|
||||
item-31 at level 1: inline: group group
|
||||
item-32 at level 2: paragraph: This is a word document and this is an inline equation:
|
||||
item-33 at level 2: formula: A= \pi r^{2}
|
||||
item-34 at level 2: paragraph: . If instead, I want an equation by line, I can do this:
|
||||
item-35 at level 1: paragraph:
|
||||
item-36 at level 1: formula: e^{x}=1+\frac{x}{1!}+\frac{x^{2} ... xtellipsis } , - \infty < x < \infty
|
||||
item-37 at level 1: paragraph:
|
||||
item-38 at level 1: paragraph: And that is an equation by itself. Cheers!
|
||||
item-39 at level 1: paragraph:
|
616
tests/data/groundtruth/docling_v2/equations.docx.json
Normal file
@ -0,0 +1,616 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.2.0",
|
||||
"name": "equations",
|
||||
"origin": {
|
||||
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
"binary_hash": 11121138535595486899,
|
||||
"filename": "equations.docx"
|
||||
},
|
||||
"furniture": {
|
||||
"self_ref": "#/furniture",
|
||||
"children": [],
|
||||
"content_layer": "furniture",
|
||||
"name": "_root_",
|
||||
"label": "unspecified"
|
||||
},
|
||||
"body": {
|
||||
"self_ref": "#/body",
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/groups/0"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/3"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/4"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/5"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/6"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/7"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/8"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/9"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/10"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/11"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/12"
|
||||
},
|
||||
{
|
||||
"$ref": "#/groups/1"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/16"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/17"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/18"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/19"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/20"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/21"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/22"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/23"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/24"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/25"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/26"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/27"
|
||||
},
|
||||
{
|
||||
"$ref": "#/groups/2"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/31"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/32"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/33"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/34"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/35"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"name": "_root_",
|
||||
"label": "unspecified"
|
||||
},
|
||||
"groups": [
|
||||
{
|
||||
"self_ref": "#/groups/0",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/0"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/1"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/2"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"name": "group",
|
||||
"label": "inline"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/groups/1",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/13"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/14"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/15"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"name": "group",
|
||||
"label": "inline"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/groups/2",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/28"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/29"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/30"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"name": "group",
|
||||
"label": "inline"
|
||||
}
|
||||
],
|
||||
"texts": [
|
||||
{
|
||||
"self_ref": "#/texts/0",
|
||||
"parent": {
|
||||
"$ref": "#/groups/0"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "This is a word document and this is an inline equation: ",
|
||||
"text": "This is a word document and this is an inline equation: "
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/1",
|
||||
"parent": {
|
||||
"$ref": "#/groups/0"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "formula",
|
||||
"prov": [],
|
||||
"orig": "A= \\pi r^{2} ",
|
||||
"text": "A= \\pi r^{2} "
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/2",
|
||||
"parent": {
|
||||
"$ref": "#/groups/0"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": ". If instead, I want an equation by line, I can do this:",
|
||||
"text": ". If instead, I want an equation by line, I can do this:"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/3",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/4",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "formula",
|
||||
"prov": [],
|
||||
"orig": "a^{2}+b^{2}=c^{2} \\text{ \\texttimes } 23",
|
||||
"text": "a^{2}+b^{2}=c^{2} \\text{ \\texttimes } 23"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/5",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "And that is an equation by itself. Cheers!",
|
||||
"text": "And that is an equation by itself. Cheers!"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/6",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/7",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "This is another equation:",
|
||||
"text": "This is another equation:"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/8",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "formula",
|
||||
"prov": [],
|
||||
"orig": "f\\left(x\\right)=a_{0}+\\sum_{n=1}^{ \\infty }\\left(a_{n}\\cos(\\frac{n \\pi x}{L})+b_{n}\\sin(\\frac{n \\pi x}{L})\\right)",
|
||||
"text": "f\\left(x\\right)=a_{0}+\\sum_{n=1}^{ \\infty }\\left(a_{n}\\cos(\\frac{n \\pi x}{L})+b_{n}\\sin(\\frac{n \\pi x}{L})\\right)"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/9",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/10",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text.",
|
||||
"text": "This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text."
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/11",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/12",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/13",
|
||||
"parent": {
|
||||
"$ref": "#/groups/1"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "This is a word document and this is an inline equation: ",
|
||||
"text": "This is a word document and this is an inline equation: "
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/14",
|
||||
"parent": {
|
||||
"$ref": "#/groups/1"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "formula",
|
||||
"prov": [],
|
||||
"orig": "A= \\pi r^{2} ",
|
||||
"text": "A= \\pi r^{2} "
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/15",
|
||||
"parent": {
|
||||
"$ref": "#/groups/1"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": ". If instead, I want an equation by line, I can do this:",
|
||||
"text": ". If instead, I want an equation by line, I can do this:"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/16",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/17",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "formula",
|
||||
"prov": [],
|
||||
"orig": "\\left(x+a\\right)^{n}=\\sum_{k=0}^{n}\\left(\\genfrac{}{}{0pt}{}{n}{k}\\right)x^{k}a^{n-k}",
|
||||
"text": "\\left(x+a\\right)^{n}=\\sum_{k=0}^{n}\\left(\\genfrac{}{}{0pt}{}{n}{k}\\right)x^{k}a^{n-k}"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/18",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/19",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "And that is an equation by itself. Cheers!",
|
||||
"text": "And that is an equation by itself. Cheers!"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/20",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/21",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "This is another equation:",
|
||||
"text": "This is another equation:"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/22",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/23",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "formula",
|
||||
"prov": [],
|
||||
"orig": "\\left(1+x\\right)^{n}=1+\\frac{nx}{1!}+\\frac{n\\left(n-1\\right)x^{2}}{2!}+ \\text{ \\textellipsis }",
|
||||
"text": "\\left(1+x\\right)^{n}=1+\\frac{nx}{1!}+\\frac{n\\left(n-1\\right)x^{2}}{2!}+ \\text{ \\textellipsis }"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/24",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/25",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text.",
|
||||
"text": "This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text."
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/26",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/27",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/28",
|
||||
"parent": {
|
||||
"$ref": "#/groups/2"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "This is a word document and this is an inline equation: ",
|
||||
"text": "This is a word document and this is an inline equation: "
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/29",
|
||||
"parent": {
|
||||
"$ref": "#/groups/2"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "formula",
|
||||
"prov": [],
|
||||
"orig": "A= \\pi r^{2} ",
|
||||
"text": "A= \\pi r^{2} "
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/30",
|
||||
"parent": {
|
||||
"$ref": "#/groups/2"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": ". If instead, I want an equation by line, I can do this:",
|
||||
"text": ". If instead, I want an equation by line, I can do this:"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/31",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/32",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "formula",
|
||||
"prov": [],
|
||||
"orig": "e^{x}=1+\\frac{x}{1!}+\\frac{x^{2}}{2!}+\\frac{x^{3}}{3!}+ \\text{ \\textellipsis } , - \\infty < x < \\infty",
|
||||
"text": "e^{x}=1+\\frac{x}{1!}+\\frac{x^{2}}{2!}+\\frac{x^{3}}{3!}+ \\text{ \\textellipsis } , - \\infty < x < \\infty"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/33",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/34",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "And that is an equation by itself. Cheers!",
|
||||
"text": "And that is an equation by itself. Cheers!"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/35",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "paragraph",
|
||||
"prov": [],
|
||||
"orig": "",
|
||||
"text": ""
|
||||
}
|
||||
],
|
||||
"pictures": [],
|
||||
"tables": [],
|
||||
"key_value_items": [],
|
||||
"form_items": [],
|
||||
"pages": {}
|
||||
}
|
29
tests/data/groundtruth/docling_v2/equations.docx.md
Normal file
@ -0,0 +1,29 @@
|
||||
This is a word document and this is an inline equation: $A= \pi r^{2} $ . If instead, I want an equation by line, I can do this:
|
||||
|
||||
$$a^{2}+b^{2}=c^{2} \text{ \texttimes } 23$$
|
||||
|
||||
And that is an equation by itself. Cheers!
|
||||
|
||||
This is another equation:
|
||||
|
||||
$$f\left(x\right)=a_{0}+\sum_{n=1}^{ \infty }\left(a_{n}\cos(\frac{n \pi x}{L})+b_{n}\sin(\frac{n \pi x}{L})\right)$$
|
||||
|
||||
This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text.
|
||||
|
||||
This is a word document and this is an inline equation: $A= \pi r^{2} $ . If instead, I want an equation by line, I can do this:
|
||||
|
||||
$$\left(x+a\right)^{n}=\sum_{k=0}^{n}\left(\genfrac{}{}{0pt}{}{n}{k}\right)x^{k}a^{n-k}$$
|
||||
|
||||
And that is an equation by itself. Cheers!
|
||||
|
||||
This is another equation:
|
||||
|
||||
$$\left(1+x\right)^{n}=1+\frac{nx}{1!}+\frac{n\left(n-1\right)x^{2}}{2!}+ \text{ \textellipsis }$$
|
||||
|
||||
This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text.
|
||||
|
||||
This is a word document and this is an inline equation: $A= \pi r^{2} $ . If instead, I want an equation by line, I can do this:
|
||||
|
||||
$$e^{x}=1+\frac{x}{1!}+\frac{x^{2}}{2!}+\frac{x^{3}}{3!}+ \text{ \textellipsis } , - \infty < x < \infty$$
|
||||
|
||||
And that is an equation by itself. Cheers!
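
The Markdown ground truth above wraps the inline formula in single dollar signs and the stand-alone formulas in double dollar signs. A minimal sketch of that convention follows; the helper name `render_formula` and its `inline` flag are hypothetical and chosen only for illustration. This is not Docling's exporter, just the delimiter rule visible in the fixture.

```python
def render_formula(latex: str, inline: bool) -> str:
    """Wrap a LaTeX string in Markdown math delimiters.

    Hypothetical helper: inline formulas get $...$, display formulas get $$...$$.
    The LaTeX text is passed through unchanged, including trailing spaces.
    """
    delimiter = "$" if inline else "$$"
    return f"{delimiter}{latex}{delimiter}"


# Formula strings copied from the fixture's "text" fields.
inline_md = render_formula(r"A= \pi r^{2} ", inline=True)
display_md = render_formula(r"a^{2}+b^{2}=c^{2} \text{ \texttimes } 23", inline=False)

print(inline_md)
print(display_md)
```

The trailing space inside the inline formula is kept on purpose, since the fixture shows it preserved between `r^{2}` and the closing `$`.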
|
@ -344,7 +344,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "Header 1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -356,7 +356,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 2,
|
||||
"text": "Header 2",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -368,7 +368,7 @@
|
||||
"start_col_offset_idx": 2,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "Header 3",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -493,7 +493,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "Header 1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -505,7 +505,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 2,
|
||||
"text": "Header 2",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -517,7 +517,7 @@
|
||||
"start_col_offset_idx": 2,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "Header 3",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
}
|
||||
|
@ -13,10 +13,10 @@ Some background information here.
|
||||
- Nested item 2
|
||||
- Second item in unordered list
|
||||
|
||||
1 First item in ordered list
|
||||
1. First item in ordered list
|
||||
1. Nested ordered item 1
|
||||
2. Nested ordered item 2
|
||||
2. Second item in ordered list
|
||||
3. Second item in ordered list
|
||||
|
||||
## Data Table
|
||||
|
||||
|
@ -68,7 +68,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "Header 1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -80,7 +80,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "Header 2 & 3 (colspan)",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -181,7 +181,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "Header 1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -193,7 +193,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "Header 2 & 3 (colspan)",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -205,7 +205,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "Header 2 & 3 (colspan)",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
}
|
||||
|
@ -68,7 +68,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "Header 1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -80,7 +80,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "Header 2 & 3 (colspan)",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -181,7 +181,7 @@
|
||||
"start_col_offset_idx": 0,
|
||||
"end_col_offset_idx": 1,
|
||||
"text": "Header 1",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -193,7 +193,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "Header 2 & 3 (colspan)",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
},
|
||||
@ -205,7 +205,7 @@
|
||||
"start_col_offset_idx": 1,
|
||||
"end_col_offset_idx": 3,
|
||||
"text": "Header 2 & 3 (colspan)",
|
||||
"column_header": false,
|
||||
"column_header": true,
|
||||
"row_header": false,
|
||||
"row_section": false
|
||||
}
|
||||
|
22
tests/data/groundtruth/docling_v2/example_07.html.itxt
Normal file
@ -0,0 +1,22 @@
|
||||
item-0 at level 0: unspecified: group _root_
|
||||
item-1 at level 1: list: group list
|
||||
item-2 at level 2: list_item: Asia
|
||||
item-3 at level 3: list: group list
|
||||
item-4 at level 4: list_item: China
|
||||
item-5 at level 4: list_item: Japan
|
||||
item-6 at level 4: list_item: Thailand
|
||||
item-7 at level 2: list_item: Europe
|
||||
item-8 at level 3: list: group list
|
||||
item-9 at level 4: list_item: UK
|
||||
item-10 at level 4: list_item: Germany
|
||||
item-11 at level 4: list_item: Switzerland
|
||||
item-12 at level 5: list: group list
|
||||
item-13 at level 6: list: group list
|
||||
item-14 at level 7: list_item: Bern
|
||||
item-15 at level 7: list_item: Aargau
|
||||
item-16 at level 4: list_item: Italy
|
||||
item-17 at level 5: list: group list
|
||||
item-18 at level 6: list: group list
|
||||
item-19 at level 7: list_item: Piedmont
|
||||
item-20 at level 7: list_item: Liguria
|
||||
item-21 at level 2: list_item: Africa
|
374
tests/data/groundtruth/docling_v2/example_07.html.json
Normal file
@ -0,0 +1,374 @@
|
||||
{
|
||||
"schema_name": "DoclingDocument",
|
||||
"version": "1.2.0",
|
||||
"name": "example_07",
|
||||
"origin": {
|
||||
"mimetype": "text/html",
|
||||
"binary_hash": 623628706615267627,
|
||||
"filename": "example_07.html"
|
||||
},
|
||||
"furniture": {
|
||||
"self_ref": "#/furniture",
|
||||
"children": [],
|
||||
"content_layer": "furniture",
|
||||
"name": "_root_",
|
||||
"label": "unspecified"
|
||||
},
|
||||
"body": {
|
||||
"self_ref": "#/body",
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/groups/0"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"name": "_root_",
|
||||
"label": "unspecified"
|
||||
},
|
||||
"groups": [
|
||||
{
|
||||
"self_ref": "#/groups/0",
|
||||
"parent": {
|
||||
"$ref": "#/body"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/0"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/4"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/13"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"name": "list",
|
||||
"label": "list"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/groups/1",
|
||||
"parent": {
|
||||
"$ref": "#/texts/0"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/1"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/2"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/3"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"name": "list",
|
||||
"label": "list"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/groups/2",
|
||||
"parent": {
|
||||
"$ref": "#/texts/4"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/5"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/6"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/7"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/10"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"name": "list",
|
||||
"label": "list"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/groups/3",
|
||||
"parent": {
|
||||
"$ref": "#/texts/7"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/groups/4"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"name": "list",
|
||||
"label": "list"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/groups/4",
|
||||
"parent": {
|
||||
"$ref": "#/groups/3"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/8"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/9"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"name": "list",
|
||||
"label": "list"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/groups/5",
|
||||
"parent": {
|
||||
"$ref": "#/texts/10"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/groups/6"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"name": "list",
|
||||
"label": "list"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/groups/6",
|
||||
"parent": {
|
||||
"$ref": "#/groups/5"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/texts/11"
|
||||
},
|
||||
{
|
||||
"$ref": "#/texts/12"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"name": "list",
|
||||
"label": "list"
|
||||
}
|
||||
],
|
||||
"texts": [
|
||||
{
|
||||
"self_ref": "#/texts/0",
|
||||
"parent": {
|
||||
"$ref": "#/groups/0"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/groups/1"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"prov": [],
|
||||
"orig": "Asia",
|
||||
"text": "Asia",
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/1",
|
||||
"parent": {
|
||||
"$ref": "#/groups/1"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"prov": [],
|
||||
"orig": "China",
|
||||
"text": "China",
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/2",
|
||||
"parent": {
|
||||
"$ref": "#/groups/1"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"prov": [],
|
||||
"orig": "Japan",
|
||||
"text": "Japan",
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/3",
|
||||
"parent": {
|
||||
"$ref": "#/groups/1"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"prov": [],
|
||||
"orig": "Thailand",
|
||||
"text": "Thailand",
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/4",
|
||||
"parent": {
|
||||
"$ref": "#/groups/0"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/groups/2"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"prov": [],
|
||||
"orig": "Europe",
|
||||
"text": "Europe",
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/5",
|
||||
"parent": {
|
||||
"$ref": "#/groups/2"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"prov": [],
|
||||
"orig": "UK",
|
||||
"text": "UK",
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/6",
|
||||
"parent": {
|
||||
"$ref": "#/groups/2"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"prov": [],
|
||||
"orig": "Germany",
|
||||
"text": "Germany",
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/7",
|
||||
"parent": {
|
||||
"$ref": "#/groups/2"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/groups/3"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"prov": [],
|
||||
"orig": "Switzerland",
|
||||
"text": "Switzerland",
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/8",
|
||||
"parent": {
|
||||
"$ref": "#/groups/4"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"prov": [],
|
||||
"orig": "Bern",
|
||||
"text": "Bern",
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/9",
|
||||
"parent": {
|
||||
"$ref": "#/groups/4"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"prov": [],
|
||||
"orig": "Aargau",
|
||||
"text": "Aargau",
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/10",
|
||||
"parent": {
|
||||
"$ref": "#/groups/2"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"$ref": "#/groups/5"
|
||||
}
|
||||
],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"prov": [],
|
||||
"orig": "Italy",
|
||||
"text": "Italy",
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/11",
|
||||
"parent": {
|
||||
"$ref": "#/groups/6"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"prov": [],
|
||||
"orig": "Piedmont",
|
||||
"text": "Piedmont",
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/12",
|
||||
"parent": {
|
||||
"$ref": "#/groups/6"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"prov": [],
|
||||
"orig": "Liguria",
|
||||
"text": "Liguria",
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
},
|
||||
{
|
||||
"self_ref": "#/texts/13",
|
||||
"parent": {
|
||||
"$ref": "#/groups/0"
|
||||
},
|
||||
"children": [],
|
||||
"content_layer": "body",
|
||||
"label": "list_item",
|
||||
"prov": [],
|
||||
"orig": "Africa",
|
||||
"text": "Africa",
|
||||
"enumerated": false,
|
||||
"marker": "-"
|
||||
}
|
||||
],
|
||||
"pictures": [],
|
||||
"tables": [],
|
||||
"key_value_items": [],
|
||||
"form_items": [],
|
||||
"pages": {}
|
||||
}
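
The JSON above links nodes through `$ref` strings such as `#/groups/0` and `#/texts/7`, and list items can themselves be parents of nested groups. A minimal sketch of walking such a structure with only the standard data types is shown below. The helper names `resolve_ref` and `walk` are hypothetical, the sample `doc` is a trimmed-down stand-in for the fixture, and none of this is Docling's own API.

```python
def resolve_ref(doc: dict, ref: str):
    """Resolve a local reference like '#/texts/4' against the document dict."""
    if not ref.startswith("#/"):
        raise ValueError(f"unsupported reference: {ref!r}")
    node = doc
    for part in ref[2:].split("/"):
        # Lists ("groups", "texts") are indexed by integer, dicts by key.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def walk(doc: dict, ref: str = "#/body", level: int = 0):
    """Yield (level, label, text-or-name) for each node in depth-first order."""
    node = resolve_ref(doc, ref)
    yield level, node["label"], node.get("text", node.get("name", ""))
    for child in node.get("children", []):
        yield from walk(doc, child["$ref"], level + 1)


# Trimmed-down stand-in for the fixture: one list group with two items.
doc = {
    "body": {
        "self_ref": "#/body",
        "children": [{"$ref": "#/groups/0"}],
        "name": "_root_",
        "label": "unspecified",
    },
    "groups": [
        {
            "self_ref": "#/groups/0",
            "children": [{"$ref": "#/texts/0"}, {"$ref": "#/texts/1"}],
            "name": "list",
            "label": "list",
        }
    ],
    "texts": [
        {"self_ref": "#/texts/0", "children": [], "label": "list_item", "text": "Asia"},
        {"self_ref": "#/texts/1", "children": [], "label": "list_item", "text": "Africa"},
    ],
}

rows = list(walk(doc))
for level, label, text in rows:
    print("  " * level + f"{label}: {text}")
```

Because each node is reached through its parent's `children` list, the same traversal handles the deeper nesting in the fixture (for example `Switzerland` pointing to `#/groups/3`) without special cases.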
|
14
tests/data/groundtruth/docling_v2/example_07.html.md
Normal file
@ -0,0 +1,14 @@
|
||||
- Asia
|
||||
- China
|
||||
- Japan
|
||||
- Thailand
|
||||
- Europe
|
||||
- UK
|
||||
- Germany
|
||||
- Switzerland
|
||||
- Bern
|
||||
- Aargau
|
||||
- Italy
|
||||
- Piedmont
|
||||
- Liguria
|
||||
- Africa
|
Some files were not shown because too many files have changed in this diff.