Merge remote-tracking branch 'origin/main' into feat-factory-plugins

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
This commit is contained in:
Michele Dolfi 2025-03-17 18:24:14 +01:00
commit 66fe0049fb
142 changed files with 3267 additions and 1231 deletions

2
.github/SECURITY.md vendored
View File

@ -20,4 +20,4 @@ After the initial reply to your report, the security team will keep you informed
## Security Alerts
We will send announcements of security vulnerabilities and steps to remediate on the [Docling announcements](https://github.com/DS4SD/docling/discussions/categories/announcements).
We will send announcements of security vulnerabilities and steps to remediate on the [Docling announcements](https://github.com/docling-project/docling/discussions/categories/announcements).

View File

@ -10,7 +10,7 @@ on:
jobs:
build-docs:
if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'DS4SD/docling' && github.event.pull_request.head.repo.full_name != 'ds4sd/docling') }}
if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'docling-project/docling' && github.event.pull_request.head.repo.full_name != 'docling-project/docling') }}
uses: ./.github/workflows/docs.yml
with:
deploy: false

View File

@ -15,5 +15,5 @@ env:
jobs:
code-checks:
if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'DS4SD/docling' && github.event.pull_request.head.repo.full_name != 'ds4sd/docling') }}
if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'docling-project/docling' && github.event.pull_request.head.repo.full_name != 'docling-project/docling') }}
uses: ./.github/workflows/checks.yml

File diff suppressed because it is too large Load Diff

View File

@ -2,13 +2,13 @@
Our project welcomes external contributions. If you have an itch, please feel
free to scratch it.
To contribute code or documentation, please submit a [pull request](https://github.com/DS4SD/docling/pulls).
To contribute code or documentation, please submit a [pull request](https://github.com/docling-project/docling/pulls).
A good way to familiarize yourself with the codebase and contribution process is
to look for and tackle low-hanging fruit in the [issue tracker](https://github.com/DS4SD/docling/issues).
to look for and tackle low-hanging fruit in the [issue tracker](https://github.com/docling-project/docling/issues).
Before embarking on a more ambitious contribution, please quickly [get in touch](#communication) with us.
For general questions or support requests, please refer to the [discussion section](https://github.com/DS4SD/docling/discussions).
For general questions or support requests, please refer to the [discussion section](https://github.com/docling-project/docling/discussions).
**Note: We appreciate your effort and want to avoid situations where a contribution
requires extensive rework (by you or by us), sits in the backlog for a long time, or
@ -16,14 +16,14 @@ cannot be accepted at all!**
### Proposing New Features
If you would like to implement a new feature, please [raise an issue](https://github.com/DS4SD/docling/issues)
If you would like to implement a new feature, please [raise an issue](https://github.com/docling-project/docling/issues)
before sending a pull request so the feature can be discussed. This is to avoid
you spending valuable time working on a feature that the project developers
are not interested in accepting into the codebase.
### Fixing Bugs
If you would like to fix a bug, please [raise an issue](https://github.com/DS4SD/docling/issues) before sending a
If you would like to fix a bug, please [raise an issue](https://github.com/docling-project/docling/issues) before sending a
pull request so it can be tracked.
### Merge Approval
@ -78,7 +78,7 @@ This project strictly adheres to using dependencies that are compatible with the
## Communication
Please feel free to connect with us using the [discussion section](https://github.com/DS4SD/docling/discussions).
Please feel free to connect with us using the [discussion section](https://github.com/docling-project/docling/discussions).

View File

@ -1,6 +1,6 @@
<p align="center">
<a href="https://github.com/ds4sd/docling">
<img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/docs/assets/docling_processing.png" width="100%"/>
<a href="https://github.com/docling-project/docling">
<img loading="lazy" alt="Docling" src="https://github.com/docling-project/docling/raw/main/docs/assets/docling_processing.png" width="100%"/>
</a>
</p>
@ -11,7 +11,7 @@
</p>
[![arXiv](https://img.shields.io/badge/arXiv-2408.09869-b31b1b.svg)](https://arxiv.org/abs/2408.09869)
[![Docs](https://img.shields.io/badge/docs-live-brightgreen)](https://ds4sd.github.io/docling/)
[![Docs](https://img.shields.io/badge/docs-live-brightgreen)](https://docling-project.github.io/docling/)
[![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/docling)](https://pypi.org/project/docling/)
[![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
@ -19,7 +19,7 @@
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
[![License MIT](https://img.shields.io/github/license/docling-project/docling)](https://opensource.org/licenses/MIT)
[![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
@ -51,7 +51,7 @@ pip install docling
Works on macOS, Linux and Windows environments. Both x86_64 and arm64 architectures.
More [detailed installation instructions](https://ds4sd.github.io/docling/installation/) are available in the docs.
More [detailed installation instructions](https://docling-project.github.io/docling/installation/) are available in the docs.
## Getting started
@ -66,28 +66,28 @@ result = converter.convert(source)
print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]"
```
More [advanced usage options](https://ds4sd.github.io/docling/usage/) are available in
More [advanced usage options](https://docling-project.github.io/docling/usage/) are available in
the docs.
## Documentation
Check out Docling's [documentation](https://ds4sd.github.io/docling/), for details on
Check out Docling's [documentation](https://docling-project.github.io/docling/), for details on
installation, usage, concepts, recipes, extensions, and more.
## Examples
Go hands-on with our [examples](https://ds4sd.github.io/docling/examples/),
Go hands-on with our [examples](https://docling-project.github.io/docling/examples/),
demonstrating how to address different application use cases with Docling.
## Integrations
To further accelerate your AI application development, check out Docling's native
[integrations](https://ds4sd.github.io/docling/integrations/) with popular frameworks
[integrations](https://docling-project.github.io/docling/integrations/) with popular frameworks
and tools.
## Get help and support
Please feel free to connect with us using the [discussion section](https://github.com/DS4SD/docling/discussions).
Please feel free to connect with us using the [discussion section](https://github.com/docling-project/docling/discussions).
## Technical report
@ -95,7 +95,7 @@ For more details on Docling's inner workings, check out the [Docling Technical R
## Contributing
Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.
Please read [Contributing to Docling](https://github.com/docling-project/docling/blob/main/CONTRIBUTING.md) for details.
## References
@ -123,6 +123,6 @@ For individual model usage, please refer to the model licenses found in the orig
Docling has been brought to you by IBM.
[supported_formats]: https://ds4sd.github.io/docling/usage/supported_formats/
[docling_document]: https://ds4sd.github.io/docling/concepts/docling_document/
[integrations]: https://ds4sd.github.io/docling/integrations/
[supported_formats]: https://docling-project.github.io/docling/usage/supported_formats/
[docling_document]: https://docling-project.github.io/docling/concepts/docling_document/
[integrations]: https://docling-project.github.io/docling/integrations/

View File

@ -380,7 +380,7 @@ class AsciiDocBackend(DeclarativeDocumentBackend):
end_row_offset_idx=row_idx + row_span,
start_col_offset_idx=col_idx,
end_col_offset_idx=col_idx + col_span,
col_header=False,
column_header=row_idx == 0,
row_header=False,
)
data.table_cells.append(cell)

View File

@ -111,7 +111,7 @@ class CsvDocumentBackend(DeclarativeDocumentBackend):
end_row_offset_idx=row_idx + 1,
start_col_offset_idx=col_idx,
end_col_offset_idx=col_idx + 1,
col_header=row_idx == 0, # First row as header
column_header=row_idx == 0, # First row as header
row_header=False,
)
table_data.table_cells.append(cell)

View File

View File

View File

@ -0,0 +1,271 @@
# -*- coding: utf-8 -*-
"""
Adapted from https://github.com/xiilei/dwml/blob/master/dwml/latex_dict.py
On 23/01/2025
"""
from __future__ import unicode_literals
CHARS = ("{", "}", "_", "^", "#", "&", "$", "%", "~")
BLANK = ""
BACKSLASH = "\\"
ALN = "&"
CHR = {
# Unicode : Latex Math Symbols
# Top accents
"\u0300": "\\grave{{{0}}}",
"\u0301": "\\acute{{{0}}}",
"\u0302": "\\hat{{{0}}}",
"\u0303": "\\tilde{{{0}}}",
"\u0304": "\\bar{{{0}}}",
"\u0305": "\\overbar{{{0}}}",
"\u0306": "\\breve{{{0}}}",
"\u0307": "\\dot{{{0}}}",
"\u0308": "\\ddot{{{0}}}",
"\u0309": "\\ovhook{{{0}}}",
"\u030a": "\\ocirc{{{0}}}}",
"\u030c": "\\check{{{0}}}}",
"\u0310": "\\candra{{{0}}}",
"\u0312": "\\oturnedcomma{{{0}}}",
"\u0315": "\\ocommatopright{{{0}}}",
"\u031a": "\\droang{{{0}}}",
"\u0338": "\\not{{{0}}}",
"\u20d0": "\\leftharpoonaccent{{{0}}}",
"\u20d1": "\\rightharpoonaccent{{{0}}}",
"\u20d2": "\\vertoverlay{{{0}}}",
"\u20d6": "\\overleftarrow{{{0}}}",
"\u20d7": "\\vec{{{0}}}",
"\u20db": "\\dddot{{{0}}}",
"\u20dc": "\\ddddot{{{0}}}",
"\u20e1": "\\overleftrightarrow{{{0}}}",
"\u20e7": "\\annuity{{{0}}}",
"\u20e9": "\\widebridgeabove{{{0}}}",
"\u20f0": "\\asteraccent{{{0}}}",
# Bottom accents
"\u0330": "\\wideutilde{{{0}}}",
"\u0331": "\\underbar{{{0}}}",
"\u20e8": "\\threeunderdot{{{0}}}",
"\u20ec": "\\underrightharpoondown{{{0}}}",
"\u20ed": "\\underleftharpoondown{{{0}}}",
"\u20ee": "\\underledtarrow{{{0}}}",
"\u20ef": "\\underrightarrow{{{0}}}",
# Over | group
"\u23b4": "\\overbracket{{{0}}}",
"\u23dc": "\\overparen{{{0}}}",
"\u23de": "\\overbrace{{{0}}}",
# Under| group
"\u23b5": "\\underbracket{{{0}}}",
"\u23dd": "\\underparen{{{0}}}",
"\u23df": "\\underbrace{{{0}}}",
}
CHR_BO = {
# Big operators,
"\u2140": "\\Bbbsum",
"\u220f": "\\prod",
"\u2210": "\\coprod",
"\u2211": "\\sum",
"\u222b": "\\int",
"\u22c0": "\\bigwedge",
"\u22c1": "\\bigvee",
"\u22c2": "\\bigcap",
"\u22c3": "\\bigcup",
"\u2a00": "\\bigodot",
"\u2a01": "\\bigoplus",
"\u2a02": "\\bigotimes",
}
T = {
"\u2192": "\\rightarrow ",
# Greek letters
"\U0001d6fc": "\\alpha ",
"\U0001d6fd": "\\beta ",
"\U0001d6fe": "\\gamma ",
"\U0001d6ff": "\\theta ",
"\U0001d700": "\\epsilon ",
"\U0001d701": "\\zeta ",
"\U0001d702": "\\eta ",
"\U0001d703": "\\theta ",
"\U0001d704": "\\iota ",
"\U0001d705": "\\kappa ",
"\U0001d706": "\\lambda ",
"\U0001d707": "\\m ",
"\U0001d708": "\\n ",
"\U0001d709": "\\xi ",
"\U0001d70a": "\\omicron ",
"\U0001d70b": "\\pi ",
"\U0001d70c": "\\rho ",
"\U0001d70d": "\\varsigma ",
"\U0001d70e": "\\sigma ",
"\U0001d70f": "\\ta ",
"\U0001d710": "\\upsilon ",
"\U0001d711": "\\phi ",
"\U0001d712": "\\chi ",
"\U0001d713": "\\psi ",
"\U0001d714": "\\omega ",
"\U0001d715": "\\partial ",
"\U0001d716": "\\varepsilon ",
"\U0001d717": "\\vartheta ",
"\U0001d718": "\\varkappa ",
"\U0001d719": "\\varphi ",
"\U0001d71a": "\\varrho ",
"\U0001d71b": "\\varpi ",
# Relation symbols
"\u2190": "\\leftarrow ",
"\u2191": "\\uparrow ",
"\u2192": "\\rightarrow ",
"\u2193": "\\downright ",
"\u2194": "\\leftrightarrow ",
"\u2195": "\\updownarrow ",
"\u2196": "\\nwarrow ",
"\u2197": "\\nearrow ",
"\u2198": "\\searrow ",
"\u2199": "\\swarrow ",
"\u22ee": "\\vdots ",
"\u22ef": "\\cdots ",
"\u22f0": "\\adots ",
"\u22f1": "\\ddots ",
"\u2260": "\\ne ",
"\u2264": "\\leq ",
"\u2265": "\\geq ",
"\u2266": "\\leqq ",
"\u2267": "\\geqq ",
"\u2268": "\\lneqq ",
"\u2269": "\\gneqq ",
"\u226a": "\\ll ",
"\u226b": "\\gg ",
"\u2208": "\\in ",
"\u2209": "\\notin ",
"\u220b": "\\ni ",
"\u220c": "\\nni ",
# Ordinary symbols
"\u221e": "\\infty ",
# Binary relations
"\u00b1": "\\pm ",
"\u2213": "\\mp ",
# Italic, Latin, uppercase
"\U0001d434": "A",
"\U0001d435": "B",
"\U0001d436": "C",
"\U0001d437": "D",
"\U0001d438": "E",
"\U0001d439": "F",
"\U0001d43a": "G",
"\U0001d43b": "H",
"\U0001d43c": "I",
"\U0001d43d": "J",
"\U0001d43e": "K",
"\U0001d43f": "L",
"\U0001d440": "M",
"\U0001d441": "N",
"\U0001d442": "O",
"\U0001d443": "P",
"\U0001d444": "Q",
"\U0001d445": "R",
"\U0001d446": "S",
"\U0001d447": "T",
"\U0001d448": "U",
"\U0001d449": "V",
"\U0001d44a": "W",
"\U0001d44b": "X",
"\U0001d44c": "Y",
"\U0001d44d": "Z",
# Italic, Latin, lowercase
"\U0001d44e": "a",
"\U0001d44f": "b",
"\U0001d450": "c",
"\U0001d451": "d",
"\U0001d452": "e",
"\U0001d453": "f",
"\U0001d454": "g",
"\U0001d456": "i",
"\U0001d457": "j",
"\U0001d458": "k",
"\U0001d459": "l",
"\U0001d45a": "m",
"\U0001d45b": "n",
"\U0001d45c": "o",
"\U0001d45d": "p",
"\U0001d45e": "q",
"\U0001d45f": "r",
"\U0001d460": "s",
"\U0001d461": "t",
"\U0001d462": "u",
"\U0001d463": "v",
"\U0001d464": "w",
"\U0001d465": "x",
"\U0001d466": "y",
"\U0001d467": "z",
}
FUNC = {
"sin": "\\sin({fe})",
"cos": "\\cos({fe})",
"tan": "\\tan({fe})",
"arcsin": "\\arcsin({fe})",
"arccos": "\\arccos({fe})",
"arctan": "\\arctan({fe})",
"arccot": "\\arccot({fe})",
"sinh": "\\sinh({fe})",
"cosh": "\\cosh({fe})",
"tanh": "\\tanh({fe})",
"coth": "\\coth({fe})",
"sec": "\\sec({fe})",
"csc": "\\csc({fe})",
}
FUNC_PLACE = "{fe}"
BRK = "\\\\"
CHR_DEFAULT = {
"ACC_VAL": "\\hat{{{0}}}",
}
POS = {
"top": "\\overline{{{0}}}", # not sure
"bot": "\\underline{{{0}}}",
}
POS_DEFAULT = {
"BAR_VAL": "\\overline{{{0}}}",
}
SUB = "_{{{0}}}"
SUP = "^{{{0}}}"
F = {
"bar": "\\frac{{{num}}}{{{den}}}",
"skw": r"^{{{num}}}/_{{{den}}}",
"noBar": "\\genfrac{{}}{{}}{{0pt}}{{}}{{{num}}}{{{den}}}",
"lin": "{{{num}}}/{{{den}}}",
}
F_DEFAULT = "\\frac{{{num}}}{{{den}}}"
D = "\\left{left}{text}\\right{right}"
D_DEFAULT = {
"left": "(",
"right": ")",
"null": ".",
}
RAD = "\\sqrt[{deg}]{{{text}}}"
RAD_DEFAULT = "\\sqrt{{{text}}}"
ARR = "{text}"
LIM_FUNC = {
"lim": "\\lim_{{{lim}}}",
"max": "\\max_{{{lim}}}",
"min": "\\min_{{{lim}}}",
}
LIM_TO = ("\\rightarrow", "\\to")
LIM_UPP = "\\overset{{{lim}}}{{{text}}}"
M = "\\begin{{matrix}}{text}\\end{{matrix}}"

View File

@ -0,0 +1,453 @@
"""
Office Math Markup Language (OMML)
Adapted from https://github.com/xiilei/dwml/blob/master/dwml/omml.py
On 23/01/2025
"""
import lxml.etree as ET
from pylatexenc.latexencode import UnicodeToLatexEncoder
from docling.backend.docx.latex.latex_dict import (
ALN,
ARR,
BACKSLASH,
BLANK,
BRK,
CHARS,
CHR,
CHR_BO,
CHR_DEFAULT,
D_DEFAULT,
F_DEFAULT,
FUNC,
FUNC_PLACE,
LIM_FUNC,
LIM_TO,
LIM_UPP,
POS,
POS_DEFAULT,
RAD,
RAD_DEFAULT,
SUB,
SUP,
D,
F,
M,
T,
)
OMML_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/math}"
def load(stream):
tree = ET.parse(stream)
for omath in tree.findall(OMML_NS + "oMath"):
yield oMath2Latex(omath)
def load_string(string):
root = ET.fromstring(string)
for omath in root.findall(OMML_NS + "oMath"):
yield oMath2Latex(omath)
def escape_latex(strs):
last = None
new_chr = []
strs = strs.replace(r"\\", "\\")
for c in strs:
if (c in CHARS) and (last != BACKSLASH):
new_chr.append(BACKSLASH + c)
else:
new_chr.append(c)
last = c
return BLANK.join(new_chr)
def get_val(key, default=None, store=CHR):
if key is not None:
return key if not store else store.get(key, key)
else:
return default
class Tag2Method(object):
def call_method(self, elm, stag=None):
getmethod = self.tag2meth.get
if stag is None:
stag = elm.tag.replace(OMML_NS, "")
method = getmethod(stag)
if method:
return method(self, elm)
else:
return None
def process_children_list(self, elm, include=None):
"""
process children of the elm,return iterable
"""
for _e in list(elm):
if OMML_NS not in _e.tag:
continue
stag = _e.tag.replace(OMML_NS, "")
if include and (stag not in include):
continue
t = self.call_method(_e, stag=stag)
if t is None:
t = self.process_unknow(_e, stag)
if t is None:
continue
yield (stag, t, _e)
def process_children_dict(self, elm, include=None):
"""
process children of the elm,return dict
"""
latex_chars = dict()
for stag, t, e in self.process_children_list(elm, include):
latex_chars[stag] = t
return latex_chars
def process_children(self, elm, include=None):
"""
process children of the elm,return string
"""
return BLANK.join(
(
t if not isinstance(t, Tag2Method) else str(t)
for stag, t, e in self.process_children_list(elm, include)
)
)
def process_unknow(self, elm, stag):
return None
class Pr(Tag2Method):
text = ""
__val_tags = ("chr", "pos", "begChr", "endChr", "type")
__innerdict = None # can't use the __dict__
""" common properties of element"""
def __init__(self, elm):
self.__innerdict = {}
self.text = self.process_children(elm)
def __str__(self):
return self.text
def __unicode__(self):
return self.__str__(self)
def __getattr__(self, name):
return self.__innerdict.get(name, None)
def do_brk(self, elm):
self.__innerdict["brk"] = BRK
return BRK
def do_common(self, elm):
stag = elm.tag.replace(OMML_NS, "")
if stag in self.__val_tags:
t = elm.get("{0}val".format(OMML_NS))
self.__innerdict[stag] = t
return None
tag2meth = {
"brk": do_brk,
"chr": do_common,
"pos": do_common,
"begChr": do_common,
"endChr": do_common,
"type": do_common,
}
class oMath2Latex(Tag2Method):
"""
Convert oMath element of omml to latex
"""
_t_dict = T
__direct_tags = ("box", "sSub", "sSup", "sSubSup", "num", "den", "deg", "e")
u = UnicodeToLatexEncoder(
replacement_latex_protection="braces-all",
unknown_char_policy="keep",
unknown_char_warning=False,
)
def __init__(self, element):
self._latex = self.process_children(element)
def __str__(self):
return self.latex.replace(" ", " ")
def __unicode__(self):
return self.__str__(self)
def process_unknow(self, elm, stag):
if stag in self.__direct_tags:
return self.process_children(elm)
elif stag[-2:] == "Pr":
return Pr(elm)
else:
return None
@property
def latex(self):
return self._latex
def do_acc(self, elm):
"""
the accent function
"""
c_dict = self.process_children_dict(elm)
latex_s = get_val(
c_dict["accPr"].chr, default=CHR_DEFAULT.get("ACC_VAL"), store=CHR
)
return latex_s.format(c_dict["e"])
def do_bar(self, elm):
"""
the bar function
"""
c_dict = self.process_children_dict(elm)
pr = c_dict["barPr"]
latex_s = get_val(pr.pos, default=POS_DEFAULT.get("BAR_VAL"), store=POS)
return pr.text + latex_s.format(c_dict["e"])
def do_d(self, elm):
"""
the delimiter object
"""
c_dict = self.process_children_dict(elm)
pr = c_dict["dPr"]
null = D_DEFAULT.get("null")
s_val = get_val(pr.begChr, default=D_DEFAULT.get("left"), store=T)
e_val = get_val(pr.endChr, default=D_DEFAULT.get("right"), store=T)
delim = pr.text + D.format(
left=null if not s_val else escape_latex(s_val),
text=c_dict["e"],
right=null if not e_val else escape_latex(e_val),
)
return delim
def do_spre(self, elm):
"""
the Pre-Sub-Superscript object -- Not support yet
"""
pass
def do_sub(self, elm):
text = self.process_children(elm)
return SUB.format(text)
def do_sup(self, elm):
text = self.process_children(elm)
return SUP.format(text)
def do_f(self, elm):
"""
the fraction object
"""
c_dict = self.process_children_dict(elm)
pr = c_dict["fPr"]
latex_s = get_val(pr.type, default=F_DEFAULT, store=F)
return pr.text + latex_s.format(num=c_dict.get("num"), den=c_dict.get("den"))
def do_func(self, elm):
"""
the Function-Apply object (Examples:sin cos)
"""
c_dict = self.process_children_dict(elm)
func_name = c_dict.get("fName")
return func_name.replace(FUNC_PLACE, c_dict.get("e"))
def do_fname(self, elm):
"""
the func name
"""
latex_chars = []
for stag, t, e in self.process_children_list(elm):
if stag == "r":
if FUNC.get(t):
latex_chars.append(FUNC[t])
else:
raise NotSupport("Not support func %s" % t)
else:
latex_chars.append(t)
t = BLANK.join(latex_chars)
return t if FUNC_PLACE in t else t + FUNC_PLACE # do_func will replace this
def do_groupchr(self, elm):
"""
the Group-Character object
"""
c_dict = self.process_children_dict(elm)
pr = c_dict["groupChrPr"]
latex_s = get_val(pr.chr)
return pr.text + latex_s.format(c_dict["e"])
def do_rad(self, elm):
"""
the radical object
"""
c_dict = self.process_children_dict(elm)
text = c_dict.get("e")
deg_text = c_dict.get("deg")
if deg_text:
return RAD.format(deg=deg_text, text=text)
else:
return RAD_DEFAULT.format(text=text)
def do_eqarr(self, elm):
"""
the Array object
"""
return ARR.format(
text=BRK.join(
[t for stag, t, e in self.process_children_list(elm, include=("e",))]
)
)
def do_limlow(self, elm):
"""
the Lower-Limit object
"""
t_dict = self.process_children_dict(elm, include=("e", "lim"))
latex_s = LIM_FUNC.get(t_dict["e"])
if not latex_s:
raise NotSupport("Not support lim %s" % t_dict["e"])
else:
return latex_s.format(lim=t_dict.get("lim"))
def do_limupp(self, elm):
"""
the Upper-Limit object
"""
t_dict = self.process_children_dict(elm, include=("e", "lim"))
return LIM_UPP.format(lim=t_dict.get("lim"), text=t_dict.get("e"))
def do_lim(self, elm):
"""
the lower limit of the limLow object and the upper limit of the limUpp function
"""
return self.process_children(elm).replace(LIM_TO[0], LIM_TO[1])
def do_m(self, elm):
"""
the Matrix object
"""
rows = []
for stag, t, e in self.process_children_list(elm):
if stag == "mPr":
pass
elif stag == "mr":
rows.append(t)
return M.format(text=BRK.join(rows))
def do_mr(self, elm):
"""
a single row of the matrix m
"""
return ALN.join(
[t for stag, t, e in self.process_children_list(elm, include=("e",))]
)
def do_nary(self, elm):
"""
the n-ary object
"""
res = []
bo = ""
for stag, t, e in self.process_children_list(elm):
if stag == "naryPr":
bo = get_val(t.chr, store=CHR_BO)
else:
res.append(t)
return bo + BLANK.join(res)
def process_unicode(self, s):
# s = s if isinstance(s,unicode) else unicode(s,'utf-8')
# print(s, self._t_dict.get(s, s), unicode_to_latex(s))
# _str.append( self._t_dict.get(s, s) )
out_latex_str = self.u.unicode_to_latex(s)
# print(s, out_latex_str)
if (
s.startswith("{") is False
and out_latex_str.startswith("{")
and s.endswith("}") is False
and out_latex_str.endswith("}")
):
out_latex_str = f" {out_latex_str[1:-1]} "
# print(s, out_latex_str)
if "ensuremath" in out_latex_str:
out_latex_str = out_latex_str.replace("\\ensuremath{", " ")
out_latex_str = out_latex_str.replace("}", " ")
# print(s, out_latex_str)
if out_latex_str.strip().startswith("\\text"):
out_latex_str = f" \\text{{{out_latex_str}}} "
# print(s, out_latex_str)
return out_latex_str
def do_r(self, elm):
"""
Get text from 'r' element,And try convert them to latex symbols
@todo text style support , (sty)
@todo \text (latex pure text support)
"""
_str = []
_base_str = []
for s in elm.findtext("./{0}t".format(OMML_NS)):
out_latex_str = self.process_unicode(s)
_str.append(out_latex_str)
_base_str.append(s)
proc_str = escape_latex(BLANK.join(_str))
base_proc_str = BLANK.join(_base_str)
if "{" not in base_proc_str and "\\{" in proc_str:
proc_str = proc_str.replace("\\{", "{")
if "}" not in base_proc_str and "\\}" in proc_str:
proc_str = proc_str.replace("\\}", "}")
return proc_str
tag2meth = {
"acc": do_acc,
"r": do_r,
"bar": do_bar,
"sub": do_sub,
"sup": do_sup,
"f": do_f,
"func": do_func,
"fName": do_fname,
"groupChr": do_groupchr,
"d": do_d,
"rad": do_rad,
"eqArr": do_eqarr,
"limLow": do_limlow,
"limUpp": do_limupp,
"lim": do_lim,
"m": do_m,
"mr": do_mr,
"nary": do_nary,
}

View File

@ -134,7 +134,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
self.analyze_tag(cast(Tag, element), doc)
except Exception as exc_child:
_log.error(
f"Error processing child from tag{tag.name}: {exc_child}"
f"Error processing child from tag {tag.name}: {repr(exc_child)}"
)
raise exc_child
elif isinstance(element, NavigableString) and not isinstance(
@ -347,11 +347,11 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
content_layer=self.content_layer,
)
self.level += 1
self.walk(element, doc)
self.parents[self.level + 1] = None
self.level -= 1
self.walk(element, doc)
self.parents[self.level + 1] = None
self.level -= 1
else:
self.walk(element, doc)
elif element.text.strip():
text = element.text.strip()
@ -457,7 +457,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
end_row_offset_idx=row_idx + row_span,
start_col_offset_idx=col_idx,
end_col_offset_idx=col_idx + col_span,
col_header=col_header,
column_header=col_header,
row_header=((not col_header) and html_cell.name == "th"),
)
data.table_cells.append(table_cell)

View File

@ -136,7 +136,7 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
end_row_offset_idx=trow_ind + row_span,
start_col_offset_idx=tcol_ind,
end_col_offset_idx=tcol_ind + col_span,
col_header=False,
column_header=trow_ind == 0,
row_header=False,
)
tcells.append(icell)

View File

@ -164,7 +164,7 @@ class MsExcelDocumentBackend(DeclarativeDocumentBackend):
end_row_offset_idx=excel_cell.row + excel_cell.row_span,
start_col_offset_idx=excel_cell.col,
end_col_offset_idx=excel_cell.col + excel_cell.col_span,
col_header=False,
column_header=excel_cell.row == 0,
row_header=False,
)
table_data.table_cells.append(cell)
@ -173,7 +173,7 @@ class MsExcelDocumentBackend(DeclarativeDocumentBackend):
return doc
def _find_data_tables(self, sheet: Worksheet):
def _find_data_tables(self, sheet: Worksheet) -> List[ExcelTable]:
"""
Find all compact rectangular data tables in a sheet.
"""
@ -340,47 +340,4 @@ class MsExcelDocumentBackend(DeclarativeDocumentBackend):
except:
_log.error("could not extract the image from excel sheets")
"""
for idx, chart in enumerate(sheet._charts): # type: ignore
try:
chart_path = f"chart_{idx + 1}.png"
_log.info(
f"Chart found, but dynamic rendering is required for: {chart_path}"
)
_log.info(f"Chart {idx + 1}:")
# Chart type
# _log.info(f"Type: {type(chart).__name__}")
print(f"Type: {type(chart).__name__}")
# Extract series data
for series_idx, series in enumerate(chart.series):
#_log.info(f"Series {series_idx + 1}:")
print(f"Series {series_idx + 1} type: {type(series).__name__}")
#print(f"x-values: {series.xVal}")
#print(f"y-values: {series.yVal}")
print(f"xval type: {type(series.xVal).__name__}")
xvals = []
for _ in series.xVal.numLit.pt:
print(f"xval type: {type(_).__name__}")
if hasattr(_, 'v'):
xvals.append(_.v)
print(f"x-values: {xvals}")
yvals = []
for _ in series.yVal:
if hasattr(_, 'v'):
yvals.append(_.v)
print(f"y-values: {yvals}")
except Exception as exc:
print(exc)
continue
"""
return doc

View File

@ -346,7 +346,7 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
end_row_offset_idx=row_idx + row_span,
start_col_offset_idx=col_idx,
end_col_offset_idx=col_idx + col_span,
col_header=False,
column_header=row_idx == 0,
row_header=False,
)
if len(cell.text.strip()) > 0:

View File

@ -26,6 +26,7 @@ from PIL import Image, UnidentifiedImageError
from typing_extensions import override
from docling.backend.abstract_backend import DeclarativeDocumentBackend
from docling.backend.docx.latex.omml import oMath2Latex
from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import InputDocument
@ -260,6 +261,25 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
else:
return label, None
def handle_equations_in_text(self, element, text):
only_texts = []
only_equations = []
texts_and_equations = []
for subt in element.iter():
tag_name = etree.QName(subt).localname
if tag_name == "t" and "math" not in subt.tag:
only_texts.append(subt.text)
texts_and_equations.append(subt.text)
elif "oMath" in subt.tag and "oMathPara" not in subt.tag:
latex_equation = str(oMath2Latex(subt))
only_equations.append(latex_equation)
texts_and_equations.append(latex_equation)
if "".join(only_texts) != text:
return text
return "".join(texts_and_equations), only_equations
def handle_text_elements(
self,
element: BaseOxmlElement,
@ -268,9 +288,12 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
) -> None:
paragraph = Paragraph(element, docx_obj)
if paragraph.text is None:
raw_text = paragraph.text
text, equations = self.handle_equations_in_text(element=element, text=raw_text)
if text is None:
return
text = paragraph.text.strip()
text = text.strip()
# Common styles for bullet and numbered lists.
# "List Bullet", "List Number", "List Paragraph"
@ -323,6 +346,45 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
elif "Heading" in p_style_id:
self.add_header(doc, p_level, text)
elif len(equations) > 0:
if (raw_text is None or len(raw_text) == 0) and len(text) > 0:
# Standalone equation
level = self.get_level()
doc.add_text(
label=DocItemLabel.FORMULA,
parent=self.parents[level - 1],
text=text,
)
else:
# Inline equation
level = self.get_level()
inline_equation = doc.add_group(
label=GroupLabel.INLINE, parent=self.parents[level - 1]
)
text_tmp = text
for eq in equations:
if len(text_tmp) == 0:
break
pre_eq_text = text_tmp.split(eq, maxsplit=1)[0]
text_tmp = text_tmp.split(eq, maxsplit=1)[1]
if len(pre_eq_text) > 0:
doc.add_text(
label=DocItemLabel.PARAGRAPH,
parent=inline_equation,
text=pre_eq_text,
)
doc.add_text(
label=DocItemLabel.FORMULA,
parent=inline_equation,
text=eq,
)
if len(text_tmp) > 0:
doc.add_text(
label=DocItemLabel.PARAGRAPH,
parent=inline_equation,
text=text_tmp,
)
elif p_style_id in [
"Paragraph",
"Normal",
@ -539,7 +601,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
end_row_offset_idx=row.grid_cols_before + spanned_idx,
start_col_offset_idx=col_idx,
end_col_offset_idx=col_idx + cell.grid_span,
col_header=False,
column_header=row.grid_cols_before + row_idx == 0,
row_header=False,
)
data.table_cells.append(table_cell)

View File

@ -999,7 +999,7 @@ class PatentUsptoGrantAps(PatentUspto):
parent=self.parents[self.level],
)
last_claim.text += f" {value}" if last_claim.text else value
last_claim.text += f" {value.strip()}" if last_claim.text else value.strip()
elif field == self.Field.CAPTION.value and section in (
self.Section.SUMMARY.value,

View File

@ -209,7 +209,7 @@ def convert(
table_mode: Annotated[
TableFormerMode,
typer.Option(..., help="The mode to use in the table structure model."),
] = TableFormerMode.FAST,
] = TableFormerMode.ACCURATE,
enrich_code: Annotated[
bool,
typer.Option(..., help="Enable the code enrichment model in the pipeline."),
@ -244,7 +244,7 @@ def convert(
typer.Option(
...,
"--abort-on-error/--no-abort-on-error",
help="If enabled, the bitmap content will be processed using OCR.",
help="If enabled, the processing will be aborted when the first error is encountered.",
),
] = False,
output: Annotated[

View File

@ -121,7 +121,7 @@ def download(
"Using the CLI:",
f"`docling --artifacts-path={output_dir} FILE`",
"\n",
"Using Python: see the documentation at <https://ds4sd.github.io/docling/usage>.",
"Using Python: see the documentation at <https://docling-project.github.io/docling/usage>.",
)

View File

@ -99,7 +99,7 @@ class TableStructureOptions(BaseModel):
# are merged across table columns.
# False: Let table structure model define the text cells, ignore PDF cells.
)
mode: TableFormerMode = TableFormerMode.FAST
mode: TableFormerMode = TableFormerMode.ACCURATE
class OcrOptions(BaseOptions):

View File

@ -1,4 +1,5 @@
import re
from collections import Counter
from pathlib import Path
from typing import Iterable, List, Literal, Optional, Tuple, Union
@ -11,7 +12,7 @@ from docling_core.types.doc import (
TextItem,
)
from docling_core.types.doc.labels import CodeLanguageLabel
from PIL import Image
from PIL import Image, ImageOps
from pydantic import BaseModel
from docling.datamodel.base_models import ItemAndImageEnrichmentElement
@ -65,7 +66,7 @@ class CodeFormulaModel(BaseItemAndImageEnrichmentModel):
_model_repo_folder = "ds4sd--CodeFormula"
elements_batch_size = 5
images_scale = 1.66 # = 120 dpi, aligned with training data resolution
expansion_factor = 0.03
expansion_factor = 0.18
def __init__(
self,
@ -124,7 +125,7 @@ class CodeFormulaModel(BaseItemAndImageEnrichmentModel):
repo_id="ds4sd/CodeFormula",
force_download=force,
local_dir=local_dir,
revision="v1.0.1",
revision="v1.0.2",
)
return Path(download_path)
@ -175,7 +176,7 @@ class CodeFormulaModel(BaseItemAndImageEnrichmentModel):
- The second element is the extracted language if a match is found;
otherwise, `None`.
"""
pattern = r"^<_([^>]+)_>\s*(.*)"
pattern = r"^<_([^_>]+)_>\s(.*)"
match = re.match(pattern, input_string, flags=re.DOTALL)
if match:
language = str(match.group(1)) # the captured programming language
@ -206,6 +207,82 @@ class CodeFormulaModel(BaseItemAndImageEnrichmentModel):
except ValueError:
return CodeLanguageLabel.UNKNOWN
def _get_most_frequent_edge_color(self, pil_img: Image.Image):
"""
Compute the most frequent color along the outer edges of a PIL image.
Parameters
----------
pil_img : Image.Image
A PIL Image in any mode (L, RGB, RGBA, etc.).
Returns
-------
(int) or (tuple): The most common edge color as a scalar (for grayscale) or
tuple (for RGB/RGBA).
"""
# Convert to NumPy array for easy pixel access
img_np = np.array(pil_img)
if img_np.ndim == 2:
# Grayscale-like image: shape (H, W)
# Extract edges: top row, bottom row, left col, right col
top = img_np[0, :] # shape (W,)
bottom = img_np[-1, :] # shape (W,)
left = img_np[:, 0] # shape (H,)
right = img_np[:, -1] # shape (H,)
# Concatenate all edges
edges = np.concatenate([top, bottom, left, right])
# Count frequencies
freq = Counter(edges.tolist())
most_common_value, _ = freq.most_common(1)[0]
return int(most_common_value) # single channel color
else:
# Color image: shape (H, W, C)
top = img_np[0, :, :] # shape (W, C)
bottom = img_np[-1, :, :] # shape (W, C)
left = img_np[:, 0, :] # shape (H, C)
right = img_np[:, -1, :] # shape (H, C)
# Concatenate edges along first axis
edges = np.concatenate([top, bottom, left, right], axis=0)
# Convert each color to a tuple for counting
edges_as_tuples = [tuple(pixel) for pixel in edges]
freq = Counter(edges_as_tuples)
most_common_value, _ = freq.most_common(1)[0]
return most_common_value # e.g. (R, G, B) or (R, G, B, A)
def _pad_with_most_frequent_edge_color(
self, img: Union[Image.Image, np.ndarray], padding: Tuple[int, int, int, int]
):
"""
Pads an image (PIL or NumPy array) using the most frequent edge color.
Parameters
----------
img : Union[Image.Image, np.ndarray]
The original image.
padding : tuple
Padding (left, top, right, bottom) in pixels.
Returns
-------
Image.Image: A new PIL image with the specified padding.
"""
if isinstance(img, np.ndarray):
pil_img = Image.fromarray(img)
else:
pil_img = img
most_freq_color = self._get_most_frequent_edge_color(pil_img)
padded_img = ImageOps.expand(pil_img, border=padding, fill=most_freq_color)
return padded_img
def __call__(
self,
doc: DoclingDocument,
@ -238,7 +315,9 @@ class CodeFormulaModel(BaseItemAndImageEnrichmentModel):
assert isinstance(el.item, TextItem)
elements.append(el.item)
labels.append(el.item.label)
images.append(el.image)
images.append(
self._pad_with_most_frequent_edge_color(el.image, (20, 10, 20, 10))
)
outputs = self.code_formula_model.predict(images, labels)

View File

@ -113,7 +113,7 @@ class DocumentPictureClassifier(BaseEnrichmentModel):
repo_id="ds4sd/DocumentFigureClassifier",
force_download=force,
local_dir=local_dir,
revision="v1.0.0",
revision="v1.0.1",
)
return Path(download_path)

View File

@ -45,7 +45,7 @@ class OcrMacModel(BaseOcrModel):
"ocrmac is not correctly installed. "
"Please install it via `pip install ocrmac` to use this OCR engine. "
"Alternatively, Docling has support for other OCR engines. See the documentation: "
"https://ds4sd.github.io/docling/installation/"
"https://docling-project.github.io/docling/installation/"
)
try:
from ocrmac import ocrmac

View File

@ -95,7 +95,7 @@ class TableStructureModel(BasePageModel):
repo_id="ds4sd/docling-models",
force_download=force,
local_dir=local_dir,
revision="v2.1.0",
revision="v2.2.0",
)
return Path(download_path)

View File

@ -47,14 +47,14 @@ class TesseractOcrModel(BaseOcrModel):
"Note that tesserocr might have to be manually compiled for working with "
"your Tesseract installation. The Docling documentation provides examples for it. "
"Alternatively, Docling has support for other OCR engines. See the documentation: "
"https://ds4sd.github.io/docling/installation/"
"https://docling-project.github.io/docling/installation/"
)
missing_langs_errmsg = (
"tesserocr is not correctly configured. No language models have been detected. "
"Please ensure that the TESSDATA_PREFIX envvar points to tesseract languages dir. "
"You can find more information how to setup other OCR engines in Docling "
"documentation: "
"https://ds4sd.github.io/docling/installation/"
"https://docling-project.github.io/docling/installation/"
)
try:

View File

@ -203,6 +203,7 @@ class LayoutPostprocessor:
"""Initialize processor with cells and spatial indices."""
self.cells = cells
self.page_size = page_size
self.all_clusters = clusters
self.regular_clusters = [
c for c in clusters if c.label not in self.SPECIAL_TYPES
]
@ -267,7 +268,7 @@ class LayoutPostprocessor:
# Handle orphaned cells
unassigned = self._find_unassigned_cells(clusters)
if unassigned:
next_id = max((c.id for c in clusters), default=0) + 1
next_id = max((c.id for c in self.all_clusters), default=0) + 1
orphan_clusters = []
for i, cell in enumerate(unassigned):
conf = 1.0

View File

@ -7,7 +7,7 @@ pydantic datatype, which can express several features common to documents, such
* Layout information (i.e. bounding boxes) for all items, if available
* Provenance information
The definition of the Pydantic types is implemented in the module `docling_core.types.doc`, more details in [source code definitions](https://github.com/DS4SD/docling-core/tree/main/docling_core/types/doc).
The definition of the Pydantic types is implemented in the module `docling_core.types.doc`, more details in [source code definitions](https://github.com/docling-project/docling-core/tree/main/docling_core/types/doc).
It also brings a set of document construction APIs to build up a `DoclingDocument` from scratch.

View File

@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/backend_xml_rag.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/backend_xml_rag.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
@ -36,7 +36,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This is an example of using [Docling](https://ds4sd.github.io/docling/) for converting structured data (XML) into a unified document\n",
"This is an example of using [Docling](https://docling-project.github.io/docling/) for converting structured data (XML) into a unified document\n",
"representation format, `DoclingDocument`, and leverage its riched structured content for RAG applications.\n",
"\n",
"Data used in this example consist of patents from the [United States Patent and Trademark Office (USPTO)](https://www.uspto.gov/) and medical\n",

View File

@ -103,7 +103,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> 👉 **NOTE**: As you see above, using the `HybridChunker` can sometimes lead to a warning from the transformers library, however this is a \"false alarm\" — for details check [here](https://ds4sd.github.io/docling/faq/#hybridchunker-triggers-warning-token-indices-sequence-length-is-longer-than-the-specified-maximum-sequence-length-for-this-model)."
"> 👉 **NOTE**: As you see above, using the `HybridChunker` can sometimes lead to a warning from the transformers library, however this is a \"false alarm\" — for details check [here](https://docling-project.github.io/docling/faq/#hybridchunker-triggers-warning-token-indices-sequence-length-is-longer-than-the-specified-maximum-sequence-length-for-this-model)."
]
},
{

View File

@ -321,7 +321,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "docling-aMWN2FRM-py3.12",
"display_name": "docling-hgXEfXco-py3.12",
"language": "python",
"name": "python3"
},

View File

@ -36,7 +36,7 @@
"## A recipe 🧑‍🍳 🐥 💚\n",
"\n",
"This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system using:\n",
"- [Docling](https://ds4sd.github.io/docling/) for document parsing and chunking\n",
"- [Docling](https://docling-project.github.io/docling/) for document parsing and chunking\n",
"- [Azure AI Search](https://azure.microsoft.com/products/ai-services/ai-search/?msockid=0109678bea39665431e37323ebff6723) for vector indexing and retrieval\n",
"- [Azure OpenAI](https://azure.microsoft.com/products/ai-services/openai-service?msockid=0109678bea39665431e37323ebff6723) for embeddings and chat completion\n",
"\n",

View File

@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_haystack.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_haystack.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
@ -247,7 +247,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.12/site-packages/huggingface_hub/inference/_client.py:2232: FutureWarning: `stop_sequences` is a deprecated argument for `text_generation` task and will be removed in version '0.28.0'. Use `stop` instead.\n",
"/Users/pva/work/github.com/docling-project/docling/.venv/lib/python3.12/site-packages/huggingface_hub/inference/_client.py:2232: FutureWarning: `stop_sequences` is a deprecated argument for `text_generation` task and will be removed in version '0.28.0'. Use `stop` instead.\n",
" warnings.warn(\n"
]
}

View File

@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_langchain.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_langchain.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
@ -168,7 +168,7 @@
"source": [
"> Note: a message saying `\"Token indices sequence length is longer than the specified\n",
"maximum sequence length...\"` can be ignored in this case — details\n",
"[here](https://github.com/DS4SD/docling-core/issues/119#issuecomment-2577418826)."
"[here](https://github.com/docling-project/docling-core/issues/119#issuecomment-2577418826)."
]
},
{

View File

@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_llamaindex.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_llamaindex.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{

View File

@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_weaviate.ipynb)"
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_weaviate.ipynb)"
]
},
{
@ -29,7 +29,7 @@
"\n",
"## A recipe 🧑‍🍳 🐥 💚\n",
"\n",
"This is a code recipe that uses [Weaviate](https://weaviate.io/) to perform RAG over PDF documents parsed by [Docling](https://ds4sd.github.io/docling/).\n",
"This is a code recipe that uses [Weaviate](https://weaviate.io/) to perform RAG over PDF documents parsed by [Docling](https://docling-project.github.io/docling/).\n",
"\n",
"In this notebook, we accomplish the following:\n",
"* Parse the top machine learning papers on [arXiv](https://arxiv.org/) using Docling\n",

View File

@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/hybrid_rag_qdrant\n",
"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/hybrid_rag_qdrant\n",
".ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
@ -109,7 +109,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.12/site-packages/huggingface_hub/utils/tqdm.py:155: UserWarning: Cannot enable progress bars: environment variable `HF_HUB_DISABLE_PROGRESS_BARS=1` is set and has priority.\n",
"/Users/pva/work/github.com/docling-project/docling/.venv/lib/python3.12/site-packages/huggingface_hub/utils/tqdm.py:155: UserWarning: Cannot enable progress bars: environment variable `HF_HUB_DISABLE_PROGRESS_BARS=1` is set and has priority.\n",
" warnings.warn(\n"
]
}

View File

@ -1,6 +1,6 @@
# FAQ
This is a collection of FAQ collected from the user questions on <https://github.com/DS4SD/docling/discussions>.
This is a collection of FAQ collected from the user questions on <https://github.com/docling-project/docling/discussions>.
??? question "Is Python 3.13 supported?"
@ -41,7 +41,7 @@ This is a collection of FAQ collected from the user questions on <https://github
]
```
Source: Issue [#283](https://github.com/DS4SD/docling/issues/283#issuecomment-2465035868)
Source: Issue [#283](https://github.com/docling-project/docling/issues/283#issuecomment-2465035868)
??? question "Are text styles (bold, underline, etc) supported?"
@ -74,7 +74,7 @@ This is a collection of FAQ collected from the user questions on <https://github
)
```
Source: Issue [#326](https://github.com/DS4SD/docling/issues/326)
Source: Issue [#326](https://github.com/docling-project/docling/issues/326)
??? question " Which model weights are needed to run Docling?"
@ -84,7 +84,7 @@ This is a collection of FAQ collected from the user questions on <https://github
For processing PDF documents, Docling requires the model weights from <https://huggingface.co/ds4sd/docling-models>.
When OCR is enabled, some engines also require model artifacts. For example EasyOCR, for which Docling has [special pipeline options](https://github.com/DS4SD/docling/blob/main/docling/datamodel/pipeline_options.py#L68) to control the runtime behavior.
When OCR is enabled, some engines also require model artifacts. For example EasyOCR, for which Docling has [special pipeline options](https://github.com/docling-project/docling/blob/main/docling/datamodel/pipeline_options.py#L68) to control the runtime behavior.
??? question "SSL error downloading model weights"
@ -174,6 +174,6 @@ This is a collection of FAQ collected from the user questions on <https://github
print(f"Model max length: {tokenizer.model_max_length}")
```
Also see [docling#725](https://github.com/DS4SD/docling/issues/725).
Also see [docling#725](https://github.com/docling-project/docling/issues/725).
Source: Issue [docling-core#119](https://github.com/DS4SD/docling-core/issues/119)
Source: Issue [docling-core#119](https://github.com/docling-project/docling-core/issues/119)

View File

@ -11,7 +11,7 @@
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
[![License MIT](https://img.shields.io/github/license/docling-project/docling)](https://opensource.org/licenses/MIT)
[![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.

View File

@ -5,7 +5,7 @@ Docling is available as a converter in [Haystack](https://haystack.deepset.ai/):
- 🧑🏽‍🍳 [Docling Haystack integration example][example]
- 📦 [Docling Haystack integration PyPI][pypi]
[github]: https://github.com/DS4SD/docling-haystack
[github]: https://github.com/docling-project/docling-haystack
[docs]: https://haystack.deepset.ai/integrations/docling
[pypi]: https://pypi.org/project/docling-haystack
[example]: ../examples/rag_haystack.ipynb

View File

@ -8,7 +8,7 @@ To get started, check out the [step-by-step guide in LangChain][guide].
- 📦 [LangChain Docling integration PyPI][pypi]
[docs]: https://python.langchain.com/docs/integrations/providers/docling/
[github]: https://github.com/DS4SD/docling-langchain
[github]: https://github.com/docling-project/docling-langchain
[guide]: https://python.langchain.com/docs/integrations/document_loaders/docling/
[example]: ../examples/rag_langchain.ipynb
[pypi]: https://pypi.org/project/langchain-docling/

View File

@ -8,7 +8,7 @@ The following table provides an overview of the default enrichment models availa
| ------- | --------- | ---------------| ----------- |
| Code understanding | `do_code_enrichment` | `CodeItem` | See [docs below](#code-understanding). |
| Formula understanding | `do_formula_enrichment` | `TextItem` with label `FORMULA` | See [docs below](#formula-understanding). |
| Picrure classification | `do_picture_classification` | `PictureItem` | See [docs below](#picture-classification). |
| Picture classification | `do_picture_classification` | `PictureItem` | See [docs below](#picture-classification). |
| Picture description | `do_picture_description` | `PictureItem` | See [docs below](#picture-description). |

View File

@ -71,6 +71,13 @@ Or using the CLI:
docling --artifacts-path="/local/path/to/models" FILE
```
Or using the `DOCLING_ARTIFACTS_PATH` environment variable:
```sh
export DOCLING_ARTIFACTS_PATH="/local/path/to/models"
python my_docling_script.py
```
#### Using remote services
The main purpose of Docling is to run local models which are not sharing any user data with remote services.
@ -128,7 +135,7 @@ doc_converter = DocumentConverter(
)
```
Since docling 1.16.0: You can control which TableFormer mode you want to use. Choose between `TableFormerMode.FAST` (default) and `TableFormerMode.ACCURATE` (better, but slower) to receive better quality with difficult table structures.
Since docling 1.16.0: You can control which TableFormer mode you want to use. Choose between `TableFormerMode.FAST` (faster but less accurate) and `TableFormerMode.ACCURATE` (default) to receive better quality with difficult table structures.
```python
from docling.datamodel.base_models import InputFormat

View File

@ -1,7 +1,7 @@
site_name: Docling
site_url: https://ds4sd.github.io/docling/
repo_name: DS4SD/docling
repo_url: https://github.com/DS4SD/docling
site_url: https://docling-project.github.io/docling/
repo_name: docling-project/docling
repo_url: https://github.com/docling-project/docling
theme:
name: material

18
poetry.lock generated
View File

@ -870,13 +870,13 @@ files = [
[[package]]
name = "docling-core"
version = "2.20.0"
version = "2.22.0"
description = "A python library to define and validate data types in Docling."
optional = false
python-versions = "<4.0,>=3.9"
files = [
{file = "docling_core-2.20.0-py3-none-any.whl", hash = "sha256:72f50fce277b7bb51f4134f443240c041582184305c3bcaabdea13fc5550f160"},
{file = "docling_core-2.20.0.tar.gz", hash = "sha256:9733581c15f5a9b5e3a6cb74fa995cc4078ff16668007f86c5f75d1ea9180d7f"},
{file = "docling_core-2.22.0-py3-none-any.whl", hash = "sha256:d74d351024d016f46a09f171fb9d2d78809b132e18e25176af517ac4203c858c"},
{file = "docling_core-2.22.0.tar.gz", hash = "sha256:5e4bf15884560a5dc66482206f875d152701bb809f0ed52bbbe86133e0d559e2"},
]
[package.dependencies]
@ -4816,6 +4816,16 @@ files = [
[package.extras]
windows-terminal = ["colorama (>=0.4.6)"]
[[package]]
name = "pylatexenc"
version = "2.10"
description = "Simple LaTeX parser providing latex-to-unicode and unicode-to-latex conversion"
optional = false
python-versions = "*"
files = [
{file = "pylatexenc-2.10.tar.gz", hash = "sha256:3dd8fd84eb46dc30bee1e23eaab8d8fb5a7f507347b23e5f38ad9675c84f40d3"},
]
[[package]]
name = "pylint"
version = "2.17.7"
@ -7851,4 +7861,4 @@ vlm = ["accelerate", "transformers", "transformers"]
[metadata]
lock-version = "2.0"
python-versions = "^3.9"
content-hash = "59424e63947e6c22fceab0a3fe6f1e9ebb72abfe369708a1f55d4daf2593f433"
content-hash = "2bd40b14b7d7ce6ed8bb4e8b0d22f498e9b42033f4382d2f655099e9af4f5f7a"

View File

@ -1,24 +1,44 @@
[tool.poetry]
name = "docling"
version = "2.25.1" # DO NOT EDIT, updated automatically
version = "2.26.0" # DO NOT EDIT, updated automatically
description = "SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications."
authors = ["Christoph Auer <cau@zurich.ibm.com>", "Michele Dolfi <dol@zurich.ibm.com>", "Maxim Lysak <mly@zurich.ibm.com>", "Nikos Livathinos <nli@zurich.ibm.com>", "Ahmed Nassar <ahn@zurich.ibm.com>", "Panos Vagenas <pva@zurich.ibm.com>", "Peter Staar <taa@zurich.ibm.com>"]
authors = [
"Christoph Auer <cau@zurich.ibm.com>",
"Michele Dolfi <dol@zurich.ibm.com>",
"Maxim Lysak <mly@zurich.ibm.com>",
"Nikos Livathinos <nli@zurich.ibm.com>",
"Ahmed Nassar <ahn@zurich.ibm.com>",
"Panos Vagenas <pva@zurich.ibm.com>",
"Peter Staar <taa@zurich.ibm.com>",
]
license = "MIT"
readme = "README.md"
repository = "https://github.com/DS4SD/docling"
homepage = "https://github.com/DS4SD/docling"
keywords= ["docling", "convert", "document", "pdf", "docx", "html", "markdown", "layout model", "segmentation", "table structure", "table former"]
classifiers = [
"License :: OSI Approved :: MIT License",
"Operating System :: MacOS :: MacOS X",
"Operating System :: POSIX :: Linux",
"Development Status :: 5 - Production/Stable",
"Intended Audience :: Developers",
"Intended Audience :: Science/Research",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
"Programming Language :: Python :: 3"
]
packages = [{include = "docling"}]
repository = "https://github.com/docling-project/docling"
homepage = "https://github.com/docling-project/docling"
keywords = [
"docling",
"convert",
"document",
"pdf",
"docx",
"html",
"markdown",
"layout model",
"segmentation",
"table structure",
"table former",
]
classifiers = [
"License :: OSI Approved :: MIT License",
"Operating System :: MacOS :: MacOS X",
"Operating System :: POSIX :: Linux",
"Development Status :: 5 - Production/Stable",
"Intended Audience :: Developers",
"Intended Audience :: Science/Research",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
"Programming Language :: Python :: 3",
]
packages = [{ include = "docling" }]
[tool.poetry.dependencies]
######################
@ -26,7 +46,7 @@ packages = [{include = "docling"}]
######################
python = "^3.9"
pydantic = "^2.0.0"
docling-core = {extras = ["chunking"], version = "^2.19.0"}
docling-core = {extras = ["chunking"], version = "^2.22.0"}
docling-ibm-models = "^3.4.0"
docling-parse = "^3.3.0"
filetype = "^1.2.0"
@ -40,7 +60,7 @@ certifi = ">=2024.7.4"
rtree = "^1.3.0"
scipy = [
{ version = "^1.6.0", markers = "python_version >= '3.10'" },
{ version = ">=1.6.0,<1.14.0", markers = "python_version < '3.10'" }
{ version = ">=1.6.0,<1.14.0", markers = "python_version < '3.10'" },
]
typer = "^0.12.5"
python-docx = "^1.1.2"
@ -56,22 +76,23 @@ onnxruntime = [
# 1.19.2 is the last version with python3.9 support,
# see https://github.com/microsoft/onnxruntime/releases/tag/v1.20.0
{ version = ">=1.7.0,<1.20.0", optional = true, markers = "python_version < '3.10'" },
{ version = "^1.7.0", optional = true, markers = "python_version >= '3.10'" }
{ version = "^1.7.0", optional = true, markers = "python_version >= '3.10'" },
]
transformers = [
{markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^4.46.0", optional = true },
{markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~4.42.0", optional = true }
{ markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^4.46.0", optional = true },
{ markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~4.42.0", optional = true },
]
accelerate = [
{markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^1.2.1", optional = true },
{ markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^1.2.1", optional = true },
]
pillow = ">=10.0.0,<12.0.0"
tqdm = "^4.65.0"
pluggy = "^1.0.0"
pylatexenc = "^2.10"
[tool.poetry.group.dev.dependencies]
black = {extras = ["jupyter"], version = "^24.4.2"}
black = { extras = ["jupyter"], version = "^24.4.2" }
pytest = "^7.2.2"
pre-commit = "^3.7.1"
mypy = "^1.10.1"
@ -94,7 +115,7 @@ types-tqdm = "^4.67.0.20241221"
mkdocs-material = "^9.5.40"
mkdocs-jupyter = "^0.25.0"
mkdocs-click = "^0.8.1"
mkdocstrings = {extras = ["python"], version = "^0.27.0"}
mkdocstrings = { extras = ["python"], version = "^0.27.0" }
griffe-pydantic = "^1.1.0"
[tool.poetry.group.examples.dependencies]
@ -109,8 +130,8 @@ optional = true
[tool.poetry.group.constraints.dependencies]
numpy = [
{ version = ">=1.24.4,<3.0.0", markers = 'python_version >= "3.10"' },
{ version = ">=1.24.4,<2.1.0", markers = 'python_version < "3.10"' },
{ version = ">=1.24.4,<3.0.0", markers = 'python_version >= "3.10"' },
{ version = ">=1.24.4,<2.1.0", markers = 'python_version < "3.10"' },
]
[tool.poetry.group.mac_intel]
@ -118,12 +139,12 @@ optional = true
[tool.poetry.group.mac_intel.dependencies]
torch = [
{markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^2.2.2"},
{markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~2.2.2"}
{ markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^2.2.2" },
{ markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~2.2.2" },
]
torchvision = [
{markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^0"},
{markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~0.17.2"}
{ markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^0" },
{ markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~0.17.2" },
]
[tool.poetry.extras]
@ -151,7 +172,7 @@ include = '\.pyi?$'
[tool.isort]
profile = "black"
line_length = 88
py_version=39
py_version = 39
[tool.mypy]
pretty = true
@ -162,18 +183,19 @@ python_version = "3.10"
[[tool.mypy.overrides]]
module = [
"docling_parse.*",
"pypdfium2.*",
"networkx.*",
"scipy.*",
"filetype.*",
"tesserocr.*",
"docling_ibm_models.*",
"easyocr.*",
"ocrmac.*",
"lxml.*",
"huggingface_hub.*",
"transformers.*",
"docling_parse.*",
"pypdfium2.*",
"networkx.*",
"scipy.*",
"filetype.*",
"tesserocr.*",
"docling_ibm_models.*",
"easyocr.*",
"ocrmac.*",
"lxml.*",
"huggingface_hub.*",
"transformers.*",
"pylatexenc.*",
]
ignore_missing_imports = true

Binary file not shown.

View File

@ -12,7 +12,7 @@
</figure>
<table>
<location><page_1><loc_52><loc_62><loc_88><loc_71></location>
<row_0><col_0><col_header>3</col_0><col_1><col_header>1</col_1></row_0>
<row_0><col_0><col_header>1</col_0></row_0>
</table>
<paragraph><location><page_1><loc_52><loc_58><loc_79><loc_60></location>- b. Red-annotation of bounding boxes, Blue-predictions by TableFormer</paragraph>
<paragraph><location><page_1><loc_52><loc_46><loc_80><loc_47></location>- c. Structure predicted by TableFormer:</paragraph>
@ -25,11 +25,11 @@
</figure>
<table>
<location><page_1><loc_52><loc_37><loc_88><loc_45></location>
<row_0><col_0><col_header>0</col_0><col_1><col_header>1</col_1><col_2><col_header>1</col_2><col_3><col_header>2 1</col_3><col_4><col_header>2 1</col_4><col_5><body></col_5></row_0>
<row_1><col_0><body>3</col_0><col_1><body>4</col_1><col_2><body>5 3</col_2><col_3><body>6</col_3><col_4><body>7</col_4><col_5><body></col_5></row_1>
<row_2><col_0><body>8</col_0><col_1><body>9</col_1><col_2><body>10</col_2><col_3><body>11</col_3><col_4><body>12</col_4><col_5><body>2</col_5></row_2>
<row_3><col_0><body></col_0><col_1><body>13</col_1><col_2><body>14</col_2><col_3><body>15</col_3><col_4><body>16</col_4><col_5><body>2</col_5></row_3>
<row_4><col_0><body></col_0><col_1><body>17</col_1><col_2><body>18</col_2><col_3><body>19</col_3><col_4><body>20</col_4><col_5><body>2</col_5></row_4>
<row_0><col_0><body>0</col_0><col_1><body>1 2 1</col_1><col_2><body>1 2 1</col_2><col_3><body>1 2 1</col_3><col_4><body>1 2 1</col_4></row_0>
<row_1><col_0><body>3</col_0><col_1><body>4 3</col_1><col_2><body>5</col_2><col_3><body>6</col_3><col_4><body>7</col_4></row_1>
<row_2><col_0><body>8 2</col_0><col_1><body>9</col_1><col_2><body>10</col_2><col_3><body>11</col_3><col_4><body>12</col_4></row_2>
<row_3><col_0><body>13</col_0><col_1><body></col_1><col_2><body>14</col_2><col_3><body>15</col_3><col_4><body>16</col_4></row_3>
<row_4><col_0><body>17</col_0><col_1><body>18</col_1><col_2><body></col_2><col_3><body>19</col_3><col_4><body>20</col_4></row_4>
</table>
<paragraph><location><page_1><loc_50><loc_16><loc_89><loc_26></location>Recently, significant progress has been made with vision based approaches to extract tables in documents. For the sake of completeness, the issue of table extraction from documents is typically decomposed into two separate challenges, i.e. (1) finding the location of the table(s) on a document-page and (2) finding the structure of a given table in the document.</paragraph>
<paragraph><location><page_1><loc_50><loc_10><loc_89><loc_16></location>The first problem is called table-location and has been previously addressed [30, 38, 19, 21, 23, 26, 8] with stateof-the-art object-detection networks (e.g. YOLO and later on Mask-RCNN [9]). For all practical purposes, it can be</paragraph>
@ -138,9 +138,9 @@
<location><page_7><loc_50><loc_62><loc_87><loc_69></location>
<caption>Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Post-processing.</caption>
<row_0><col_0><col_header>Model</col_0><col_1><col_header>Dataset</col_1><col_2><col_header>mAP</col_2><col_3><col_header>mAP (PP)</col_3></row_0>
<row_1><col_0><body>EDD+BBox</col_0><col_1><body>PubTabNet</col_1><col_2><body>79.2</col_2><col_3><body>82.7</col_3></row_1>
<row_2><col_0><body>TableFormer</col_0><col_1><body>PubTabNet</col_1><col_2><body>82.1</col_2><col_3><body>86.8</col_3></row_2>
<row_3><col_0><body>TableFormer</col_0><col_1><body>SynthTabNet</col_1><col_2><body>87.7</col_2><col_3><body>-</col_3></row_3>
<row_1><col_0><row_header>EDD+BBox</col_0><col_1><body>PubTabNet</col_1><col_2><body>79.2</col_2><col_3><body>82.7</col_3></row_1>
<row_2><col_0><row_header>TableFormer</col_0><col_1><body>PubTabNet</col_1><col_2><body>82.1</col_2><col_3><body>86.8</col_3></row_2>
<row_3><col_0><row_header>TableFormer</col_0><col_1><body>SynthTabNet</col_1><col_2><body>87.7</col_2><col_3><body>-</col_3></row_3>
</table>
<caption><location><page_7><loc_50><loc_57><loc_89><loc_60></location>Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Post-processing.</caption>
<paragraph><location><page_7><loc_50><loc_34><loc_89><loc_54></location>Cell Content. In this section, we evaluate the entire pipeline of recovering a table with content. Here we put our approach to test by capitalizing on extracting content from the PDF cells rather than decoding from images. Tab. 4 shows the TEDs score of HTML code representing the structure of the table along with the content inserted in the data cell and compared with the ground-truth. Our method achieved a 5.3% increase over the state-of-the-art, and commercial solutions. We believe our scores would be higher if the HTML ground-truth matched the extracted PDF cell content. Unfortunately, there are small discrepancies such as spacings around words or special characters with various unicode representations.</paragraph>
@ -179,7 +179,7 @@
<row_6><col_0><row_header>第 17 回人工知能学会全国大会 (2003)</col_0><col_1><body>208</col_1><col_2><body>5</col_2><col_3><body>203</col_3><col_4><body>152</col_4><col_5><body>244</col_5></row_6>
<row_7><col_0><row_header>自然言語処理研究会第 146 〜 155 回</col_0><col_1><body>98</col_1><col_2><body>2</col_2><col_3><body>96</col_3><col_4><body>150</col_4><col_5><body>232</col_5></row_7>
<row_8><col_0><row_header>WWW から収集した論文</col_0><col_1><body>107</col_1><col_2><body>73</col_2><col_3><body>34</col_3><col_4><body>147</col_4><col_5><body>96</col_5></row_8>
<row_9><col_0><body></col_0><col_1><body>945</col_1><col_2><body>294</col_2><col_3><body>651</col_3><col_4><body>1122</col_4><col_5><body>955</col_5></row_9>
<row_9><col_0><row_header>計</col_0><col_1><body>945</col_1><col_2><body>294</col_2><col_3><body>651</col_3><col_4><body>1122</col_4><col_5><body>955</col_5></row_9>
</table>
<caption><location><page_8><loc_62><loc_62><loc_90><loc_63></location>Text is aligned to match original for ease of viewing</caption>
<table>

File diff suppressed because one or more lines are too long

View File

@ -25,12 +25,12 @@ The occurrence of tables in documents is ubiquitous. They often summarise quanti
Figure 1: Picture of a table with subtle, complex features such as (1) multi-column headers, (2) cell with multi-row text and (3) cells with no content. Image from PubTabNet evaluation set, filename: 'PMC2944238 004 02'.
<!-- image -->
| 0 | 1 | 1 | 2 1 | 2 1 | |
|-----|-----|-----|-------|-------|----|
| 3 | 4 | 5 3 | 6 | 7 | |
| 8 | 9 | 10 | 11 | 12 | 2 |
| | 13 | 14 | 15 | 16 | 2 |
| | 17 | 18 | 19 | 20 | 2 |
| 0 | 1 2 1 | 1 2 1 | 1 2 1 | 1 2 1 |
|-----|---------|---------|---------|---------|
| 3 | 4 3 | 5 | 6 | 7 |
| 8 2 | 9 | 10 | 11 | 12 |
| 13 | | 14 | 15 | 16 |
| 17 | 18 | | 19 | 20 |
Recently, significant progress has been made with vision based approaches to extract tables in documents. For the sake of completeness, the issue of table extraction from documents is typically decomposed into two separate challenges, i.e. (1) finding the location of the table(s) on a document-page and (2) finding the structure of a given table in the document.
@ -241,7 +241,7 @@ Text is aligned to match original for ease of viewing
| 第 17 回人工知能学会全国大会 (2003) | 208 | 5 | 203 | 152 | 244 |
| 自然言語処理研究会第 146 〜 155 回 | 98 | 2 | 96 | 150 | 232 |
| WWW から収集した論文 | 107 | 73 | 34 | 147 | 96 |
| | 945 | 294 | 651 | 1122 | 955 |
| | 945 | 294 | 651 | 1122 | 955 |
| | Shares (in millions) | Shares (in millions) | Weighted Average Grant Date Fair Value | Weighted Average Grant Date Fair Value |
|--------------------------|------------------------|------------------------|------------------------------------------|------------------------------------------|

File diff suppressed because one or more lines are too long

View File

@ -56,7 +56,7 @@
<table>
<location><page_4><loc_16><loc_63><loc_84><loc_83></location>
<caption>Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row "Total") in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.</caption>
<row_0><col_0><body></col_0><col_1><body></col_1><col_2><col_header>% of Total</col_2><col_3><col_header>% of Total</col_3><col_4><col_header>% of Total</col_4><col_5><col_header>% of Total</col_5><col_6><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_6><col_7><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_7><col_8><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_8><col_9><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_9><col_10><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_10><col_11><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_11></row_0>
<row_0><col_0><body></col_0><col_1><body></col_1><col_2><col_header>% of Total</col_2><col_3><col_header>% of Total</col_3><col_4><col_header>% of Total</col_4><col_5><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_5><col_6><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_6><col_7><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_7><col_8><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_8><col_9><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_9><col_10><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_10><col_11><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_11></row_0>
<row_1><col_0><col_header>class label</col_0><col_1><col_header>Count</col_1><col_2><col_header>Train</col_2><col_3><col_header>Test</col_3><col_4><col_header>Val</col_4><col_5><col_header>All</col_5><col_6><col_header>Fin</col_6><col_7><col_header>Man</col_7><col_8><col_header>Sci</col_8><col_9><col_header>Law</col_9><col_10><col_header>Pat</col_10><col_11><col_header>Ten</col_11></row_1>
<row_2><col_0><row_header>Caption</col_0><col_1><body>22524</col_1><col_2><body>2.04</col_2><col_3><body>1.77</col_3><col_4><body>2.32</col_4><col_5><body>84-89</col_5><col_6><body>40-61</col_6><col_7><body>86-92</col_7><col_8><body>94-99</col_8><col_9><body>95-99</col_9><col_10><body>69-78</col_10><col_11><body>n/a</col_11></row_2>
<row_3><col_0><row_header>Footnote</col_0><col_1><body>6318</col_1><col_2><body>0.60</col_2><col_3><body>0.31</col_3><col_4><body>0.58</col_4><col_5><body>83-91</col_5><col_6><body>n/a</col_6><col_7><body>100</col_7><col_8><body>62-88</col_8><col_9><body>85-94</col_9><col_10><body>n/a</col_10><col_11><body>82-97</col_11></row_3>
@ -102,7 +102,7 @@
<table>
<location><page_6><loc_10><loc_56><loc_47><loc_75></location>
<row_0><col_0><body></col_0><col_1><col_header>human</col_1><col_2><col_header>MRCNN</col_2><col_3><col_header>MRCNN</col_3><col_4><col_header>FRCNN</col_4><col_5><col_header>YOLO</col_5></row_0>
<row_1><col_0><body></col_0><col_1><col_header>human</col_1><col_2><col_header>R50</col_2><col_3><col_header>R101</col_3><col_4><col_header>R101</col_4><col_5><col_header>v5x6</col_5></row_1>
<row_1><col_0><body></col_0><col_1><body></col_1><col_2><col_header>R50</col_2><col_3><col_header>R101</col_3><col_4><col_header>R101</col_4><col_5><col_header>v5x6</col_5></row_1>
<row_2><col_0><row_header>Caption</col_0><col_1><body>84-89</col_1><col_2><body>68.4</col_2><col_3><body>71.5</col_3><col_4><body>70.1</col_4><col_5><body>77.7</col_5></row_2>
<row_3><col_0><row_header>Footnote</col_0><col_1><body>83-91</col_1><col_2><body>70.9</col_2><col_3><body>71.8</col_3><col_4><body>73.7</col_4><col_5><body>77.2</col_5></row_3>
<row_4><col_0><row_header>Formula</col_0><col_1><body>83-85</col_1><col_2><body>60.1</col_2><col_3><body>63.4</col_3><col_4><body>63.5</col_4><col_5><body>66.2</col_5></row_4>
@ -130,7 +130,7 @@
<paragraph><location><page_7><loc_9><loc_84><loc_48><loc_89></location>Table 3: Performance of a Mask R-CNN R50 network in mAP@0.5-0.95 scores trained on DocLayNet with different class label sets. The reduced label sets were obtained by either down-mapping or dropping labels.</paragraph>
<table>
<location><page_7><loc_13><loc_63><loc_44><loc_81></location>
<row_0><col_0><col_header>Class-count</col_0><col_1><col_header>11</col_1><col_2><col_header>6</col_2><col_3><col_header>5</col_3><col_4><col_header>4</col_4></row_0>
<row_0><col_0><body>Class-count</col_0><col_1><col_header>11</col_1><col_2><col_header>6</col_2><col_3><col_header>5</col_3><col_4><col_header>4</col_4></row_0>
<row_1><col_0><row_header>Caption</col_0><col_1><body>68</col_1><col_2><body>Text</col_2><col_3><body>Text</col_3><col_4><body>Text</col_4></row_1>
<row_2><col_0><row_header>Footnote</col_0><col_1><body>71</col_1><col_2><body>Text</col_2><col_3><body>Text</col_3><col_4><body>Text</col_4></row_2>
<row_3><col_0><row_header>Formula</col_0><col_1><body>60</col_1><col_2><body>Text</col_2><col_3><body>Text</col_3><col_4><body>Text</col_4></row_3>
@ -178,17 +178,17 @@
<row_1><col_0><col_header>Training on</col_0><col_1><col_header>labels</col_1><col_2><col_header>PLN</col_2><col_3><col_header>DB</col_3><col_4><col_header>DLN</col_4></row_1>
<row_2><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>Figure</col_1><col_2><body>96</col_2><col_3><body>43</col_3><col_4><body>23</col_4></row_2>
<row_3><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>Sec-header</col_1><col_2><body>87</col_2><col_3><body>-</col_3><col_4><body>32</col_4></row_3>
<row_4><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>Table</col_1><col_2><body>95</col_2><col_3><body>24</col_3><col_4><body>49</col_4></row_4>
<row_5><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>Text</col_1><col_2><body>96</col_2><col_3><body>-</col_3><col_4><body>42</col_4></row_5>
<row_6><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>total</col_1><col_2><body>93</col_2><col_3><body>34</col_3><col_4><body>30</col_4></row_6>
<row_4><col_0><body></col_0><col_1><row_header>Table</col_1><col_2><body>95</col_2><col_3><body>24</col_3><col_4><body>49</col_4></row_4>
<row_5><col_0><body></col_0><col_1><row_header>Text</col_1><col_2><body>96</col_2><col_3><body>-</col_3><col_4><body>42</col_4></row_5>
<row_6><col_0><body></col_0><col_1><row_header>total</col_1><col_2><body>93</col_2><col_3><body>34</col_3><col_4><body>30</col_4></row_6>
<row_7><col_0><row_header>DocBank (DB)</col_0><col_1><row_header>Figure</col_1><col_2><body>77</col_2><col_3><body>71</col_3><col_4><body>31</col_4></row_7>
<row_8><col_0><row_header>DocBank (DB)</col_0><col_1><row_header>Table</col_1><col_2><body>19</col_2><col_3><body>65</col_3><col_4><body>22</col_4></row_8>
<row_9><col_0><row_header>DocBank (DB)</col_0><col_1><row_header>total</col_1><col_2><body>48</col_2><col_3><body>68</col_3><col_4><body>27</col_4></row_9>
<row_10><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>Figure</col_1><col_2><body>67</col_2><col_3><body>51</col_3><col_4><body>72</col_4></row_10>
<row_11><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>Sec-header</col_1><col_2><body>53</col_2><col_3><body>-</col_3><col_4><body>68</col_4></row_11>
<row_12><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>Table</col_1><col_2><body>87</col_2><col_3><body>43</col_3><col_4><body>82</col_4></row_12>
<row_13><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>Text</col_1><col_2><body>77</col_2><col_3><body>-</col_3><col_4><body>84</col_4></row_13>
<row_14><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>total</col_1><col_2><body>59</col_2><col_3><body>47</col_3><col_4><body>78</col_4></row_14>
<row_12><col_0><body></col_0><col_1><row_header>Table</col_1><col_2><body>87</col_2><col_3><body>43</col_3><col_4><body>82</col_4></row_12>
<row_13><col_0><body></col_0><col_1><row_header>Text</col_1><col_2><body>77</col_2><col_3><body>-</col_3><col_4><body>84</col_4></row_13>
<row_14><col_0><body></col_0><col_1><row_header>total</col_1><col_2><body>59</col_2><col_3><body>47</col_3><col_4><body>78</col_4></row_14>
</table>
<paragraph><location><page_8><loc_9><loc_44><loc_48><loc_51></location>Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .</paragraph>
<paragraph><location><page_8><loc_9><loc_26><loc_48><loc_44></location>For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. We had to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.</paragraph>

File diff suppressed because one or more lines are too long

View File

@ -98,21 +98,21 @@ The annotation campaign was carried out in four phases. In phase one, we identif
Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row "Total") in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.
| | | % of Total | % of Total | % of Total | % of Total | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) |
|----------------|---------|--------------|--------------|--------------|--------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
| class label | Count | Train | Test | Val | All | Fin | Man | Sci | Law | Pat | Ten |
| Caption | 22524 | 2.04 | 1.77 | 2.32 | 84-89 | 40-61 | 86-92 | 94-99 | 95-99 | 69-78 | n/a |
| Footnote | 6318 | 0.60 | 0.31 | 0.58 | 83-91 | n/a | 100 | 62-88 | 85-94 | n/a | 82-97 |
| Formula | 25027 | 2.25 | 1.90 | 2.96 | 83-85 | n/a | n/a | 84-87 | 86-96 | n/a | n/a |
| List-item | 185660 | 17.19 | 13.34 | 15.82 | 87-88 | 74-83 | 90-92 | 97-97 | 81-85 | 75-88 | 93-95 |
| Page-footer | 70878 | 6.51 | 5.58 | 6.00 | 93-94 | 88-90 | 95-96 | 100 | 92-97 | 100 | 96-98 |
| Page-header | 58022 | 5.10 | 6.70 | 5.06 | 85-89 | 66-76 | 90-94 | 98-100 | 91-92 | 97-99 | 81-86 |
| Picture | 45976 | 4.21 | 2.78 | 5.31 | 69-71 | 56-59 | 82-86 | 69-82 | 80-95 | 66-71 | 59-76 |
| Section-header | 142884 | 12.60 | 15.77 | 12.85 | 83-84 | 76-81 | 90-92 | 94-95 | 87-94 | 69-73 | 78-86 |
| Table | 34733 | 3.20 | 2.27 | 3.60 | 77-81 | 75-80 | 83-86 | 98-99 | 58-80 | 79-84 | 70-85 |
| Text | 510377 | 45.82 | 49.28 | 45.00 | 84-86 | 81-86 | 88-93 | 89-93 | 87-92 | 71-79 | 87-95 |
| Title | 5071 | 0.47 | 0.30 | 0.50 | 60-72 | 24-63 | 50-63 | 94-100 | 82-96 | 68-79 | 24-56 |
| Total | 1107470 | 941123 | 99816 | 66531 | 82-83 | 71-74 | 79-81 | 89-94 | 86-91 | 71-76 | 68-85 |
| | | % of Total | % of Total | % of Total | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) |
|----------------|---------|--------------|--------------|--------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
| class label | Count | Train | Test | Val | All | Fin | Man | Sci | Law | Pat | Ten |
| Caption | 22524 | 2.04 | 1.77 | 2.32 | 84-89 | 40-61 | 86-92 | 94-99 | 95-99 | 69-78 | n/a |
| Footnote | 6318 | 0.60 | 0.31 | 0.58 | 83-91 | n/a | 100 | 62-88 | 85-94 | n/a | 82-97 |
| Formula | 25027 | 2.25 | 1.90 | 2.96 | 83-85 | n/a | n/a | 84-87 | 86-96 | n/a | n/a |
| List-item | 185660 | 17.19 | 13.34 | 15.82 | 87-88 | 74-83 | 90-92 | 97-97 | 81-85 | 75-88 | 93-95 |
| Page-footer | 70878 | 6.51 | 5.58 | 6.00 | 93-94 | 88-90 | 95-96 | 100 | 92-97 | 100 | 96-98 |
| Page-header | 58022 | 5.10 | 6.70 | 5.06 | 85-89 | 66-76 | 90-94 | 98-100 | 91-92 | 97-99 | 81-86 |
| Picture | 45976 | 4.21 | 2.78 | 5.31 | 69-71 | 56-59 | 82-86 | 69-82 | 80-95 | 66-71 | 59-76 |
| Section-header | 142884 | 12.60 | 15.77 | 12.85 | 83-84 | 76-81 | 90-92 | 94-95 | 87-94 | 69-73 | 78-86 |
| Table | 34733 | 3.20 | 2.27 | 3.60 | 77-81 | 75-80 | 83-86 | 98-99 | 58-80 | 79-84 | 70-85 |
| Text | 510377 | 45.82 | 49.28 | 45.00 | 84-86 | 81-86 | 88-93 | 89-93 | 87-92 | 71-79 | 87-95 |
| Title | 5071 | 0.47 | 0.30 | 0.50 | 60-72 | 24-63 | 50-63 | 94-100 | 82-96 | 68-79 | 24-56 |
| Total | 1107470 | 941123 | 99816 | 66531 | 82-83 | 71-74 | 79-81 | 89-94 | 86-91 | 71-76 | 68-85 |
Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.
<!-- image -->
@ -161,7 +161,7 @@ Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on D
| | human | MRCNN | MRCNN | FRCNN | YOLO |
|----------------|---------|---------|---------|---------|--------|
| | human | R50 | R101 | R101 | v5x6 |
| | | R50 | R101 | R101 | v5x6 |
| Caption | 84-89 | 68.4 | 71.5 | 70.1 | 77.7 |
| Footnote | 83-91 | 70.9 | 71.8 | 73.7 | 77.2 |
| Formula | 83-85 | 60.1 | 63.4 | 63.5 | 66.2 |
@ -252,17 +252,17 @@ Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network acros
| Training on | labels | PLN | DB | DLN |
| PubLayNet (PLN) | Figure | 96 | 43 | 23 |
| PubLayNet (PLN) | Sec-header | 87 | - | 32 |
| PubLayNet (PLN) | Table | 95 | 24 | 49 |
| PubLayNet (PLN) | Text | 96 | - | 42 |
| PubLayNet (PLN) | total | 93 | 34 | 30 |
| | Table | 95 | 24 | 49 |
| | Text | 96 | - | 42 |
| | total | 93 | 34 | 30 |
| DocBank (DB) | Figure | 77 | 71 | 31 |
| DocBank (DB) | Table | 19 | 65 | 22 |
| DocBank (DB) | total | 48 | 68 | 27 |
| DocLayNet (DLN) | Figure | 67 | 51 | 72 |
| DocLayNet (DLN) | Sec-header | 53 | - | 68 |
| DocLayNet (DLN) | Table | 87 | 43 | 82 |
| DocLayNet (DLN) | Text | 77 | - | 84 |
| DocLayNet (DLN) | total | 59 | 47 | 78 |
| | Table | 87 | 43 | 82 |
| | Text | 77 | - | 84 |
| | total | 59 | 47 | 78 |
Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .

File diff suppressed because one or more lines are too long

View File

@ -5,13 +5,12 @@
<table>
<location><page_1><loc_23><loc_41><loc_78><loc_57></location>
<caption>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption>
<row_0><col_0><col_header>#</col_0><col_1><col_header>#</col_1><col_2><col_header>Language</col_2><col_3><col_header>TEDs</col_3><col_4><col_header>TEDs</col_4><col_5><col_header>TEDs</col_5><col_6><col_header>mAP</col_6><col_7><col_header>Inference</col_7></row_0>
<row_1><col_0><col_header>enc-layers</col_0><col_1><col_header>dec-layers</col_1><col_2><col_header>Language</col_2><col_3><col_header>simple</col_3><col_4><col_header>complex</col_4><col_5><col_header>all</col_5><col_6><col_header>(0.75)</col_6><col_7><col_header>time (secs)</col_7></row_1>
<row_0><col_0><col_header># enc-layers</col_0><col_1><col_header># dec-layers</col_1><col_2><col_header>Language</col_2><col_3><col_header>TEDs</col_3><col_4><col_header>TEDs</col_4><col_5><col_header>TEDs</col_5><col_6><col_header>mAP</col_6><col_7><col_header>Inference</col_7></row_0>
<row_1><col_0><col_header># enc-layers</col_0><col_1><col_header># dec-layers</col_1><col_2><col_header>Language</col_2><col_3><col_header>simple</col_3><col_4><col_header>complex</col_4><col_5><col_header>all</col_5><col_6><col_header>(0.75)</col_6><col_7><col_header>time (secs)</col_7></row_1>
<row_2><col_0><body>6</col_0><col_1><body>6</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.965 0.969</col_3><col_4><body>0.934 0.927</col_4><col_5><body>0.955 0.955</col_5><col_6><body>0.88 0.857</col_6><col_7><body>2.73 5.39</col_7></row_2>
<row_3><col_0><body>4</col_0><col_1><body>4</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.938</col_3><col_4><body>0.904</col_4><col_5><body>0.927</col_5><col_6><body>0.853</col_6><col_7><body>1.97</col_7></row_3>
<row_4><col_0><body></col_0><col_1><body></col_1><col_2><body>OTSL</col_2><col_3><body>0.952 0.923</col_3><col_4><body>0.909</col_4><col_5><body>0.938</col_5><col_6><body>0.843</col_6><col_7><body>3.77</col_7></row_4>
<row_5><col_0><body>2</col_0><col_1><body>4</col_1><col_2><body>HTML</col_2><col_3><body>0.945</col_3><col_4><body>0.897 0.901</col_4><col_5><body>0.915 0.931</col_5><col_6><body>0.859 0.834</col_6><col_7><body>1.91 3.81</col_7></row_5>
<row_6><col_0><body>4</col_0><col_1><body>2</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.952 0.944</col_3><col_4><body>0.92 0.903</col_4><col_5><body>0.942 0.931</col_5><col_6><body>0.857 0.824</col_6><col_7><body>1.22 2</col_7></row_6>
<row_3><col_0><body>4</col_0><col_1><body>4</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.938 0.952</col_3><col_4><body>0.904 0.909</col_4><col_5><body>0.927 0.938</col_5><col_6><body>0.853 0.843</col_6><col_7><body>1.97 3.77</col_7></row_3>
<row_4><col_0><body>2</col_0><col_1><body>4</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.923 0.945</col_3><col_4><body>0.897 0.901</col_4><col_5><body>0.915 0.931</col_5><col_6><body>0.859 0.834</col_6><col_7><body>1.91 3.81</col_7></row_4>
<row_5><col_0><body>4</col_0><col_1><body>2</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.952 0.944</col_3><col_4><body>0.92 0.903</col_4><col_5><body>0.942 0.931</col_5><col_6><body>0.857 0.824</col_6><col_7><body>1.22 2</col_7></row_5>
</table>
<caption><location><page_1><loc_22><loc_59><loc_79><loc_66></location>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption>
<subtitle-level-1><location><page_1><loc_22><loc_35><loc_43><loc_36></location>5.2 Quantitative Results</subtitle-level-1>

File diff suppressed because one or more lines are too long

View File

@ -6,14 +6,13 @@ We have chosen the PubTabNet data set to perform HPO, since it includes a highly
Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.
| # | # | Language | TEDs | TEDs | TEDs | mAP | Inference |
|------------|------------|------------|-------------|-------------|-------------|-------------|-------------|
| enc-layers | dec-layers | Language | simple | complex | all | (0.75) | time (secs) |
| 6 | 6 | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
| 4 | 4 | OTSL HTML | 0.938 | 0.904 | 0.927 | 0.853 | 1.97 |
| | | OTSL | 0.952 0.923 | 0.909 | 0.938 | 0.843 | 3.77 |
| 2 | 4 | HTML | 0.945 | 0.897 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81 |
| 4 | 2 | OTSL HTML | 0.952 0.944 | 0.92 0.903 | 0.942 0.931 | 0.857 0.824 | 1.22 2 |
| # enc-layers | # dec-layers | Language | TEDs | TEDs | TEDs | mAP | Inference |
|----------------|----------------|------------|-------------|-------------|-------------|-------------|-------------|
| # enc-layers | # dec-layers | Language | simple | complex | all | (0.75) | time (secs) |
| 6 | 6 | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
| 4 | 4 | OTSL HTML | 0.938 0.952 | 0.904 0.909 | 0.927 0.938 | 0.853 0.843 | 1.97 3.77 |
| 2 | 4 | OTSL HTML | 0.923 0.945 | 0.897 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81 |
| 4 | 2 | OTSL HTML | 0.952 0.944 | 0.92 0.903 | 0.942 0.931 | 0.857 0.824 | 1.22 2 |
## 5.2 Quantitative Results

File diff suppressed because one or more lines are too long

View File

@ -77,13 +77,12 @@
<table>
<location><page_9><loc_23><loc_41><loc_78><loc_57></location>
<caption>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption>
<row_0><col_0><col_header>#</col_0><col_1><col_header>#</col_1><col_2><col_header>Language</col_2><col_3><col_header>TEDs</col_3><col_4><col_header>TEDs</col_4><col_5><col_header>TEDs</col_5><col_6><col_header>mAP</col_6><col_7><col_header>Inference</col_7></row_0>
<row_1><col_0><col_header>enc-layers</col_0><col_1><col_header>dec-layers</col_1><col_2><col_header>Language</col_2><col_3><col_header>simple</col_3><col_4><col_header>complex</col_4><col_5><col_header>all</col_5><col_6><col_header>(0.75)</col_6><col_7><col_header>time (secs)</col_7></row_1>
<row_0><col_0><col_header># enc-layers</col_0><col_1><col_header># dec-layers</col_1><col_2><col_header>Language</col_2><col_3><col_header>TEDs</col_3><col_4><col_header>TEDs</col_4><col_5><col_header>TEDs</col_5><col_6><col_header>mAP</col_6><col_7><col_header>Inference</col_7></row_0>
<row_1><col_0><col_header># enc-layers</col_0><col_1><col_header># dec-layers</col_1><col_2><col_header>Language</col_2><col_3><col_header>simple</col_3><col_4><col_header>complex</col_4><col_5><col_header>all</col_5><col_6><col_header>(0.75)</col_6><col_7><col_header>time (secs)</col_7></row_1>
<row_2><col_0><body>6</col_0><col_1><body>6</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.965 0.969</col_3><col_4><body>0.934 0.927</col_4><col_5><body>0.955 0.955</col_5><col_6><body>0.88 0.857</col_6><col_7><body>2.73 5.39</col_7></row_2>
<row_3><col_0><body>4</col_0><col_1><body>4</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.938 0.952</col_3><col_4><body>0.904</col_4><col_5><body>0.927</col_5><col_6><body>0.853</col_6><col_7><body>1.97</col_7></row_3>
<row_4><col_0><body>2</col_0><col_1><body>4</col_1><col_2><body>OTSL</col_2><col_3><body>0.923 0.945</col_3><col_4><body>0.909 0.897</col_4><col_5><body>0.938</col_5><col_6><body>0.843</col_6><col_7><body>3.77</col_7></row_4>
<row_5><col_0><body></col_0><col_1><body></col_1><col_2><body>HTML</col_2><col_3><body></col_3><col_4><body>0.901</col_4><col_5><body>0.915 0.931</col_5><col_6><body>0.859 0.834</col_6><col_7><body>1.91 3.81</col_7></row_5>
<row_6><col_0><body>4</col_0><col_1><body>2</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.952 0.944</col_3><col_4><body>0.92 0.903</col_4><col_5><body>0.942 0.931</col_5><col_6><body>0.857 0.824</col_6><col_7><body>1.22 2</col_7></row_6>
<row_3><col_0><body>4</col_0><col_1><body>4</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.938 0.952</col_3><col_4><body>0.904 0.909</col_4><col_5><body>0.927 0.938</col_5><col_6><body>0.853 0.843</col_6><col_7><body>1.97 3.77</col_7></row_3>
<row_4><col_0><body>2</col_0><col_1><body>4</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.923 0.945</col_3><col_4><body>0.897 0.901</col_4><col_5><body>0.915 0.931</col_5><col_6><body>0.859 0.834</col_6><col_7><body>1.91 3.81</col_7></row_4>
<row_5><col_0><body>4</col_0><col_1><body>2</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.952 0.944</col_3><col_4><body>0.92 0.903</col_4><col_5><body>0.942 0.931</col_5><col_6><body>0.857 0.824</col_6><col_7><body>1.22 2</col_7></row_5>
</table>
<caption><location><page_9><loc_22><loc_59><loc_79><loc_65></location>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption>
<subtitle-level-1><location><page_9><loc_22><loc_35><loc_43><loc_36></location>5.2 Quantitative Results</subtitle-level-1>
@ -92,14 +91,11 @@
<table>
<location><page_10><loc_23><loc_67><loc_77><loc_80></location>
<caption>Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using TableFormer [9] (with enc=6, dec=6, heads=8).</caption>
<row_0><col_0><body></col_0><col_1><col_header>Language</col_1><col_2><col_header>TEDs</col_2><col_3><col_header>TEDs</col_3><col_4><col_header>TEDs</col_4><col_5><col_header>mAP(0.75)</col_5><col_6><col_header>Inference time (secs)</col_6></row_0>
<row_1><col_0><body></col_0><col_1><col_header>Language</col_1><col_2><col_header>simple</col_2><col_3><col_header>complex</col_3><col_4><col_header>all</col_4><col_5><col_header>mAP(0.75)</col_5><col_6><col_header>Inference time (secs)</col_6></row_1>
<row_2><col_0><row_header>PubTabNet</col_0><col_1><row_header>OTSL</col_1><col_2><body>0.965</col_2><col_3><body>0.934</col_3><col_4><body>0.955</col_4><col_5><body>0.88</col_5><col_6><body>2.73</col_6></row_2>
<row_3><col_0><row_header>PubTabNet</col_0><col_1><row_header>HTML</col_1><col_2><body>0.969</col_2><col_3><body>0.927</col_3><col_4><body>0.955</col_4><col_5><body>0.857</col_5><col_6><body>5.39</col_6></row_3>
<row_4><col_0><row_header>FinTabNet</col_0><col_1><row_header>OTSL</col_1><col_2><body>0.955</col_2><col_3><body>0.961</col_3><col_4><body>0.959</col_4><col_5><body>0.862</col_5><col_6><body>1.85</col_6></row_4>
<row_5><col_0><row_header>FinTabNet</col_0><col_1><row_header>HTML</col_1><col_2><body>0.917</col_2><col_3><body>0.922</col_3><col_4><body>0.92</col_4><col_5><body>0.722</col_5><col_6><body>3.26</col_6></row_5>
<row_6><col_0><row_header>PubTables-1M</col_0><col_1><row_header>OTSL</col_1><col_2><body>0.987</col_2><col_3><body>0.964</col_3><col_4><body>0.977</col_4><col_5><body>0.896</col_5><col_6><body>1.79</col_6></row_6>
<row_7><col_0><row_header>PubTables-1M</col_0><col_1><row_header>HTML</col_1><col_2><body>0.983</col_2><col_3><body>0.944</col_3><col_4><body>0.966</col_4><col_5><body>0.889</col_5><col_6><body>3.26</col_6></row_7>
<row_0><col_0><col_header>Data set</col_0><col_1><col_header>Language</col_1><col_2><col_header>TEDs</col_2><col_3><col_header>TEDs</col_3><col_4><col_header>TEDs</col_4><col_5><col_header>mAP(0.75)</col_5><col_6><col_header>Inference time (secs)</col_6></row_0>
<row_1><col_0><col_header>Data set</col_0><col_1><col_header>Language</col_1><col_2><col_header>simple</col_2><col_3><col_header>complex</col_3><col_4><col_header>all</col_4><col_5><col_header>mAP(0.75)</col_5><col_6><col_header>Inference time (secs)</col_6></row_1>
<row_2><col_0><body>PubTabNet</col_0><col_1><body>OTSL HTML</col_1><col_2><body>0.965 0.969</col_2><col_3><body>0.934 0.927</col_3><col_4><body>0.955 0.955</col_4><col_5><body>0.88 0.857</col_5><col_6><body>2.73 5.39</col_6></row_2>
<row_3><col_0><body>FinTabNet</col_0><col_1><body>OTSL HTML</col_1><col_2><body>0.955 0.917</col_2><col_3><body>0.961 0.922</col_3><col_4><body>0.959 0.92</col_4><col_5><body>0.862 0.722</col_5><col_6><body>1.85 3.26</col_6></row_3>
<row_4><col_0><body>PubTables-1M</col_0><col_1><body>OTSL HTML</col_1><col_2><body>0.987 0.983</col_2><col_3><body>0.964 0.944</col_3><col_4><body>0.977 0.966</col_4><col_5><body>0.896 0.889</col_5><col_6><body>1.79 3.26</col_6></row_4>
</table>
<caption><location><page_10><loc_22><loc_82><loc_79><loc_85></location>Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using TableFormer [9] (with enc=6, dec=6, heads=8).</caption>
<subtitle-level-1><location><page_10><loc_22><loc_62><loc_42><loc_64></location>5.3 Qualitative Results</subtitle-level-1>

File diff suppressed because one or more lines are too long

View File

@ -130,14 +130,13 @@ We have chosen the PubTabNet data set to perform HPO, since it includes a highly
Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.
| # | # | Language | TEDs | TEDs | TEDs | mAP | Inference |
|------------|------------|------------|-------------|-------------|-------------|-------------|-------------|
| enc-layers | dec-layers | Language | simple | complex | all | (0.75) | time (secs) |
| 6 | 6 | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
| 4 | 4 | OTSL HTML | 0.938 0.952 | 0.904 | 0.927 | 0.853 | 1.97 |
| 2 | 4 | OTSL | 0.923 0.945 | 0.909 0.897 | 0.938 | 0.843 | 3.77 |
| | | HTML | | 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81 |
| 4 | 2 | OTSL HTML | 0.952 0.944 | 0.92 0.903 | 0.942 0.931 | 0.857 0.824 | 1.22 2 |
| # enc-layers | # dec-layers | Language | TEDs | TEDs | TEDs | mAP | Inference |
|----------------|----------------|------------|-------------|-------------|-------------|-------------|-------------|
| # enc-layers | # dec-layers | Language | simple | complex | all | (0.75) | time (secs) |
| 6 | 6 | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
| 4 | 4 | OTSL HTML | 0.938 0.952 | 0.904 0.909 | 0.927 0.938 | 0.853 0.843 | 1.97 3.77 |
| 2 | 4 | OTSL HTML | 0.923 0.945 | 0.897 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81 |
| 4 | 2 | OTSL HTML | 0.952 0.944 | 0.92 0.903 | 0.942 0.931 | 0.857 0.824 | 1.22 2 |
## 5.2 Quantitative Results
@ -147,15 +146,12 @@ Additionally, the results show that OTSL has an advantage over HTML when applied
Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using TableFormer [9] (with enc=6, dec=6, heads=8).
| | Language | TEDs | TEDs | TEDs | mAP(0.75) | Inference time (secs) |
|--------------|------------|--------|---------|--------|-------------|-------------------------|
| | Language | simple | complex | all | mAP(0.75) | Inference time (secs) |
| PubTabNet | OTSL | 0.965 | 0.934 | 0.955 | 0.88 | 2.73 |
| PubTabNet | HTML | 0.969 | 0.927 | 0.955 | 0.857 | 5.39 |
| FinTabNet | OTSL | 0.955 | 0.961 | 0.959 | 0.862 | 1.85 |
| FinTabNet | HTML | 0.917 | 0.922 | 0.92 | 0.722 | 3.26 |
| PubTables-1M | OTSL | 0.987 | 0.964 | 0.977 | 0.896 | 1.79 |
| PubTables-1M | HTML | 0.983 | 0.944 | 0.966 | 0.889 | 3.26 |
| Data set | Language | TEDs | TEDs | TEDs | mAP(0.75) | Inference time (secs) |
|--------------|------------|-------------|-------------|-------------|-------------|-------------------------|
| Data set | Language | simple | complex | all | mAP(0.75) | Inference time (secs) |
| PubTabNet | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
| FinTabNet | OTSL HTML | 0.955 0.917 | 0.961 0.922 | 0.959 0.92 | 0.862 0.722 | 1.85 3.26 |
| PubTables-1M | OTSL HTML | 0.987 0.983 | 0.964 0.944 | 0.977 0.966 | 0.896 0.889 | 1.79 3.26 |
## 5.3 Qualitative Results

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@ -7,7 +7,9 @@
<figure>
<location><page_1><loc_5><loc_11><loc_96><loc_63></location>
</figure>
<paragraph><location><page_1><loc_47><loc_94><loc_68><loc_96></location>Front cover</paragraph>
<figure>
<location><page_1><loc_52><loc_2><loc_95><loc_10></location>
</figure>
<subtitle-level-1><location><page_2><loc_11><loc_88><loc_28><loc_91></location>Contents</subtitle-level-1>
<paragraph><location><page_3><loc_11><loc_89><loc_39><loc_91></location>DB2 for i Center of Excellence</paragraph>
<paragraph><location><page_3><loc_15><loc_80><loc_38><loc_83></location>Solution Brief IBM Systems Lab Services and Training</paragraph>
@ -128,7 +130,7 @@
<table>
<location><page_9><loc_11><loc_9><loc_89><loc_50></location>
<caption>Table 2-2 Comparison of the different function usage IDs and *JOBCTL authority</caption>
<row_0><col_0><row_header>User action</col_0><col_1><body>*JOBCTL</col_1><col_2><body>QIBM_DB_SECADM</col_2><col_3><body>QIBM_DB_SQLADM</col_3><col_4><body>QIBM_DB_SYSMON</col_4><col_5><body>No Authority</col_5></row_0>
<row_0><col_0><body>User action</col_0><col_1><col_header>*JOBCTL</col_1><col_2><col_header>QIBM_DB_SECADM</col_2><col_3><col_header>QIBM_DB_SQLADM</col_3><col_4><col_header>QIBM_DB_SYSMON</col_4><col_5><col_header>No Authority</col_5></row_0>
<row_1><col_0><row_header>SET CURRENT DEGREE (SQL statement)</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_1>
<row_2><col_0><row_header>CHGQRYA command targeting a different users job</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_2>
<row_3><col_0><row_header>STRDBMON or ENDDBMON commands targeting a different users job</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_3>

File diff suppressed because one or more lines are too long

View File

@ -6,7 +6,7 @@ Front cover
<!-- image -->
Front cover
<!-- image -->
## Contents

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@ -8,13 +8,13 @@
<section_header_level_1><loc_41><loc_341><loc_104><loc_348>1. Introduction</section_header_level_1>
<text><loc_41><loc_354><loc_234><loc_450>The occurrence of tables in documents is ubiquitous. They often summarise quantitative or factual data, which is cumbersome to describe in verbose text but nevertheless extremely valuable. Unfortunately, this compact representation is often not easy to parse by machines. There are many implicit conventions used to obtain a compact table representation. For example, tables often have complex columnand row-headers in order to reduce duplicated cell content. Lines of different shapes and sizes are leveraged to separate content or indicate a tree structure. Additionally, tables can also have empty/missing table-entries or multi-row textual table-entries. Fig. 1 shows a table which presents all these issues.</text>
<picture><loc_258><loc_144><loc_439><loc_191></picture>
<otsl><loc_258><loc_144><loc_439><loc_191><ched>3<ched>1<nl></otsl>
<otsl><loc_258><loc_144><loc_439><loc_191><ched>1<nl></otsl>
<unordered_list><list_item><loc_258><loc_198><loc_397><loc_210>b. Red-annotation of bounding boxes, Blue-predictions by TableFormer</list_item>
<list_item><loc_258><loc_265><loc_401><loc_271>c. Structure predicted by TableFormer:</list_item>
</unordered_list>
<picture><loc_257><loc_213><loc_441><loc_259></picture>
<picture><loc_258><loc_274><loc_439><loc_313><caption><loc_252><loc_325><loc_445><loc_353>Figure 1: Picture of a table with subtle, complex features such as (1) multi-column headers, (2) cell with multi-row text and (3) cells with no content. Image from PubTabNet evaluation set, filename: 'PMC2944238 004 02'.</caption></picture>
<otsl><loc_258><loc_274><loc_439><loc_313><ched>0<ched>1<lcel><ched>2 1<lcel><ecel><nl><fcel>3<fcel>4<fcel>5 3<fcel>6<fcel>7<ecel><nl><fcel>8<fcel>9<fcel>10<fcel>11<fcel>12<fcel>2<nl><ecel><fcel>13<fcel>14<fcel>15<fcel>16<ucel><nl><ecel><fcel>17<fcel>18<fcel>19<fcel>20<ucel><nl></otsl>
<otsl><loc_258><loc_274><loc_439><loc_313><fcel>0<fcel>1 2 1<lcel><lcel><lcel><nl><fcel>3<fcel>4 3<fcel>5<fcel>6<fcel>7<nl><fcel>8 2<fcel>9<fcel>10<fcel>11<fcel>12<nl><fcel>13<ecel><fcel>14<fcel>15<fcel>16<nl><fcel>17<fcel>18<ecel><fcel>19<fcel>20<nl></otsl>
<text><loc_252><loc_369><loc_445><loc_420>Recently, significant progress has been made with vision based approaches to extract tables in documents. For the sake of completeness, the issue of table extraction from documents is typically decomposed into two separate challenges, i.e. (1) finding the location of the table(s) on a document-page and (2) finding the structure of a given table in the document.</text>
<text><loc_252><loc_422><loc_445><loc_450>The first problem is called table-location and has been previously addressed [30, 38, 19, 21, 23, 26, 8] with stateof-the-art object-detection networks (e.g. YOLO and later on Mask-RCNN [9]). For all practical purposes, it can be</text>
<page_footer><loc_241><loc_463><loc_245><loc_469>1</page_footer>
@ -102,7 +102,7 @@
<text><loc_41><loc_374><loc_234><loc_387>Table 2: Structure results on PubTabNet (PTN), FinTabNet (FTN), TableBank (TB) and SynthTabNet (STN).</text>
<text><loc_41><loc_389><loc_214><loc_395>FT: Model was trained on PubTabNet then finetuned.</text>
<text><loc_41><loc_407><loc_234><loc_450><loc_41><loc_407><loc_234><loc_450>Cell Detection. Like any object detector, our Cell BBox Detector provides bounding boxes that can be improved with post-processing during inference. We make use of the grid-like structure of tables to refine the predictions. A detailed explanation on the post-processing is available in the supplementary material. As shown in Tab. 3, we evaluate our Cell BBox Decoder accuracy for cells with a class label of 'content' only using the PASCAL VOC mAP metric for pre-processing and post-processing. Note that we do not have post-processing results for SynthTabNet as images are only provided. To compare the performance of our proposed approach, we've integrated TableFormer's Cell BBox Decoder into EDD architecture. As mentioned previously, the Structure Decoder provides the Cell BBox Decoder with the features needed to predict the bounding box predictions. Therefore, the accuracy of the Structure Decoder directly influences the accuracy of the Cell BBox Decoder . If the Structure Decoder predicts an extra column, this will result in an extra column of predicted bounding boxes.</text>
<otsl><loc_252><loc_156><loc_436><loc_192><ched>Model<ched>Dataset<ched>mAP<ched>mAP (PP)<nl><fcel>EDD+BBox<fcel>PubTabNet<fcel>79.2<fcel>82.7<nl><fcel>TableFormer<fcel>PubTabNet<fcel>82.1<fcel>86.8<nl><fcel>TableFormer<fcel>SynthTabNet<fcel>87.7<fcel>-<nl><caption><loc_252><loc_200><loc_445><loc_213>Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Post-processing.</caption></otsl>
<otsl><loc_252><loc_156><loc_436><loc_192><ched>Model<ched>Dataset<ched>mAP<ched>mAP (PP)<nl><rhed>EDD+BBox<fcel>PubTabNet<fcel>79.2<fcel>82.7<nl><rhed>TableFormer<fcel>PubTabNet<fcel>82.1<fcel>86.8<nl><rhed>TableFormer<fcel>SynthTabNet<fcel>87.7<fcel>-<nl><caption><loc_252><loc_200><loc_445><loc_213>Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Post-processing.</caption></otsl>
<text><loc_252><loc_232><loc_445><loc_328>Cell Content. In this section, we evaluate the entire pipeline of recovering a table with content. Here we put our approach to test by capitalizing on extracting content from the PDF cells rather than decoding from images. Tab. 4 shows the TEDs score of HTML code representing the structure of the table along with the content inserted in the data cell and compared with the ground-truth. Our method achieved a 5.3% increase over the state-of-the-art, and commercial solutions. We believe our scores would be higher if the HTML ground-truth matched the extracted PDF cell content. Unfortunately, there are small discrepancies such as spacings around words or special characters with various unicode representations.</text>
<otsl><loc_272><loc_341><loc_426><loc_406><fcel>Model<ched>Simple<ched>TEDS Complex<ched>All<nl><rhed>Tabula<fcel>78.0<fcel>57.8<fcel>67.9<nl><rhed>Traprange<fcel>60.8<fcel>49.9<fcel>55.4<nl><rhed>Camelot<fcel>80.0<fcel>66.0<fcel>73.0<nl><rhed>Acrobat Pro<fcel>68.9<fcel>61.8<fcel>65.3<nl><rhed>EDD<fcel>91.2<fcel>85.4<fcel>88.3<nl><rhed>TableFormer<fcel>95.4<fcel>90.1<fcel>93.6<nl><caption><loc_252><loc_415><loc_445><loc_435>Table 4: Results of structure with content retrieved using cell detection on PubTabNet. In all cases the input is PDF documents with cropped tables.</caption></otsl>
<page_footer><loc_241><loc_463><loc_245><loc_469>7</page_footer>
@ -114,7 +114,7 @@
<section_header_level_1><loc_249><loc_60><loc_352><loc_64>Example table from FinTabNet:</section_header_level_1>
<picture><loc_41><loc_65><loc_246><loc_118></picture>
<picture><loc_250><loc_62><loc_453><loc_114><caption><loc_44><loc_131><loc_315><loc_136>b. Structure predicted by TableFormer, with superimposed matched PDF cell text:</caption></picture>
<otsl><loc_44><loc_138><loc_244><loc_185><ecel><ecel><ched>論文ファイル<lcel><ched>参考文献<lcel><nl><ched>出典<ched>ファイル 数<ched>英語<ched>日本語<ched>英語<ched>日本語<nl><rhed>Association for Computational Linguistics(ACL2003)<fcel>65<fcel>65<fcel>0<fcel>150<fcel>0<nl><rhed>Computational Linguistics(COLING2002)<fcel>140<fcel>140<fcel>0<fcel>150<fcel>0<nl><rhed>電気情報通信学会 2003 年総合大会<fcel>150<fcel>8<fcel>142<fcel>223<fcel>147<nl><rhed>情報処理学会第 65 回全国大会 (2003)<fcel>177<fcel>1<fcel>176<fcel>150<fcel>236<nl><rhed>第 17 回人工知能学会全国大会 (2003)<fcel>208<fcel>5<fcel>203<fcel>152<fcel>244<nl><rhed>自然言語処理研究会第 146 〜 155 回<fcel>98<fcel>2<fcel>96<fcel>150<fcel>232<nl><rhed>WWW から収集した論文<fcel>107<fcel>73<fcel>34<fcel>147<fcel>96<nl><ecel><fcel>945<fcel>294<fcel>651<fcel>1122<fcel>955<nl><caption><loc_311><loc_185><loc_449><loc_189>Text is aligned to match original for ease of viewing</caption></otsl>
<otsl><loc_44><loc_138><loc_244><loc_185><ecel><ecel><ched>論文ファイル<lcel><ched>参考文献<lcel><nl><ched>出典<ched>ファイル 数<ched>英語<ched>日本語<ched>英語<ched>日本語<nl><rhed>Association for Computational Linguistics(ACL2003)<fcel>65<fcel>65<fcel>0<fcel>150<fcel>0<nl><rhed>Computational Linguistics(COLING2002)<fcel>140<fcel>140<fcel>0<fcel>150<fcel>0<nl><rhed>電気情報通信学会 2003 年総合大会<fcel>150<fcel>8<fcel>142<fcel>223<fcel>147<nl><rhed>情報処理学会第 65 回全国大会 (2003)<fcel>177<fcel>1<fcel>176<fcel>150<fcel>236<nl><rhed>第 17 回人工知能学会全国大会 (2003)<fcel>208<fcel>5<fcel>203<fcel>152<fcel>244<nl><rhed>自然言語処理研究会第 146 〜 155 回<fcel>98<fcel>2<fcel>96<fcel>150<fcel>232<nl><rhed>WWW から収集した論文<fcel>107<fcel>73<fcel>34<fcel>147<fcel>96<nl><rhed>計<fcel>945<fcel>294<fcel>651<fcel>1122<fcel>955<nl><caption><loc_311><loc_185><loc_449><loc_189>Text is aligned to match original for ease of viewing</caption></otsl>
<otsl><loc_249><loc_138><loc_450><loc_182><ecel><ched>Shares (in millions)<lcel><ched>Weighted Average Grant Date Fair Value<lcel><nl><ecel><ched>RS U s<ched>PSUs<ched>RSUs<ched>PSUs<nl><rhed>Nonvested on Janua ry 1<fcel>1. 1<fcel>0.3<fcel>90.10 $<fcel>$ 91.19<nl><rhed>Granted<fcel>0. 5<fcel>0.1<fcel>117.44<fcel>122.41<nl><rhed>Vested<fcel>(0. 5 )<fcel>(0.1)<fcel>87.08<fcel>81.14<nl><rhed>Canceled or forfeited<fcel>(0. 1 )<fcel>-<fcel>102.01<fcel>92.18<nl><rhed>Nonvested on December 31<fcel>1.0<fcel>0.3<fcel>104.85 $<fcel>$ 104.51<nl></otsl>
<picture><loc_42><loc_240><loc_173><loc_280><caption><loc_51><loc_290><loc_435><loc_295>Figure 6: An example of TableFormer predictions (bounding boxes and structure) from generated SynthTabNet table.</caption></picture>
<picture><loc_177><loc_240><loc_307><loc_280><caption><loc_41><loc_203><loc_445><loc_231>Figure 5: One of the benefits of TableFormer is that it is language agnostic, as an example, the left part of the illustration demonstrates TableFormer predictions on previously unseen language (Japanese). Additionally, we see that TableFormer is robust to variability in style and content, right side of the illustration shows the example of the TableFormer prediction from the FinTabNet dataset.</caption></picture>

File diff suppressed because one or more lines are too long

View File

@ -25,12 +25,12 @@ Figure 1: Picture of a table with subtle, complex features such as (1) multi-col
<!-- image -->
| 0 | 1 | 1 | 2 1 | 2 1 | |
|-----|-----|-----|-------|-------|----|
| 3 | 4 | 5 3 | 6 | 7 | |
| 8 | 9 | 10 | 11 | 12 | 2 |
| | 13 | 14 | 15 | 16 | 2 |
| | 17 | 18 | 19 | 20 | 2 |
| 0 | 1 2 1 | 1 2 1 | 1 2 1 | 1 2 1 |
|-----|---------|---------|---------|---------|
| 3 | 4 3 | 5 | 6 | 7 |
| 8 2 | 9 | 10 | 11 | 12 |
| 13 | | 14 | 15 | 16 |
| 17 | 18 | | 19 | 20 |
Recently, significant progress has been made with vision based approaches to extract tables in documents. For the sake of completeness, the issue of table extraction from documents is typically decomposed into two separate challenges, i.e. (1) finding the location of the table(s) on a document-page and (2) finding the structure of a given table in the document.
@ -247,7 +247,7 @@ Text is aligned to match original for ease of viewing
| 第 17 回人工知能学会全国大会 (2003) | 208 | 5 | 203 | 152 | 244 |
| 自然言語処理研究会第 146 〜 155 回 | 98 | 2 | 96 | 150 | 232 |
| WWW から収集した論文 | 107 | 73 | 34 | 147 | 96 |
| | 945 | 294 | 651 | 1122 | 955 |
| | 945 | 294 | 651 | 1122 | 955 |
| | Shares (in millions) | Shares (in millions) | Weighted Average Grant Date Fair Value | Weighted Average Grant Date Fair Value |
|--------------------------|------------------------|------------------------|------------------------------------------|------------------------------------------|

File diff suppressed because one or more lines are too long

View File

@ -58,7 +58,7 @@
<text><loc_260><loc_399><loc_457><loc_446>The annotation campaign was carried out in four phases. In phase one, we identified and prepared the data sources for annotation. In phase two, we determined the class labels and how annotations should be done on the documents in order to obtain maximum consistency. The latter was guided by a detailed requirement analysis and exhaustive experiments. In phase three, we trained the annotation staff and performed exams for quality assurance. In phase four,</text>
<page_break>
<page_header><loc_44><loc_38><loc_456><loc_43>KDD '22, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar</page_header>
<otsl><loc_81><loc_87><loc_419><loc_186><ecel><ecel><ched>% of Total<lcel><lcel><lcel><ched>triple inter-annotator mAP @ 0.5-0.95 (%)<lcel><lcel><lcel><lcel><lcel><nl><ched>class label<ched>Count<ched>Train<ched>Test<ched>Val<ched>All<ched>Fin<ched>Man<ched>Sci<ched>Law<ched>Pat<ched>Ten<nl><rhed>Caption<fcel>22524<fcel>2.04<fcel>1.77<fcel>2.32<fcel>84-89<fcel>40-61<fcel>86-92<fcel>94-99<fcel>95-99<fcel>69-78<fcel>n/a<nl><rhed>Footnote<fcel>6318<fcel>0.60<fcel>0.31<fcel>0.58<fcel>83-91<fcel>n/a<fcel>100<fcel>62-88<fcel>85-94<fcel>n/a<fcel>82-97<nl><rhed>Formula<fcel>25027<fcel>2.25<fcel>1.90<fcel>2.96<fcel>83-85<fcel>n/a<fcel>n/a<fcel>84-87<fcel>86-96<fcel>n/a<fcel>n/a<nl><rhed>List-item<fcel>185660<fcel>17.19<fcel>13.34<fcel>15.82<fcel>87-88<fcel>74-83<fcel>90-92<fcel>97-97<fcel>81-85<fcel>75-88<fcel>93-95<nl><rhed>Page-footer<fcel>70878<fcel>6.51<fcel>5.58<fcel>6.00<fcel>93-94<fcel>88-90<fcel>95-96<fcel>100<fcel>92-97<fcel>100<fcel>96-98<nl><rhed>Page-header<fcel>58022<fcel>5.10<fcel>6.70<fcel>5.06<fcel>85-89<fcel>66-76<fcel>90-94<fcel>98-100<fcel>91-92<fcel>97-99<fcel>81-86<nl><rhed>Picture<fcel>45976<fcel>4.21<fcel>2.78<fcel>5.31<fcel>69-71<fcel>56-59<fcel>82-86<fcel>69-82<fcel>80-95<fcel>66-71<fcel>59-76<nl><rhed>Section-header<fcel>142884<fcel>12.60<fcel>15.77<fcel>12.85<fcel>83-84<fcel>76-81<fcel>90-92<fcel>94-95<fcel>87-94<fcel>69-73<fcel>78-86<nl><rhed>Table<fcel>34733<fcel>3.20<fcel>2.27<fcel>3.60<fcel>77-81<fcel>75-80<fcel>83-86<fcel>98-99<fcel>58-80<fcel>79-84<fcel>70-85<nl><rhed>Text<fcel>510377<fcel>45.82<fcel>49.28<fcel>45.00<fcel>84-86<fcel>81-86<fcel>88-93<fcel>89-93<fcel>87-92<fcel>71-79<fcel>87-95<nl><rhed>Title<fcel>5071<fcel>0.47<fcel>0.30<fcel>0.50<fcel>60-72<fcel>24-63<fcel>50-63<fcel>94-100<fcel>82-96<fcel>68-79<fcel>24-56<nl><rhed>Total<fcel>1107470<fcel>941123<fcel>99816<fcel>66531<fcel>82-83<fcel>71-74<fcel>79-81<fcel>89-94<fcel>86-91<fcel>71-76<fcel>68-85<nl><caption><loc_44><loc_54><loc_456><loc_73>Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row "Total") in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.</caption></otsl>
<otsl><loc_81><loc_87><loc_419><loc_186><ecel><ecel><ched>% of Total<lcel><lcel><ched>triple inter-annotator mAP @ 0.5-0.95 (%)<lcel><lcel><lcel><lcel><lcel><lcel><nl><ched>class label<ched>Count<ched>Train<ched>Test<ched>Val<ched>All<ched>Fin<ched>Man<ched>Sci<ched>Law<ched>Pat<ched>Ten<nl><rhed>Caption<fcel>22524<fcel>2.04<fcel>1.77<fcel>2.32<fcel>84-89<fcel>40-61<fcel>86-92<fcel>94-99<fcel>95-99<fcel>69-78<fcel>n/a<nl><rhed>Footnote<fcel>6318<fcel>0.60<fcel>0.31<fcel>0.58<fcel>83-91<fcel>n/a<fcel>100<fcel>62-88<fcel>85-94<fcel>n/a<fcel>82-97<nl><rhed>Formula<fcel>25027<fcel>2.25<fcel>1.90<fcel>2.96<fcel>83-85<fcel>n/a<fcel>n/a<fcel>84-87<fcel>86-96<fcel>n/a<fcel>n/a<nl><rhed>List-item<fcel>185660<fcel>17.19<fcel>13.34<fcel>15.82<fcel>87-88<fcel>74-83<fcel>90-92<fcel>97-97<fcel>81-85<fcel>75-88<fcel>93-95<nl><rhed>Page-footer<fcel>70878<fcel>6.51<fcel>5.58<fcel>6.00<fcel>93-94<fcel>88-90<fcel>95-96<fcel>100<fcel>92-97<fcel>100<fcel>96-98<nl><rhed>Page-header<fcel>58022<fcel>5.10<fcel>6.70<fcel>5.06<fcel>85-89<fcel>66-76<fcel>90-94<fcel>98-100<fcel>91-92<fcel>97-99<fcel>81-86<nl><rhed>Picture<fcel>45976<fcel>4.21<fcel>2.78<fcel>5.31<fcel>69-71<fcel>56-59<fcel>82-86<fcel>69-82<fcel>80-95<fcel>66-71<fcel>59-76<nl><rhed>Section-header<fcel>142884<fcel>12.60<fcel>15.77<fcel>12.85<fcel>83-84<fcel>76-81<fcel>90-92<fcel>94-95<fcel>87-94<fcel>69-73<fcel>78-86<nl><rhed>Table<fcel>34733<fcel>3.20<fcel>2.27<fcel>3.60<fcel>77-81<fcel>75-80<fcel>83-86<fcel>98-99<fcel>58-80<fcel>79-84<fcel>70-85<nl><rhed>Text<fcel>510377<fcel>45.82<fcel>49.28<fcel>45.00<fcel>84-86<fcel>81-86<fcel>88-93<fcel>89-93<fcel>87-92<fcel>71-79<fcel>87-95<nl><rhed>Title<fcel>5071<fcel>0.47<fcel>0.30<fcel>0.50<fcel>60-72<fcel>24-63<fcel>50-63<fcel>94-100<fcel>82-96<fcel>68-79<fcel>24-56<nl><rhed>Total<fcel>1107470<fcel>941123<fcel>99816<fcel>66531<fcel>82-83<fcel>71-74<fcel>79-81<fcel>89-94<fcel>86-91<fcel>71-76<fcel>68-85<nl><caption><loc_44><loc_54><loc_456><loc_73>Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row "Total") in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.</caption></otsl>
<picture><loc_43><loc_196><loc_242><loc_341><caption><loc_44><loc_350><loc_242><loc_383>Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.</caption></picture>
<text><loc_44><loc_400><loc_240><loc_426>we distributed the annotation workload and performed continuous quality controls. Phase one and two required a small team of experts only. For phases three and four, a group of 40 dedicated annotators were assembled and supervised.</text>
<text><loc_44><loc_428><loc_241><loc_447><loc_44><loc_428><loc_241><loc_447>Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. A large effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv$^{3}$, government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.</text>
@ -88,7 +88,7 @@
<page_break>
<page_header><loc_44><loc_38><loc_456><loc_43>KDD '22, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar</page_header>
<text><loc_44><loc_55><loc_242><loc_116>Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on DocLayNet test set. The MRCNN (Mask R-CNN) and FRCNN (Faster R-CNN) models with ResNet-50 or ResNet-101 backbone were trained based on the network architectures from the detectron2 model zoo (Mask R-CNN R50, R101-FPN 3x, Faster R-CNN R101-FPN 3x), with default configurations. The YOLO implementation utilized was YOLOv5x6 [13]. All models were initialised using pre-trained weights from the COCO 2017 dataset.</text>
<otsl><loc_51><loc_124><loc_233><loc_222><ecel><ched>human<ched>MRCNN<lcel><ched>FRCNN<ched>YOLO<nl><ecel><ucel><ched>R50<ched>R101<ched>R101<ched>v5x6<nl><rhed>Caption<fcel>84-89<fcel>68.4<fcel>71.5<fcel>70.1<fcel>77.7<nl><rhed>Footnote<fcel>83-91<fcel>70.9<fcel>71.8<fcel>73.7<fcel>77.2<nl><rhed>Formula<fcel>83-85<fcel>60.1<fcel>63.4<fcel>63.5<fcel>66.2<nl><rhed>List-item<fcel>87-88<fcel>81.2<fcel>80.8<fcel>81.0<fcel>86.2<nl><rhed>Page-footer<fcel>93-94<fcel>61.6<fcel>59.3<fcel>58.9<fcel>61.1<nl><rhed>Page-header<fcel>85-89<fcel>71.9<fcel>70.0<fcel>72.0<fcel>67.9<nl><rhed>Picture<fcel>69-71<fcel>71.7<fcel>72.7<fcel>72.0<fcel>77.1<nl><rhed>Section-header<fcel>83-84<fcel>67.6<fcel>69.3<fcel>68.4<fcel>74.6<nl><rhed>Table<fcel>77-81<fcel>82.2<fcel>82.9<fcel>82.2<fcel>86.3<nl><rhed>Text<fcel>84-86<fcel>84.6<fcel>85.8<fcel>85.4<fcel>88.1<nl><rhed>Title<fcel>60-72<fcel>76.7<fcel>80.4<fcel>79.9<fcel>82.7<nl><rhed>All<fcel>82-83<fcel>72.4<fcel>73.5<fcel>73.4<fcel>76.8<nl></otsl>
<otsl><loc_51><loc_124><loc_233><loc_222><ecel><ched>human<ched>MRCNN<lcel><ched>FRCNN<ched>YOLO<nl><ecel><ecel><ched>R50<ched>R101<ched>R101<ched>v5x6<nl><rhed>Caption<fcel>84-89<fcel>68.4<fcel>71.5<fcel>70.1<fcel>77.7<nl><rhed>Footnote<fcel>83-91<fcel>70.9<fcel>71.8<fcel>73.7<fcel>77.2<nl><rhed>Formula<fcel>83-85<fcel>60.1<fcel>63.4<fcel>63.5<fcel>66.2<nl><rhed>List-item<fcel>87-88<fcel>81.2<fcel>80.8<fcel>81.0<fcel>86.2<nl><rhed>Page-footer<fcel>93-94<fcel>61.6<fcel>59.3<fcel>58.9<fcel>61.1<nl><rhed>Page-header<fcel>85-89<fcel>71.9<fcel>70.0<fcel>72.0<fcel>67.9<nl><rhed>Picture<fcel>69-71<fcel>71.7<fcel>72.7<fcel>72.0<fcel>77.1<nl><rhed>Section-header<fcel>83-84<fcel>67.6<fcel>69.3<fcel>68.4<fcel>74.6<nl><rhed>Table<fcel>77-81<fcel>82.2<fcel>82.9<fcel>82.2<fcel>86.3<nl><rhed>Text<fcel>84-86<fcel>84.6<fcel>85.8<fcel>85.4<fcel>88.1<nl><rhed>Title<fcel>60-72<fcel>76.7<fcel>80.4<fcel>79.9<fcel>82.7<nl><rhed>All<fcel>82-83<fcel>72.4<fcel>73.5<fcel>73.4<fcel>76.8<nl></otsl>
<text><loc_44><loc_234><loc_241><loc_364>to avoid this at any cost in order to have clear, unbiased baseline numbers for human document-layout annotation. Third, we introduced the feature of snapping boxes around text segments to obtain a pixel-accurate annotation and again reduce time and effort. The CCS annotation tool automatically shrinks every user-drawn box to the minimum bounding-box around the enclosed text-cells for all purely text-based segments, which excludes only Table and Picture . For the latter, we instructed annotation staff to minimise inclusion of surrounding whitespace while including all graphical lines. A downside of snapping boxes to enclosed text cells is that some wrongly parsed PDF pages cannot be annotated correctly and need to be skipped. Fourth, we established a way to flag pages as rejected for cases where no valid annotation according to the label guidelines could be achieved. Example cases for this would be PDF pages that render incorrectly or contain layouts that are impossible to capture with non-overlapping rectangles. Such rejected pages are not contained in the final dataset. With all these measures in place, experienced annotation staff managed to annotate a single page in a typical timeframe of 20s to 60s, depending on its complexity.</text>
<section_header_level_1><loc_44><loc_371><loc_120><loc_378>5 EXPERIMENTS</section_header_level_1>
<text><loc_44><loc_387><loc_241><loc_448>The primary goal of DocLayNet is to obtain high-quality ML models capable of accurate document-layout analysis on a wide variety of challenging layouts. As discussed in Section 2, object detection models are currently the easiest to use, due to the standardisation of ground-truth data in COCO format [16] and the availability of general frameworks such as detectron2 [17]. Furthermore, baseline numbers in PubLayNet and DocBank were obtained using standard object detection models such as Mask R-CNN and Faster R-CNN. As such, we will relate to these object detection methods in this</text>
@ -101,7 +101,7 @@
<page_header><loc_44><loc_38><loc_284><loc_43>DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis</page_header>
<page_header><loc_299><loc_38><loc_456><loc_43>KDD 22, August 14-18, 2022, Washington, DC, USA</page_header>
<text><loc_44><loc_55><loc_242><loc_81>Table 3: Performance of a Mask R-CNN R50 network in mAP@0.5-0.95 scores trained on DocLayNet with different class label sets. The reduced label sets were obtained by either down-mapping or dropping labels.</text>
<otsl><loc_66><loc_95><loc_218><loc_187><ched>Class-count<ched>11<ched>6<ched>5<ched>4<nl><rhed>Caption<fcel>68<fcel>Text<fcel>Text<fcel>Text<nl><rhed>Footnote<fcel>71<fcel>Text<fcel>Text<fcel>Text<nl><rhed>Formula<fcel>60<fcel>Text<fcel>Text<fcel>Text<nl><rhed>List-item<fcel>81<fcel>Text<fcel>82<fcel>Text<nl><rhed>Page-footer<fcel>62<fcel>62<fcel>-<fcel>-<nl><rhed>Page-header<fcel>72<fcel>68<fcel>-<fcel>-<nl><rhed>Picture<fcel>72<fcel>72<fcel>72<fcel>72<nl><rhed>Section-header<fcel>68<fcel>67<fcel>69<fcel>68<nl><rhed>Table<fcel>82<fcel>83<fcel>82<fcel>82<nl><rhed>Text<fcel>85<fcel>84<fcel>84<fcel>84<nl><rhed>Title<fcel>77<fcel>Sec.-h.<fcel>Sec.-h.<fcel>Sec.-h.<nl><rhed>Overall<fcel>72<fcel>73<fcel>78<fcel>77<nl></otsl>
<otsl><loc_66><loc_95><loc_218><loc_187><fcel>Class-count<ched>11<ched>6<ched>5<ched>4<nl><rhed>Caption<fcel>68<fcel>Text<fcel>Text<fcel>Text<nl><rhed>Footnote<fcel>71<fcel>Text<fcel>Text<fcel>Text<nl><rhed>Formula<fcel>60<fcel>Text<fcel>Text<fcel>Text<nl><rhed>List-item<fcel>81<fcel>Text<fcel>82<fcel>Text<nl><rhed>Page-footer<fcel>62<fcel>62<fcel>-<fcel>-<nl><rhed>Page-header<fcel>72<fcel>68<fcel>-<fcel>-<nl><rhed>Picture<fcel>72<fcel>72<fcel>72<fcel>72<nl><rhed>Section-header<fcel>68<fcel>67<fcel>69<fcel>68<nl><rhed>Table<fcel>82<fcel>83<fcel>82<fcel>82<nl><rhed>Text<fcel>85<fcel>84<fcel>84<fcel>84<nl><rhed>Title<fcel>77<fcel>Sec.-h.<fcel>Sec.-h.<fcel>Sec.-h.<nl><rhed>Overall<fcel>72<fcel>73<fcel>78<fcel>77<nl></otsl>
<section_header_level_1><loc_44><loc_202><loc_107><loc_208>Learning Curve</section_header_level_1>
<text><loc_43><loc_211><loc_241><loc_334>One of the fundamental questions related to any dataset is if it is "large enough". To answer this question for DocLayNet, we performed a data ablation study in which we evaluated a Mask R-CNN model trained on increasing fractions of the DocLayNet dataset. As can be seen in Figure 5, the mAP score rises sharply in the beginning and eventually levels out. To estimate the error-bar on the metrics, we ran the training five times on the entire data-set. This resulted in a 1% error-bar, depicted by the shaded area in Figure 5. In the inset of Figure 5, we show the exact same data-points, but with a logarithmic scale on the x-axis. As is expected, the mAP score increases linearly as a function of the data-size in the inset. The curve ultimately flattens out between the 80% and 100% mark, with the 80% mark falling within the error-bars of the 100% mark. This provides a good indication that the model would not improve significantly by yet increasing the data size. Rather, it would probably benefit more from improved data consistency (as discussed in Section 3), data augmentation methods [23], or the addition of more document categories and styles.</text>
<section_header_level_1><loc_44><loc_342><loc_134><loc_349>Impact of Class Labels</section_header_level_1>
@ -116,7 +116,7 @@
<page_break>
<page_header><loc_44><loc_38><loc_456><loc_43>KDD '22, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar</page_header>
<text><loc_44><loc_55><loc_242><loc_95>Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network across the PubLayNet, DocBank & DocLayNet data-sets. By evaluating on common label classes of each dataset, we observe that the DocLayNet-trained model has much less pronounced variations in performance across all datasets.</text>
<otsl><loc_59><loc_109><loc_225><loc_215><ecel><ecel><ched>Testing on<lcel><lcel><nl><ched>Training on<ched>labels<ched>PLN<ched>DB<ched>DLN<nl><rhed>PubLayNet (PLN)<rhed>Figure<fcel>96<fcel>43<fcel>23<nl><ucel><rhed>Sec-header<fcel>87<fcel>-<fcel>32<nl><ucel><rhed>Table<fcel>95<fcel>24<fcel>49<nl><ucel><rhed>Text<fcel>96<fcel>-<fcel>42<nl><ucel><rhed>total<fcel>93<fcel>34<fcel>30<nl><rhed>DocBank (DB)<rhed>Figure<fcel>77<fcel>71<fcel>31<nl><ucel><rhed>Table<fcel>19<fcel>65<fcel>22<nl><ucel><rhed>total<fcel>48<fcel>68<fcel>27<nl><rhed>DocLayNet (DLN)<rhed>Figure<fcel>67<fcel>51<fcel>72<nl><ucel><rhed>Sec-header<fcel>53<fcel>-<fcel>68<nl><ucel><rhed>Table<fcel>87<fcel>43<fcel>82<nl><ucel><rhed>Text<fcel>77<fcel>-<fcel>84<nl><ucel><rhed>total<fcel>59<fcel>47<fcel>78<nl></otsl>
<otsl><loc_59><loc_109><loc_225><loc_215><ecel><ecel><ched>Testing on<lcel><lcel><nl><ched>Training on<ched>labels<ched>PLN<ched>DB<ched>DLN<nl><rhed>PubLayNet (PLN)<rhed>Figure<fcel>96<fcel>43<fcel>23<nl><ucel><rhed>Sec-header<fcel>87<fcel>-<fcel>32<nl><ecel><rhed>Table<fcel>95<fcel>24<fcel>49<nl><ecel><rhed>Text<fcel>96<fcel>-<fcel>42<nl><ecel><rhed>total<fcel>93<fcel>34<fcel>30<nl><rhed>DocBank (DB)<rhed>Figure<fcel>77<fcel>71<fcel>31<nl><ucel><rhed>Table<fcel>19<fcel>65<fcel>22<nl><ucel><rhed>total<fcel>48<fcel>68<fcel>27<nl><rhed>DocLayNet (DLN)<rhed>Figure<fcel>67<fcel>51<fcel>72<nl><ucel><rhed>Sec-header<fcel>53<fcel>-<fcel>68<nl><ecel><rhed>Table<fcel>87<fcel>43<fcel>82<nl><ecel><rhed>Text<fcel>77<fcel>-<fcel>84<nl><ecel><rhed>total<fcel>59<fcel>47<fcel>78<nl></otsl>
<text><loc_44><loc_247><loc_240><loc_280>Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .</text>
<text><loc_44><loc_281><loc_241><loc_370>For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. We had to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.</text>
<section_header_level_1><loc_44><loc_382><loc_127><loc_388>Example Predictions</section_header_level_1>

File diff suppressed because one or more lines are too long

View File

@ -97,21 +97,21 @@ The annotation campaign was carried out in four phases. In phase one, we identif
Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row "Total") in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.
| | | % of Total | % of Total | % of Total | % of Total | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) |
|----------------|---------|--------------|--------------|--------------|--------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
| class label | Count | Train | Test | Val | All | Fin | Man | Sci | Law | Pat | Ten |
| Caption | 22524 | 2.04 | 1.77 | 2.32 | 84-89 | 40-61 | 86-92 | 94-99 | 95-99 | 69-78 | n/a |
| Footnote | 6318 | 0.60 | 0.31 | 0.58 | 83-91 | n/a | 100 | 62-88 | 85-94 | n/a | 82-97 |
| Formula | 25027 | 2.25 | 1.90 | 2.96 | 83-85 | n/a | n/a | 84-87 | 86-96 | n/a | n/a |
| List-item | 185660 | 17.19 | 13.34 | 15.82 | 87-88 | 74-83 | 90-92 | 97-97 | 81-85 | 75-88 | 93-95 |
| Page-footer | 70878 | 6.51 | 5.58 | 6.00 | 93-94 | 88-90 | 95-96 | 100 | 92-97 | 100 | 96-98 |
| Page-header | 58022 | 5.10 | 6.70 | 5.06 | 85-89 | 66-76 | 90-94 | 98-100 | 91-92 | 97-99 | 81-86 |
| Picture | 45976 | 4.21 | 2.78 | 5.31 | 69-71 | 56-59 | 82-86 | 69-82 | 80-95 | 66-71 | 59-76 |
| Section-header | 142884 | 12.60 | 15.77 | 12.85 | 83-84 | 76-81 | 90-92 | 94-95 | 87-94 | 69-73 | 78-86 |
| Table | 34733 | 3.20 | 2.27 | 3.60 | 77-81 | 75-80 | 83-86 | 98-99 | 58-80 | 79-84 | 70-85 |
| Text | 510377 | 45.82 | 49.28 | 45.00 | 84-86 | 81-86 | 88-93 | 89-93 | 87-92 | 71-79 | 87-95 |
| Title | 5071 | 0.47 | 0.30 | 0.50 | 60-72 | 24-63 | 50-63 | 94-100 | 82-96 | 68-79 | 24-56 |
| Total | 1107470 | 941123 | 99816 | 66531 | 82-83 | 71-74 | 79-81 | 89-94 | 86-91 | 71-76 | 68-85 |
| | | % of Total | % of Total | % of Total | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) |
|----------------|---------|--------------|--------------|--------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
| class label | Count | Train | Test | Val | All | Fin | Man | Sci | Law | Pat | Ten |
| Caption | 22524 | 2.04 | 1.77 | 2.32 | 84-89 | 40-61 | 86-92 | 94-99 | 95-99 | 69-78 | n/a |
| Footnote | 6318 | 0.60 | 0.31 | 0.58 | 83-91 | n/a | 100 | 62-88 | 85-94 | n/a | 82-97 |
| Formula | 25027 | 2.25 | 1.90 | 2.96 | 83-85 | n/a | n/a | 84-87 | 86-96 | n/a | n/a |
| List-item | 185660 | 17.19 | 13.34 | 15.82 | 87-88 | 74-83 | 90-92 | 97-97 | 81-85 | 75-88 | 93-95 |
| Page-footer | 70878 | 6.51 | 5.58 | 6.00 | 93-94 | 88-90 | 95-96 | 100 | 92-97 | 100 | 96-98 |
| Page-header | 58022 | 5.10 | 6.70 | 5.06 | 85-89 | 66-76 | 90-94 | 98-100 | 91-92 | 97-99 | 81-86 |
| Picture | 45976 | 4.21 | 2.78 | 5.31 | 69-71 | 56-59 | 82-86 | 69-82 | 80-95 | 66-71 | 59-76 |
| Section-header | 142884 | 12.60 | 15.77 | 12.85 | 83-84 | 76-81 | 90-92 | 94-95 | 87-94 | 69-73 | 78-86 |
| Table | 34733 | 3.20 | 2.27 | 3.60 | 77-81 | 75-80 | 83-86 | 98-99 | 58-80 | 79-84 | 70-85 |
| Text | 510377 | 45.82 | 49.28 | 45.00 | 84-86 | 81-86 | 88-93 | 89-93 | 87-92 | 71-79 | 87-95 |
| Title | 5071 | 0.47 | 0.30 | 0.50 | 60-72 | 24-63 | 50-63 | 94-100 | 82-96 | 68-79 | 24-56 |
| Total | 1107470 | 941123 | 99816 | 66531 | 82-83 | 71-74 | 79-81 | 89-94 | 86-91 | 71-76 | 68-85 |
Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.
@ -154,7 +154,7 @@ Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on D
| | human | MRCNN | MRCNN | FRCNN | YOLO |
|----------------|---------|---------|---------|---------|--------|
| | human | R50 | R101 | R101 | v5x6 |
| | | R50 | R101 | R101 | v5x6 |
| Caption | 84-89 | 68.4 | 71.5 | 70.1 | 77.7 |
| Footnote | 83-91 | 70.9 | 71.8 | 73.7 | 77.2 |
| Formula | 83-85 | 60.1 | 63.4 | 63.5 | 66.2 |
@ -246,17 +246,17 @@ Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network acros
| Training on | labels | PLN | DB | DLN |
| PubLayNet (PLN) | Figure | 96 | 43 | 23 |
| PubLayNet (PLN) | Sec-header | 87 | - | 32 |
| PubLayNet (PLN) | Table | 95 | 24 | 49 |
| PubLayNet (PLN) | Text | 96 | - | 42 |
| PubLayNet (PLN) | total | 93 | 34 | 30 |
| | Table | 95 | 24 | 49 |
| | Text | 96 | - | 42 |
| | total | 93 | 34 | 30 |
| DocBank (DB) | Figure | 77 | 71 | 31 |
| DocBank (DB) | Table | 19 | 65 | 22 |
| DocBank (DB) | total | 48 | 68 | 27 |
| DocLayNet (DLN) | Figure | 67 | 51 | 72 |
| DocLayNet (DLN) | Sec-header | 53 | - | 68 |
| DocLayNet (DLN) | Table | 87 | 43 | 82 |
| DocLayNet (DLN) | Text | 77 | - | 84 |
| DocLayNet (DLN) | total | 59 | 47 | 78 |
| | Table | 87 | 43 | 82 |
| | Text | 77 | - | 84 |
| | total | 59 | 47 | 78 |
Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .

File diff suppressed because one or more lines are too long

View File

@ -3,7 +3,7 @@
<text><loc_110><loc_74><loc_393><loc_97>order to compute the TED score. Inference timing results for all experiments were obtained from the same machine on a single core with AMD EPYC 7763 CPU @2.45 GHz.</text>
<section_header_level_1><loc_110><loc_105><loc_260><loc_113>5.1 Hyper Parameter Optimization</section_header_level_1>
<text><loc_110><loc_116><loc_393><loc_161>We have chosen the PubTabNet data set to perform HPO, since it includes a highly diverse set of tables. Also we report TED scores separately for simple and complex tables (tables with cell spans). Results are presented in Table. 1. It is evident that with OTSL, our model achieves the same TED score and slightly better mAP scores in comparison to HTML. However OTSL yields a 2x speed up in the inference runtime over HTML.</text>
<otsl><loc_114><loc_213><loc_388><loc_296><ched>#<ched>#<ched>Language<ched>TEDs<lcel><lcel><ched>mAP<ched>Inference<nl><ched>enc-layers<ched>dec-layers<ucel><ched>simple<ched>complex<ched>all<ched>(0.75)<ched>time (secs)<nl><fcel>6<fcel>6<fcel>OTSL HTML<fcel>0.965 0.969<fcel>0.934 0.927<fcel>0.955 0.955<fcel>0.88 0.857<fcel>2.73 5.39<nl><fcel>4<fcel>4<fcel>OTSL HTML<fcel>0.938<fcel>0.904<fcel>0.927<fcel>0.853<fcel>1.97<nl><ecel><ecel><fcel>OTSL<fcel>0.952 0.923<fcel>0.909<fcel>0.938<fcel>0.843<fcel>3.77<nl><fcel>2<fcel>4<fcel>HTML<fcel>0.945<fcel>0.897 0.901<fcel>0.915 0.931<fcel>0.859 0.834<fcel>1.91 3.81<nl><fcel>4<fcel>2<fcel>OTSL HTML<fcel>0.952 0.944<fcel>0.92 0.903<fcel>0.942 0.931<fcel>0.857 0.824<fcel>1.22 2<nl><caption><loc_110><loc_172><loc_393><loc_207>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption></otsl>
<otsl><loc_114><loc_213><loc_388><loc_296><ched># enc-layers<ched># dec-layers<ched>Language<ched>TEDs<lcel><lcel><ched>mAP<ched>Inference<nl><ucel><ucel><ucel><ched>simple<ched>complex<ched>all<ched>(0.75)<ched>time (secs)<nl><fcel>6<fcel>6<fcel>OTSL HTML<fcel>0.965 0.969<fcel>0.934 0.927<fcel>0.955 0.955<fcel>0.88 0.857<fcel>2.73 5.39<nl><fcel>4<fcel>4<fcel>OTSL HTML<fcel>0.938 0.952<fcel>0.904 0.909<fcel>0.927 0.938<fcel>0.853 0.843<fcel>1.97 3.77<nl><fcel>2<fcel>4<fcel>OTSL HTML<fcel>0.923 0.945<fcel>0.897 0.901<fcel>0.915 0.931<fcel>0.859 0.834<fcel>1.91 3.81<nl><fcel>4<fcel>2<fcel>OTSL HTML<fcel>0.952 0.944<fcel>0.92 0.903<fcel>0.942 0.931<fcel>0.857 0.824<fcel>1.22 2<nl><caption><loc_110><loc_172><loc_393><loc_207>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption></otsl>
<section_header_level_1><loc_110><loc_319><loc_216><loc_327>5.2 Quantitative Results</section_header_level_1>
<text><loc_110><loc_330><loc_393><loc_390>We picked the model parameter configuration that produced the best prediction quality (enc=6, dec=6, heads=8) with PubTabNet alone, then independently trained and evaluated it on three publicly available data sets: PubTabNet (395k samples), FinTabNet (113k samples) and PubTables-1M (about 1M samples). Performance results are presented in Table. 2. It is clearly evident that the model trained on OTSL outperforms HTML across the board, keeping high TEDs and mAP scores even on difficult financial tables (FinTabNet) that contain sparse and large tables.</text>
<text><loc_110><loc_390><loc_393><loc_421>Additionally, the results show that OTSL has an advantage over HTML when applied on a bigger data set like PubTables-1M and achieves significantly improved scores. Finally, OTSL achieves faster inference due to fewer decoding steps which is a result of the reduced sequence representation.</text>

File diff suppressed because one or more lines are too long

View File

@ -6,14 +6,13 @@ We have chosen the PubTabNet data set to perform HPO, since it includes a highly
Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.
| # | # | Language | TEDs | TEDs | TEDs | mAP | Inference |
|------------|------------|------------|-------------|-------------|-------------|-------------|-------------|
| enc-layers | dec-layers | Language | simple | complex | all | (0.75) | time (secs) |
| 6 | 6 | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
| 4 | 4 | OTSL HTML | 0.938 | 0.904 | 0.927 | 0.853 | 1.97 |
| | | OTSL | 0.952 0.923 | 0.909 | 0.938 | 0.843 | 3.77 |
| 2 | 4 | HTML | 0.945 | 0.897 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81 |
| 4 | 2 | OTSL HTML | 0.952 0.944 | 0.92 0.903 | 0.942 0.931 | 0.857 0.824 | 1.22 2 |
| # enc-layers | # dec-layers | Language | TEDs | TEDs | TEDs | mAP | Inference |
|----------------|----------------|------------|-------------|-------------|-------------|-------------|-------------|
| # enc-layers | # dec-layers | Language | simple | complex | all | (0.75) | time (secs) |
| 6 | 6 | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
| 4 | 4 | OTSL HTML | 0.938 0.952 | 0.904 0.909 | 0.927 0.938 | 0.853 0.843 | 1.97 3.77 |
| 2 | 4 | OTSL HTML | 0.923 0.945 | 0.897 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81 |
| 4 | 2 | OTSL HTML | 0.952 0.944 | 0.92 0.903 | 0.942 0.931 | 0.857 0.824 | 1.22 2 |
## 5.2 Quantitative Results

File diff suppressed because one or more lines are too long

View File

@ -89,14 +89,14 @@
<text><loc_110><loc_75><loc_393><loc_96>order to compute the TED score. Inference timing results for all experiments were obtained from the same machine on a single core with AMD EPYC 7763 CPU @2.45 GHz.</text>
<section_header_level_1><loc_110><loc_107><loc_260><loc_112>5.1 Hyper Parameter Optimization</section_header_level_1>
<text><loc_110><loc_117><loc_393><loc_160>We have chosen the PubTabNet data set to perform HPO, since it includes a highly diverse set of tables. Also we report TED scores separately for simple and complex tables (tables with cell spans). Results are presented in Table. 1. It is evident that with OTSL, our model achieves the same TED score and slightly better mAP scores in comparison to HTML. However OTSL yields a 2x speed up in the inference runtime over HTML.</text>
<otsl><loc_114><loc_213><loc_388><loc_296><ched>#<ched>#<ched>Language<ched>TEDs<lcel><lcel><ched>mAP<ched>Inference<nl><ched>enc-layers<ched>dec-layers<ucel><ched>simple<ched>complex<ched>all<ched>(0.75)<ched>time (secs)<nl><fcel>6<fcel>6<fcel>OTSL HTML<fcel>0.965 0.969<fcel>0.934 0.927<fcel>0.955 0.955<fcel>0.88 0.857<fcel>2.73 5.39<nl><fcel>4<fcel>4<fcel>OTSL HTML<fcel>0.938 0.952<fcel>0.904<fcel>0.927<fcel>0.853<fcel>1.97<nl><fcel>2<fcel>4<fcel>OTSL<fcel>0.923 0.945<fcel>0.909 0.897<fcel>0.938<fcel>0.843<fcel>3.77<nl><ecel><ecel><fcel>HTML<ecel><fcel>0.901<fcel>0.915 0.931<fcel>0.859 0.834<fcel>1.91 3.81<nl><fcel>4<fcel>2<fcel>OTSL HTML<fcel>0.952 0.944<fcel>0.92 0.903<fcel>0.942 0.931<fcel>0.857 0.824<fcel>1.22 2<nl><caption><loc_110><loc_174><loc_393><loc_206>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption></otsl>
<otsl><loc_114><loc_213><loc_388><loc_296><ched># enc-layers<ched># dec-layers<ched>Language<ched>TEDs<lcel><lcel><ched>mAP<ched>Inference<nl><ucel><ucel><ucel><ched>simple<ched>complex<ched>all<ched>(0.75)<ched>time (secs)<nl><fcel>6<fcel>6<fcel>OTSL HTML<fcel>0.965 0.969<fcel>0.934 0.927<fcel>0.955 0.955<fcel>0.88 0.857<fcel>2.73 5.39<nl><fcel>4<fcel>4<fcel>OTSL HTML<fcel>0.938 0.952<fcel>0.904 0.909<fcel>0.927 0.938<fcel>0.853 0.843<fcel>1.97 3.77<nl><fcel>2<fcel>4<fcel>OTSL HTML<fcel>0.923 0.945<fcel>0.897 0.901<fcel>0.915 0.931<fcel>0.859 0.834<fcel>1.91 3.81<nl><fcel>4<fcel>2<fcel>OTSL HTML<fcel>0.952 0.944<fcel>0.92 0.903<fcel>0.942 0.931<fcel>0.857 0.824<fcel>1.22 2<nl><caption><loc_110><loc_174><loc_393><loc_206>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption></otsl>
<section_header_level_1><loc_110><loc_321><loc_216><loc_326>5.2 Quantitative Results</section_header_level_1>
<text><loc_110><loc_331><loc_393><loc_390>We picked the model parameter configuration that produced the best prediction quality (enc=6, dec=6, heads=8) with PubTabNet alone, then independently trained and evaluated it on three publicly available data sets: PubTabNet (395k samples), FinTabNet (113k samples) and PubTables-1M (about 1M samples). Performance results are presented in Table. 2. It is clearly evident that the model trained on OTSL outperforms HTML across the board, keeping high TEDs and mAP scores even on difficult financial tables (FinTabNet) that contain sparse and large tables.</text>
<text><loc_110><loc_392><loc_393><loc_420>Additionally, the results show that OTSL has an advantage over HTML when applied on a bigger data set like PubTables-1M and achieves significantly improved scores. Finally, OTSL achieves faster inference due to fewer decoding steps which is a result of the reduced sequence representation.</text>
<page_break>
<page_header><loc_110><loc_59><loc_118><loc_64>10</page_header>
<page_header><loc_137><loc_59><loc_189><loc_64>M. Lysak, et al.</page_header>
<otsl><loc_117><loc_99><loc_385><loc_166><ecel><ched>Language<ched>TEDs<lcel><lcel><ched>mAP(0.75)<ched>Inference time (secs)<nl><ecel><ucel><ched>simple<ched>complex<ched>all<ucel><ucel><nl><rhed>PubTabNet<rhed>OTSL<fcel>0.965<fcel>0.934<fcel>0.955<fcel>0.88<fcel>2.73<nl><ucel><rhed>HTML<fcel>0.969<fcel>0.927<fcel>0.955<fcel>0.857<fcel>5.39<nl><rhed>FinTabNet<rhed>OTSL<fcel>0.955<fcel>0.961<fcel>0.959<fcel>0.862<fcel>1.85<nl><ucel><rhed>HTML<fcel>0.917<fcel>0.922<fcel>0.92<fcel>0.722<fcel>3.26<nl><rhed>PubTables-1M<rhed>OTSL<fcel>0.987<fcel>0.964<fcel>0.977<fcel>0.896<fcel>1.79<nl><ucel><rhed>HTML<fcel>0.983<fcel>0.944<fcel>0.966<fcel>0.889<fcel>3.26<nl><caption><loc_110><loc_73><loc_393><loc_92>Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using TableFormer [9] (with enc=6, dec=6, heads=8).</caption></otsl>
<otsl><loc_117><loc_99><loc_385><loc_166><ched>Data set<ched>Language<ched>TEDs<lcel><lcel><ched>mAP(0.75)<ched>Inference time (secs)<nl><ucel><ucel><ched>simple<ched>complex<ched>all<ucel><ucel><nl><fcel>PubTabNet<fcel>OTSL HTML<fcel>0.965 0.969<fcel>0.934 0.927<fcel>0.955 0.955<fcel>0.88 0.857<fcel>2.73 5.39<nl><fcel>FinTabNet<fcel>OTSL HTML<fcel>0.955 0.917<fcel>0.961 0.922<fcel>0.959 0.92<fcel>0.862 0.722<fcel>1.85 3.26<nl><fcel>PubTables-1M<fcel>OTSL HTML<fcel>0.987 0.983<fcel>0.964 0.944<fcel>0.977 0.966<fcel>0.896 0.889<fcel>1.79 3.26<nl><caption><loc_110><loc_73><loc_393><loc_92>Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using TableFormer [9] (with enc=6, dec=6, heads=8).</caption></otsl>
<section_header_level_1><loc_110><loc_182><loc_210><loc_188>5.3 Qualitative Results</section_header_level_1>
<text><loc_110><loc_196><loc_393><loc_231>To illustrate the qualitative differences between OTSL and HTML, Figure 5 demonstrates less overlap and more accurate bounding boxes with OTSL. In Figure 6, OTSL proves to be more effective in handling tables with longer token sequences, resulting in even more precise structure prediction and bounding boxes.</text>
<picture><loc_133><loc_281><loc_369><loc_419><caption><loc_110><loc_251><loc_393><loc_278>Fig. 5. The OTSL model produces more accurate bounding boxes with less overlap (E) than the HTML model (D), when predicting the structure of a sparse table (A), at twice the inference speed because of shorter sequence length (B),(C). "PMC2807444_006_00.png" PubTabNet. μ</caption></picture>

File diff suppressed because one or more lines are too long

View File

@ -126,14 +126,13 @@ We have chosen the PubTabNet data set to perform HPO, since it includes a highly
Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.
| # | # | Language | TEDs | TEDs | TEDs | mAP | Inference |
|------------|------------|------------|-------------|-------------|-------------|-------------|-------------|
| enc-layers | dec-layers | Language | simple | complex | all | (0.75) | time (secs) |
| 6 | 6 | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
| 4 | 4 | OTSL HTML | 0.938 0.952 | 0.904 | 0.927 | 0.853 | 1.97 |
| 2 | 4 | OTSL | 0.923 0.945 | 0.909 0.897 | 0.938 | 0.843 | 3.77 |
| | | HTML | | 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81 |
| 4 | 2 | OTSL HTML | 0.952 0.944 | 0.92 0.903 | 0.942 0.931 | 0.857 0.824 | 1.22 2 |
| # enc-layers | # dec-layers | Language | TEDs | TEDs | TEDs | mAP | Inference |
|----------------|----------------|------------|-------------|-------------|-------------|-------------|-------------|
| # enc-layers | # dec-layers | Language | simple | complex | all | (0.75) | time (secs) |
| 6 | 6 | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
| 4 | 4 | OTSL HTML | 0.938 0.952 | 0.904 0.909 | 0.927 0.938 | 0.853 0.843 | 1.97 3.77 |
| 2 | 4 | OTSL HTML | 0.923 0.945 | 0.897 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81 |
| 4 | 2 | OTSL HTML | 0.952 0.944 | 0.92 0.903 | 0.942 0.931 | 0.857 0.824 | 1.22 2 |
## 5.2 Quantitative Results
@ -143,15 +142,12 @@ Additionally, the results show that OTSL has an advantage over HTML when applied
Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using TableFormer [9] (with enc=6, dec=6, heads=8).
| | Language | TEDs | TEDs | TEDs | mAP(0.75) | Inference time (secs) |
|--------------|------------|--------|---------|--------|-------------|-------------------------|
| | Language | simple | complex | all | mAP(0.75) | Inference time (secs) |
| PubTabNet | OTSL | 0.965 | 0.934 | 0.955 | 0.88 | 2.73 |
| PubTabNet | HTML | 0.969 | 0.927 | 0.955 | 0.857 | 5.39 |
| FinTabNet | OTSL | 0.955 | 0.961 | 0.959 | 0.862 | 1.85 |
| FinTabNet | HTML | 0.917 | 0.922 | 0.92 | 0.722 | 3.26 |
| PubTables-1M | OTSL | 0.987 | 0.964 | 0.977 | 0.896 | 1.79 |
| PubTables-1M | HTML | 0.983 | 0.944 | 0.966 | 0.889 | 3.26 |
| Data set | Language | TEDs | TEDs | TEDs | mAP(0.75) | Inference time (secs) |
|--------------|------------|-------------|-------------|-------------|-------------|-------------------------|
| Data set | Language | simple | complex | all | mAP(0.75) | Inference time (secs) |
| PubTabNet | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
| FinTabNet | OTSL HTML | 0.955 0.917 | 0.961 0.922 | 0.959 0.92 | 0.862 0.722 | 1.85 3.26 |
| PubTables-1M | OTSL HTML | 0.987 0.983 | 0.964 0.944 | 0.977 0.966 | 0.896 0.889 | 1.79 3.26 |
## 5.3 Qualitative Results

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@ -6,7 +6,7 @@ Empty unordered list:
Ordered list:
- bar
1. bar
Empty ordered list:

View File

@ -1,7 +1,7 @@
<doctag><section_header_level_1><loc_109><loc_79><loc_258><loc_87>JavaScript Code Example</section_header_level_1>
<text><loc_109><loc_94><loc_390><loc_183>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</text>
<text><loc_109><loc_185><loc_390><loc_213>Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet,</text>
<code<loc_110><loc_231><loc_215><loc_257><_unknown_>function add(a, b) { return a + b; } console.log(add(3, 5));</code
<code><loc_110><loc_231><loc_215><loc_257><_unknown_>function add(a, b) { return a + b; } console.log(add(3, 5));</code>
<caption><loc_182><loc_221><loc_317><loc_226>Listing 1: Simple JavaScript Program</caption>
<text><loc_109><loc_265><loc_390><loc_353>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</text>
<text><loc_109><loc_355><loc_390><loc_383>Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet,</text>

File diff suppressed because one or more lines are too long

View File

@ -51,7 +51,7 @@
"start_col_offset_idx": 0,
"end_col_offset_idx": 1,
"text": "1",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -63,7 +63,7 @@
"start_col_offset_idx": 1,
"end_col_offset_idx": 2,
"text": "2",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -75,7 +75,7 @@
"start_col_offset_idx": 2,
"end_col_offset_idx": 3,
"text": "3",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -87,7 +87,7 @@
"start_col_offset_idx": 3,
"end_col_offset_idx": 4,
"text": "4",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -296,7 +296,7 @@
"start_col_offset_idx": 0,
"end_col_offset_idx": 1,
"text": "1",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -308,7 +308,7 @@
"start_col_offset_idx": 1,
"end_col_offset_idx": 2,
"text": "2",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -320,7 +320,7 @@
"start_col_offset_idx": 2,
"end_col_offset_idx": 3,
"text": "3",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -332,7 +332,7 @@
"start_col_offset_idx": 3,
"end_col_offset_idx": 4,
"text": "4",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
}

View File

@ -51,7 +51,7 @@
"start_col_offset_idx": 0,
"end_col_offset_idx": 1,
"text": "Index",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -63,7 +63,7 @@
"start_col_offset_idx": 1,
"end_col_offset_idx": 2,
"text": "Customer Id",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -75,7 +75,7 @@
"start_col_offset_idx": 2,
"end_col_offset_idx": 3,
"text": "First Name",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -87,7 +87,7 @@
"start_col_offset_idx": 3,
"end_col_offset_idx": 4,
"text": "Last Name",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -99,7 +99,7 @@
"start_col_offset_idx": 4,
"end_col_offset_idx": 5,
"text": "Company",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -111,7 +111,7 @@
"start_col_offset_idx": 5,
"end_col_offset_idx": 6,
"text": "City",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -123,7 +123,7 @@
"start_col_offset_idx": 6,
"end_col_offset_idx": 7,
"text": "Country",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -135,7 +135,7 @@
"start_col_offset_idx": 7,
"end_col_offset_idx": 8,
"text": "Phone 1",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -147,7 +147,7 @@
"start_col_offset_idx": 8,
"end_col_offset_idx": 9,
"text": "Phone 2",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -159,7 +159,7 @@
"start_col_offset_idx": 9,
"end_col_offset_idx": 10,
"text": "Email",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -171,7 +171,7 @@
"start_col_offset_idx": 10,
"end_col_offset_idx": 11,
"text": "Subscription Date",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -183,7 +183,7 @@
"start_col_offset_idx": 11,
"end_col_offset_idx": 12,
"text": "Website",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -920,7 +920,7 @@
"start_col_offset_idx": 0,
"end_col_offset_idx": 1,
"text": "Index",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -932,7 +932,7 @@
"start_col_offset_idx": 1,
"end_col_offset_idx": 2,
"text": "Customer Id",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -944,7 +944,7 @@
"start_col_offset_idx": 2,
"end_col_offset_idx": 3,
"text": "First Name",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -956,7 +956,7 @@
"start_col_offset_idx": 3,
"end_col_offset_idx": 4,
"text": "Last Name",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -968,7 +968,7 @@
"start_col_offset_idx": 4,
"end_col_offset_idx": 5,
"text": "Company",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -980,7 +980,7 @@
"start_col_offset_idx": 5,
"end_col_offset_idx": 6,
"text": "City",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -992,7 +992,7 @@
"start_col_offset_idx": 6,
"end_col_offset_idx": 7,
"text": "Country",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1004,7 +1004,7 @@
"start_col_offset_idx": 7,
"end_col_offset_idx": 8,
"text": "Phone 1",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1016,7 +1016,7 @@
"start_col_offset_idx": 8,
"end_col_offset_idx": 9,
"text": "Phone 2",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1028,7 +1028,7 @@
"start_col_offset_idx": 9,
"end_col_offset_idx": 10,
"text": "Email",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1040,7 +1040,7 @@
"start_col_offset_idx": 10,
"end_col_offset_idx": 11,
"text": "Subscription Date",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1052,7 +1052,7 @@
"start_col_offset_idx": 11,
"end_col_offset_idx": 12,
"text": "Website",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
}

View File

@ -51,7 +51,7 @@
"start_col_offset_idx": 0,
"end_col_offset_idx": 1,
"text": "1",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -63,7 +63,7 @@
"start_col_offset_idx": 1,
"end_col_offset_idx": 2,
"text": "2",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -75,7 +75,7 @@
"start_col_offset_idx": 2,
"end_col_offset_idx": 3,
"text": "3",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -284,7 +284,7 @@
"start_col_offset_idx": 0,
"end_col_offset_idx": 1,
"text": "1",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -296,7 +296,7 @@
"start_col_offset_idx": 1,
"end_col_offset_idx": 2,
"text": "2",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -308,7 +308,7 @@
"start_col_offset_idx": 2,
"end_col_offset_idx": 3,
"text": "3",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},

View File

@ -51,7 +51,7 @@
"start_col_offset_idx": 0,
"end_col_offset_idx": 1,
"text": "Index",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -63,7 +63,7 @@
"start_col_offset_idx": 1,
"end_col_offset_idx": 2,
"text": "Customer Id",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -75,7 +75,7 @@
"start_col_offset_idx": 2,
"end_col_offset_idx": 3,
"text": "First Name",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -87,7 +87,7 @@
"start_col_offset_idx": 3,
"end_col_offset_idx": 4,
"text": "Last Name",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -99,7 +99,7 @@
"start_col_offset_idx": 4,
"end_col_offset_idx": 5,
"text": "Company",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -111,7 +111,7 @@
"start_col_offset_idx": 5,
"end_col_offset_idx": 6,
"text": "City",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -123,7 +123,7 @@
"start_col_offset_idx": 6,
"end_col_offset_idx": 7,
"text": "Country",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -135,7 +135,7 @@
"start_col_offset_idx": 7,
"end_col_offset_idx": 8,
"text": "Phone 1",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -147,7 +147,7 @@
"start_col_offset_idx": 8,
"end_col_offset_idx": 9,
"text": "Phone 2",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -159,7 +159,7 @@
"start_col_offset_idx": 9,
"end_col_offset_idx": 10,
"text": "Email",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -171,7 +171,7 @@
"start_col_offset_idx": 10,
"end_col_offset_idx": 11,
"text": "Subscription Date",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -183,7 +183,7 @@
"start_col_offset_idx": 11,
"end_col_offset_idx": 12,
"text": "Website",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -920,7 +920,7 @@
"start_col_offset_idx": 0,
"end_col_offset_idx": 1,
"text": "Index",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -932,7 +932,7 @@
"start_col_offset_idx": 1,
"end_col_offset_idx": 2,
"text": "Customer Id",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -944,7 +944,7 @@
"start_col_offset_idx": 2,
"end_col_offset_idx": 3,
"text": "First Name",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -956,7 +956,7 @@
"start_col_offset_idx": 3,
"end_col_offset_idx": 4,
"text": "Last Name",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -968,7 +968,7 @@
"start_col_offset_idx": 4,
"end_col_offset_idx": 5,
"text": "Company",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -980,7 +980,7 @@
"start_col_offset_idx": 5,
"end_col_offset_idx": 6,
"text": "City",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -992,7 +992,7 @@
"start_col_offset_idx": 6,
"end_col_offset_idx": 7,
"text": "Country",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1004,7 +1004,7 @@
"start_col_offset_idx": 7,
"end_col_offset_idx": 8,
"text": "Phone 1",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1016,7 +1016,7 @@
"start_col_offset_idx": 8,
"end_col_offset_idx": 9,
"text": "Phone 2",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1028,7 +1028,7 @@
"start_col_offset_idx": 9,
"end_col_offset_idx": 10,
"text": "Email",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1040,7 +1040,7 @@
"start_col_offset_idx": 10,
"end_col_offset_idx": 11,
"text": "Subscription Date",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1052,7 +1052,7 @@
"start_col_offset_idx": 11,
"end_col_offset_idx": 12,
"text": "Website",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
}

View File

@ -51,7 +51,7 @@
"start_col_offset_idx": 0,
"end_col_offset_idx": 1,
"text": "Index",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -63,7 +63,7 @@
"start_col_offset_idx": 1,
"end_col_offset_idx": 2,
"text": "Customer Id",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -75,7 +75,7 @@
"start_col_offset_idx": 2,
"end_col_offset_idx": 3,
"text": "First Name",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -87,7 +87,7 @@
"start_col_offset_idx": 3,
"end_col_offset_idx": 4,
"text": "Last Name",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -99,7 +99,7 @@
"start_col_offset_idx": 4,
"end_col_offset_idx": 5,
"text": "Company",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -111,7 +111,7 @@
"start_col_offset_idx": 5,
"end_col_offset_idx": 6,
"text": "City",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -123,7 +123,7 @@
"start_col_offset_idx": 6,
"end_col_offset_idx": 7,
"text": "Country",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -135,7 +135,7 @@
"start_col_offset_idx": 7,
"end_col_offset_idx": 8,
"text": "Phone 1",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -147,7 +147,7 @@
"start_col_offset_idx": 8,
"end_col_offset_idx": 9,
"text": "Phone 2",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -159,7 +159,7 @@
"start_col_offset_idx": 9,
"end_col_offset_idx": 10,
"text": "Email",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -171,7 +171,7 @@
"start_col_offset_idx": 10,
"end_col_offset_idx": 11,
"text": "Subscription Date",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -183,7 +183,7 @@
"start_col_offset_idx": 11,
"end_col_offset_idx": 12,
"text": "Website",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -920,7 +920,7 @@
"start_col_offset_idx": 0,
"end_col_offset_idx": 1,
"text": "Index",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -932,7 +932,7 @@
"start_col_offset_idx": 1,
"end_col_offset_idx": 2,
"text": "Customer Id",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -944,7 +944,7 @@
"start_col_offset_idx": 2,
"end_col_offset_idx": 3,
"text": "First Name",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -956,7 +956,7 @@
"start_col_offset_idx": 3,
"end_col_offset_idx": 4,
"text": "Last Name",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -968,7 +968,7 @@
"start_col_offset_idx": 4,
"end_col_offset_idx": 5,
"text": "Company",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -980,7 +980,7 @@
"start_col_offset_idx": 5,
"end_col_offset_idx": 6,
"text": "City",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -992,7 +992,7 @@
"start_col_offset_idx": 6,
"end_col_offset_idx": 7,
"text": "Country",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1004,7 +1004,7 @@
"start_col_offset_idx": 7,
"end_col_offset_idx": 8,
"text": "Phone 1",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1016,7 +1016,7 @@
"start_col_offset_idx": 8,
"end_col_offset_idx": 9,
"text": "Phone 2",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1028,7 +1028,7 @@
"start_col_offset_idx": 9,
"end_col_offset_idx": 10,
"text": "Email",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1040,7 +1040,7 @@
"start_col_offset_idx": 10,
"end_col_offset_idx": 11,
"text": "Subscription Date",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1052,7 +1052,7 @@
"start_col_offset_idx": 11,
"end_col_offset_idx": 12,
"text": "Website",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
}

View File

@ -51,7 +51,7 @@
"start_col_offset_idx": 0,
"end_col_offset_idx": 1,
"text": "Index",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -63,7 +63,7 @@
"start_col_offset_idx": 1,
"end_col_offset_idx": 2,
"text": "Customer Id",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -75,7 +75,7 @@
"start_col_offset_idx": 2,
"end_col_offset_idx": 3,
"text": "First Name",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -87,7 +87,7 @@
"start_col_offset_idx": 3,
"end_col_offset_idx": 4,
"text": "Last Name",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -99,7 +99,7 @@
"start_col_offset_idx": 4,
"end_col_offset_idx": 5,
"text": "Company",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -111,7 +111,7 @@
"start_col_offset_idx": 5,
"end_col_offset_idx": 6,
"text": "City",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -123,7 +123,7 @@
"start_col_offset_idx": 6,
"end_col_offset_idx": 7,
"text": "Country",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -135,7 +135,7 @@
"start_col_offset_idx": 7,
"end_col_offset_idx": 8,
"text": "Phone 1",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -147,7 +147,7 @@
"start_col_offset_idx": 8,
"end_col_offset_idx": 9,
"text": "Phone 2",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -159,7 +159,7 @@
"start_col_offset_idx": 9,
"end_col_offset_idx": 10,
"text": "Email",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -171,7 +171,7 @@
"start_col_offset_idx": 10,
"end_col_offset_idx": 11,
"text": "Subscription Date",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -183,7 +183,7 @@
"start_col_offset_idx": 11,
"end_col_offset_idx": 12,
"text": "Website",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -920,7 +920,7 @@
"start_col_offset_idx": 0,
"end_col_offset_idx": 1,
"text": "Index",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -932,7 +932,7 @@
"start_col_offset_idx": 1,
"end_col_offset_idx": 2,
"text": "Customer Id",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -944,7 +944,7 @@
"start_col_offset_idx": 2,
"end_col_offset_idx": 3,
"text": "First Name",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -956,7 +956,7 @@
"start_col_offset_idx": 3,
"end_col_offset_idx": 4,
"text": "Last Name",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -968,7 +968,7 @@
"start_col_offset_idx": 4,
"end_col_offset_idx": 5,
"text": "Company",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -980,7 +980,7 @@
"start_col_offset_idx": 5,
"end_col_offset_idx": 6,
"text": "City",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -992,7 +992,7 @@
"start_col_offset_idx": 6,
"end_col_offset_idx": 7,
"text": "Country",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1004,7 +1004,7 @@
"start_col_offset_idx": 7,
"end_col_offset_idx": 8,
"text": "Phone 1",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1016,7 +1016,7 @@
"start_col_offset_idx": 8,
"end_col_offset_idx": 9,
"text": "Phone 2",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1028,7 +1028,7 @@
"start_col_offset_idx": 9,
"end_col_offset_idx": 10,
"text": "Email",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1040,7 +1040,7 @@
"start_col_offset_idx": 10,
"end_col_offset_idx": 11,
"text": "Subscription Date",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -1052,7 +1052,7 @@
"start_col_offset_idx": 11,
"end_col_offset_idx": 12,
"text": "Website",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
}

View File

@ -51,7 +51,7 @@
"start_col_offset_idx": 0,
"end_col_offset_idx": 1,
"text": "1",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -63,7 +63,7 @@
"start_col_offset_idx": 1,
"end_col_offset_idx": 2,
"text": "2",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -75,7 +75,7 @@
"start_col_offset_idx": 2,
"end_col_offset_idx": 3,
"text": "3",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -87,7 +87,7 @@
"start_col_offset_idx": 3,
"end_col_offset_idx": 4,
"text": "4",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -284,7 +284,7 @@
"start_col_offset_idx": 0,
"end_col_offset_idx": 1,
"text": "1",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -296,7 +296,7 @@
"start_col_offset_idx": 1,
"end_col_offset_idx": 2,
"text": "2",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -308,7 +308,7 @@
"start_col_offset_idx": 2,
"end_col_offset_idx": 3,
"text": "3",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
},
@ -320,7 +320,7 @@
"start_col_offset_idx": 3,
"end_col_offset_idx": 4,
"text": "4",
"column_header": false,
"column_header": true,
"row_header": false,
"row_section": false
}

Some files were not shown because too many files have changed in this diff Show More