mirror of
https://github.com/DS4SD/docling.git
synced 2025-07-27 04:24:45 +00:00
Merge from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
commit
412c013d95
2
.github/SECURITY.md
vendored
@ -20,4 +20,4 @@ After the initial reply to your report, the security team will keep you informed
|
|||||||
|
|
||||||
## Security Alerts
|
## Security Alerts
|
||||||
|
|
||||||
We will send announcements of security vulnerabilities and steps to remediate on the [Docling announcements](https://github.com/DS4SD/docling/discussions/categories/announcements).
|
We will send announcements of security vulnerabilities and steps to remediate on the [Docling announcements](https://github.com/docling-project/docling/discussions/categories/announcements).
|
||||||
|
2
.github/workflows/ci-docs.yml
vendored
@ -10,7 +10,7 @@ on:
|
|||||||
|
|
||||||
jobs:
|
jobs:
|
||||||
build-docs:
|
build-docs:
|
||||||
if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'DS4SD/docling' && github.event.pull_request.head.repo.full_name != 'ds4sd/docling') }}
|
if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'docling-project/docling' && github.event.pull_request.head.repo.full_name != 'docling-project/docling') }}
|
||||||
uses: ./.github/workflows/docs.yml
|
uses: ./.github/workflows/docs.yml
|
||||||
with:
|
with:
|
||||||
deploy: false
|
deploy: false
|
||||||
|
2
.github/workflows/ci.yml
vendored
@ -15,5 +15,5 @@ env:
|
|||||||
|
|
||||||
jobs:
|
jobs:
|
||||||
code-checks:
|
code-checks:
|
||||||
if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'DS4SD/docling' && github.event.pull_request.head.repo.full_name != 'ds4sd/docling') }}
|
if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'docling-project/docling' && github.event.pull_request.head.repo.full_name != 'docling-project/docling') }}
|
||||||
uses: ./.github/workflows/checks.yml
|
uses: ./.github/workflows/checks.yml
|
||||||
|
666
CHANGELOG.md
File diff suppressed because it is too large
@ -2,13 +2,13 @@
|
|||||||
Our project welcomes external contributions. If you have an itch, please feel
|
Our project welcomes external contributions. If you have an itch, please feel
|
||||||
free to scratch it.
|
free to scratch it.
|
||||||
|
|
||||||
To contribute code or documentation, please submit a [pull request](https://github.com/DS4SD/docling/pulls).
|
To contribute code or documentation, please submit a [pull request](https://github.com/docling-project/docling/pulls).
|
||||||
|
|
||||||
A good way to familiarize yourself with the codebase and contribution process is
|
A good way to familiarize yourself with the codebase and contribution process is
|
||||||
to look for and tackle low-hanging fruit in the [issue tracker](https://github.com/DS4SD/docling/issues).
|
to look for and tackle low-hanging fruit in the [issue tracker](https://github.com/docling-project/docling/issues).
|
||||||
Before embarking on a more ambitious contribution, please quickly [get in touch](#communication) with us.
|
Before embarking on a more ambitious contribution, please quickly [get in touch](#communication) with us.
|
||||||
|
|
||||||
For general questions or support requests, please refer to the [discussion section](https://github.com/DS4SD/docling/discussions).
|
For general questions or support requests, please refer to the [discussion section](https://github.com/docling-project/docling/discussions).
|
||||||
|
|
||||||
**Note: We appreciate your effort and want to avoid situations where a contribution
|
**Note: We appreciate your effort and want to avoid situations where a contribution
|
||||||
requires extensive rework (by you or by us), sits in the backlog for a long time, or
|
requires extensive rework (by you or by us), sits in the backlog for a long time, or
|
||||||
@ -16,14 +16,14 @@ cannot be accepted at all!**
|
|||||||
|
|
||||||
### Proposing New Features
|
### Proposing New Features
|
||||||
|
|
||||||
If you would like to implement a new feature, please [raise an issue](https://github.com/DS4SD/docling/issues)
|
If you would like to implement a new feature, please [raise an issue](https://github.com/docling-project/docling/issues)
|
||||||
before sending a pull request so the feature can be discussed. This is to avoid
|
before sending a pull request so the feature can be discussed. This is to avoid
|
||||||
you spending valuable time working on a feature that the project developers
|
you spending valuable time working on a feature that the project developers
|
||||||
are not interested in accepting into the codebase.
|
are not interested in accepting into the codebase.
|
||||||
|
|
||||||
### Fixing Bugs
|
### Fixing Bugs
|
||||||
|
|
||||||
If you would like to fix a bug, please [raise an issue](https://github.com/DS4SD/docling/issues) before sending a
|
If you would like to fix a bug, please [raise an issue](https://github.com/docling-project/docling/issues) before sending a
|
||||||
pull request so it can be tracked.
|
pull request so it can be tracked.
|
||||||
|
|
||||||
### Merge Approval
|
### Merge Approval
|
||||||
@ -78,7 +78,7 @@ This project strictly adheres to using dependencies that are compatible with the
|
|||||||
|
|
||||||
## Communication
|
## Communication
|
||||||
|
|
||||||
Please feel free to connect with us using the [discussion section](https://github.com/DS4SD/docling/discussions).
|
Please feel free to connect with us using the [discussion section](https://github.com/docling-project/docling/discussions).
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
28
README.md
@ -1,6 +1,6 @@
|
|||||||
<p align="center">
|
<p align="center">
|
||||||
<a href="https://github.com/ds4sd/docling">
|
<a href="https://github.com/docling-project/docling">
|
||||||
<img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/docs/assets/docling_processing.png" width="100%"/>
|
<img loading="lazy" alt="Docling" src="https://github.com/docling-project/docling/raw/main/docs/assets/docling_processing.png" width="100%"/>
|
||||||
</a>
|
</a>
|
||||||
</p>
|
</p>
|
||||||
|
|
||||||
@ -11,7 +11,7 @@
|
|||||||
</p>
|
</p>
|
||||||
|
|
||||||
[](https://arxiv.org/abs/2408.09869)
|
[](https://arxiv.org/abs/2408.09869)
|
||||||
[](https://ds4sd.github.io/docling/)
|
[](https://docling-project.github.io/docling/)
|
||||||
[](https://pypi.org/project/docling/)
|
[](https://pypi.org/project/docling/)
|
||||||
[](https://pypi.org/project/docling/)
|
[](https://pypi.org/project/docling/)
|
||||||
[](https://python-poetry.org/)
|
[](https://python-poetry.org/)
|
||||||
@ -19,7 +19,7 @@
|
|||||||
[](https://pycqa.github.io/isort/)
|
[](https://pycqa.github.io/isort/)
|
||||||
[](https://pydantic.dev)
|
[](https://pydantic.dev)
|
||||||
[](https://github.com/pre-commit/pre-commit)
|
[](https://github.com/pre-commit/pre-commit)
|
||||||
[](https://opensource.org/licenses/MIT)
|
[](https://opensource.org/licenses/MIT)
|
||||||
[](https://pepy.tech/projects/docling)
|
[](https://pepy.tech/projects/docling)
|
||||||
|
|
||||||
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
|
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
|
||||||
@ -51,7 +51,7 @@ pip install docling
|
|||||||
|
|
||||||
Works on macOS, Linux and Windows environments. Both x86_64 and arm64 architectures.
|
Works on macOS, Linux and Windows environments. Both x86_64 and arm64 architectures.
|
||||||
|
|
||||||
More [detailed installation instructions](https://ds4sd.github.io/docling/installation/) are available in the docs.
|
More [detailed installation instructions](https://docling-project.github.io/docling/installation/) are available in the docs.
|
||||||
|
|
||||||
## Getting started
|
## Getting started
|
||||||
|
|
||||||
@ -66,28 +66,28 @@ result = converter.convert(source)
|
|||||||
print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]"
|
print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]"
|
||||||
```
|
```
|
||||||
|
|
||||||
More [advanced usage options](https://ds4sd.github.io/docling/usage/) are available in
|
More [advanced usage options](https://docling-project.github.io/docling/usage/) are available in
|
||||||
the docs.
|
the docs.
|
||||||
|
|
||||||
## Documentation
|
## Documentation
|
||||||
|
|
||||||
Check out Docling's [documentation](https://ds4sd.github.io/docling/), for details on
|
Check out Docling's [documentation](https://docling-project.github.io/docling/), for details on
|
||||||
installation, usage, concepts, recipes, extensions, and more.
|
installation, usage, concepts, recipes, extensions, and more.
|
||||||
|
|
||||||
## Examples
|
## Examples
|
||||||
|
|
||||||
Go hands-on with our [examples](https://ds4sd.github.io/docling/examples/),
|
Go hands-on with our [examples](https://docling-project.github.io/docling/examples/),
|
||||||
demonstrating how to address different application use cases with Docling.
|
demonstrating how to address different application use cases with Docling.
|
||||||
|
|
||||||
## Integrations
|
## Integrations
|
||||||
|
|
||||||
To further accelerate your AI application development, check out Docling's native
|
To further accelerate your AI application development, check out Docling's native
|
||||||
[integrations](https://ds4sd.github.io/docling/integrations/) with popular frameworks
|
[integrations](https://docling-project.github.io/docling/integrations/) with popular frameworks
|
||||||
and tools.
|
and tools.
|
||||||
|
|
||||||
## Get help and support
|
## Get help and support
|
||||||
|
|
||||||
Please feel free to connect with us using the [discussion section](https://github.com/DS4SD/docling/discussions).
|
Please feel free to connect with us using the [discussion section](https://github.com/docling-project/docling/discussions).
|
||||||
|
|
||||||
## Technical report
|
## Technical report
|
||||||
|
|
||||||
@ -95,7 +95,7 @@ For more details on Docling's inner workings, check out the [Docling Technical R
|
|||||||
|
|
||||||
## Contributing
|
## Contributing
|
||||||
|
|
||||||
Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.
|
Please read [Contributing to Docling](https://github.com/docling-project/docling/blob/main/CONTRIBUTING.md) for details.
|
||||||
|
|
||||||
## References
|
## References
|
||||||
|
|
||||||
@ -123,6 +123,6 @@ For individual model usage, please refer to the model licenses found in the orig
|
|||||||
|
|
||||||
Docling has been brought to you by IBM.
|
Docling has been brought to you by IBM.
|
||||||
|
|
||||||
[supported_formats]: https://ds4sd.github.io/docling/usage/supported_formats/
|
[supported_formats]: https://docling-project.github.io/docling/usage/supported_formats/
|
||||||
[docling_document]: https://ds4sd.github.io/docling/concepts/docling_document/
|
[docling_document]: https://docling-project.github.io/docling/concepts/docling_document/
|
||||||
[integrations]: https://ds4sd.github.io/docling/integrations/
|
[integrations]: https://docling-project.github.io/docling/integrations/
|
||||||
|
@ -380,7 +380,7 @@ class AsciiDocBackend(DeclarativeDocumentBackend):
|
|||||||
end_row_offset_idx=row_idx + row_span,
|
end_row_offset_idx=row_idx + row_span,
|
||||||
start_col_offset_idx=col_idx,
|
start_col_offset_idx=col_idx,
|
||||||
end_col_offset_idx=col_idx + col_span,
|
end_col_offset_idx=col_idx + col_span,
|
||||||
col_header=False,
|
column_header=row_idx == 0,
|
||||||
row_header=False,
|
row_header=False,
|
||||||
)
|
)
|
||||||
data.table_cells.append(cell)
|
data.table_cells.append(cell)
|
||||||
|
@ -111,7 +111,7 @@ class CsvDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
end_row_offset_idx=row_idx + 1,
|
end_row_offset_idx=row_idx + 1,
|
||||||
start_col_offset_idx=col_idx,
|
start_col_offset_idx=col_idx,
|
||||||
end_col_offset_idx=col_idx + 1,
|
end_col_offset_idx=col_idx + 1,
|
||||||
col_header=row_idx == 0, # First row as header
|
column_header=row_idx == 0, # First row as header
|
||||||
row_header=False,
|
row_header=False,
|
||||||
)
|
)
|
||||||
table_data.table_cells.append(cell)
|
table_data.table_cells.append(cell)
|
||||||
|
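Both backend hunks above switch the keyword from `col_header` to `column_header`, and the AsciiDoc backend now marks the first row as a header the same way the CSV backend does. The first-row-as-header logic can be sketched as follows. This is a minimal illustration, assuming a hypothetical `Cell` dataclass as a stand-in for the real cell model and a hypothetical helper `cells_from_csv`; neither is part of docling.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical stand-in for a table cell model (not docling's class)."""

    text: str
    start_row_offset_idx: int
    end_row_offset_idx: int
    start_col_offset_idx: int
    end_col_offset_idx: int
    column_header: bool = False
    row_header: bool = False


def cells_from_csv(text):
    """Build one Cell per CSV field, flagging the first row as column headers."""
    cells = []
    for row_idx, row in enumerate(csv.reader(io.StringIO(text))):
        for col_idx, value in enumerate(row):
            cells.append(
                Cell(
                    text=value,
                    start_row_offset_idx=row_idx,
                    end_row_offset_idx=row_idx + 1,
                    start_col_offset_idx=col_idx,
                    end_col_offset_idx=col_idx + 1,
                    column_header=row_idx == 0,  # first row as header
                    row_header=False,
                )
            )
    return cells
```

With a dataclass, a misspelled keyword such as `col_header=` fails loudly with a `TypeError`. A model that silently ignores unknown keywords would instead leave the header flag at its default, which is presumably why the rename matters here.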
0
docling/backend/docx/__init__.py
Normal file
0
docling/backend/docx/latex/__init__.py
Normal file
271
docling/backend/docx/latex/latex_dict.py
Normal file
@ -0,0 +1,271 @@
|
|||||||
|
# -*- coding: utf-8 -*-
|
||||||
|
|
||||||
|
"""
|
||||||
|
Adapted from https://github.com/xiilei/dwml/blob/master/dwml/latex_dict.py
|
||||||
|
On 23/01/2025
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
CHARS = ("{", "}", "_", "^", "#", "&", "$", "%", "~")
|
||||||
|
|
||||||
|
BLANK = ""
|
||||||
|
BACKSLASH = "\\"
|
||||||
|
ALN = "&"
|
||||||
|
|
||||||
|
CHR = {
|
||||||
|
# Unicode : Latex Math Symbols
|
||||||
|
# Top accents
|
||||||
|
"\u0300": "\\grave{{{0}}}",
|
||||||
|
"\u0301": "\\acute{{{0}}}",
|
||||||
|
"\u0302": "\\hat{{{0}}}",
|
||||||
|
"\u0303": "\\tilde{{{0}}}",
|
||||||
|
"\u0304": "\\bar{{{0}}}",
|
||||||
|
"\u0305": "\\overbar{{{0}}}",
|
||||||
|
"\u0306": "\\breve{{{0}}}",
|
||||||
|
"\u0307": "\\dot{{{0}}}",
|
||||||
|
"\u0308": "\\ddot{{{0}}}",
|
||||||
|
"\u0309": "\\ovhook{{{0}}}",
|
||||||
|
"\u030a": "\\ocirc{{{0}}}",
|
||||||
|
"\u030c": "\\check{{{0}}}",
|
||||||
|
"\u0310": "\\candra{{{0}}}",
|
||||||
|
"\u0312": "\\oturnedcomma{{{0}}}",
|
||||||
|
"\u0315": "\\ocommatopright{{{0}}}",
|
||||||
|
"\u031a": "\\droang{{{0}}}",
|
||||||
|
"\u0338": "\\not{{{0}}}",
|
||||||
|
"\u20d0": "\\leftharpoonaccent{{{0}}}",
|
||||||
|
"\u20d1": "\\rightharpoonaccent{{{0}}}",
|
||||||
|
"\u20d2": "\\vertoverlay{{{0}}}",
|
||||||
|
"\u20d6": "\\overleftarrow{{{0}}}",
|
||||||
|
"\u20d7": "\\vec{{{0}}}",
|
||||||
|
"\u20db": "\\dddot{{{0}}}",
|
||||||
|
"\u20dc": "\\ddddot{{{0}}}",
|
||||||
|
"\u20e1": "\\overleftrightarrow{{{0}}}",
|
||||||
|
"\u20e7": "\\annuity{{{0}}}",
|
||||||
|
"\u20e9": "\\widebridgeabove{{{0}}}",
|
||||||
|
"\u20f0": "\\asteraccent{{{0}}}",
|
||||||
|
# Bottom accents
|
||||||
|
"\u0330": "\\wideutilde{{{0}}}",
|
||||||
|
"\u0331": "\\underbar{{{0}}}",
|
||||||
|
"\u20e8": "\\threeunderdot{{{0}}}",
|
||||||
|
"\u20ec": "\\underrightharpoondown{{{0}}}",
|
||||||
|
"\u20ed": "\\underleftharpoondown{{{0}}}",
|
||||||
|
"\u20ee": "\\underleftarrow{{{0}}}",
|
||||||
|
"\u20ef": "\\underrightarrow{{{0}}}",
|
||||||
|
# Over | group
|
||||||
|
"\u23b4": "\\overbracket{{{0}}}",
|
||||||
|
"\u23dc": "\\overparen{{{0}}}",
|
||||||
|
"\u23de": "\\overbrace{{{0}}}",
|
||||||
|
# Under| group
|
||||||
|
"\u23b5": "\\underbracket{{{0}}}",
|
||||||
|
"\u23dd": "\\underparen{{{0}}}",
|
||||||
|
"\u23df": "\\underbrace{{{0}}}",
|
||||||
|
}
|
||||||
|
|
||||||
|
CHR_BO = {
|
||||||
|
# Big operators,
|
||||||
|
"\u2140": "\\Bbbsum",
|
||||||
|
"\u220f": "\\prod",
|
||||||
|
"\u2210": "\\coprod",
|
||||||
|
"\u2211": "\\sum",
|
||||||
|
"\u222b": "\\int",
|
||||||
|
"\u22c0": "\\bigwedge",
|
||||||
|
"\u22c1": "\\bigvee",
|
||||||
|
"\u22c2": "\\bigcap",
|
||||||
|
"\u22c3": "\\bigcup",
|
||||||
|
"\u2a00": "\\bigodot",
|
||||||
|
"\u2a01": "\\bigoplus",
|
||||||
|
"\u2a02": "\\bigotimes",
|
||||||
|
}
|
||||||
|
|
||||||
|
T = {
|
||||||
|
"\u2192": "\\rightarrow ",
|
||||||
|
# Greek letters
|
||||||
|
"\U0001d6fc": "\\alpha ",
|
||||||
|
"\U0001d6fd": "\\beta ",
|
||||||
|
"\U0001d6fe": "\\gamma ",
|
||||||
|
"\U0001d6ff": "\\delta ",
|
||||||
|
"\U0001d700": "\\epsilon ",
|
||||||
|
"\U0001d701": "\\zeta ",
|
||||||
|
"\U0001d702": "\\eta ",
|
||||||
|
"\U0001d703": "\\theta ",
|
||||||
|
"\U0001d704": "\\iota ",
|
||||||
|
"\U0001d705": "\\kappa ",
|
||||||
|
"\U0001d706": "\\lambda ",
|
||||||
|
"\U0001d707": "\\mu ",
|
||||||
|
"\U0001d708": "\\nu ",
|
||||||
|
"\U0001d709": "\\xi ",
|
||||||
|
"\U0001d70a": "\\omicron ",
|
||||||
|
"\U0001d70b": "\\pi ",
|
||||||
|
"\U0001d70c": "\\rho ",
|
||||||
|
"\U0001d70d": "\\varsigma ",
|
||||||
|
"\U0001d70e": "\\sigma ",
|
||||||
|
"\U0001d70f": "\\tau ",
|
||||||
|
"\U0001d710": "\\upsilon ",
|
||||||
|
"\U0001d711": "\\phi ",
|
||||||
|
"\U0001d712": "\\chi ",
|
||||||
|
"\U0001d713": "\\psi ",
|
||||||
|
"\U0001d714": "\\omega ",
|
||||||
|
"\U0001d715": "\\partial ",
|
||||||
|
"\U0001d716": "\\varepsilon ",
|
||||||
|
"\U0001d717": "\\vartheta ",
|
||||||
|
"\U0001d718": "\\varkappa ",
|
||||||
|
"\U0001d719": "\\varphi ",
|
||||||
|
"\U0001d71a": "\\varrho ",
|
||||||
|
"\U0001d71b": "\\varpi ",
|
||||||
|
# Relation symbols
|
||||||
|
"\u2190": "\\leftarrow ",
|
||||||
|
"\u2191": "\\uparrow ",
|
||||||
|
"\u2192": "\\rightarrow ",
|
||||||
|
"\u2193": "\\downarrow ",
|
||||||
|
"\u2194": "\\leftrightarrow ",
|
||||||
|
"\u2195": "\\updownarrow ",
|
||||||
|
"\u2196": "\\nwarrow ",
|
||||||
|
"\u2197": "\\nearrow ",
|
||||||
|
"\u2198": "\\searrow ",
|
||||||
|
"\u2199": "\\swarrow ",
|
||||||
|
"\u22ee": "\\vdots ",
|
||||||
|
"\u22ef": "\\cdots ",
|
||||||
|
"\u22f0": "\\adots ",
|
||||||
|
"\u22f1": "\\ddots ",
|
||||||
|
"\u2260": "\\ne ",
|
||||||
|
"\u2264": "\\leq ",
|
||||||
|
"\u2265": "\\geq ",
|
||||||
|
"\u2266": "\\leqq ",
|
||||||
|
"\u2267": "\\geqq ",
|
||||||
|
"\u2268": "\\lneqq ",
|
||||||
|
"\u2269": "\\gneqq ",
|
||||||
|
"\u226a": "\\ll ",
|
||||||
|
"\u226b": "\\gg ",
|
||||||
|
"\u2208": "\\in ",
|
||||||
|
"\u2209": "\\notin ",
|
||||||
|
"\u220b": "\\ni ",
|
||||||
|
"\u220c": "\\nni ",
|
||||||
|
# Ordinary symbols
|
||||||
|
"\u221e": "\\infty ",
|
||||||
|
# Binary relations
|
||||||
|
"\u00b1": "\\pm ",
|
||||||
|
"\u2213": "\\mp ",
|
||||||
|
# Italic, Latin, uppercase
|
||||||
|
"\U0001d434": "A",
|
||||||
|
"\U0001d435": "B",
|
||||||
|
"\U0001d436": "C",
|
||||||
|
"\U0001d437": "D",
|
||||||
|
"\U0001d438": "E",
|
||||||
|
"\U0001d439": "F",
|
||||||
|
"\U0001d43a": "G",
|
||||||
|
"\U0001d43b": "H",
|
||||||
|
"\U0001d43c": "I",
|
||||||
|
"\U0001d43d": "J",
|
||||||
|
"\U0001d43e": "K",
|
||||||
|
"\U0001d43f": "L",
|
||||||
|
"\U0001d440": "M",
|
||||||
|
"\U0001d441": "N",
|
||||||
|
"\U0001d442": "O",
|
||||||
|
"\U0001d443": "P",
|
||||||
|
"\U0001d444": "Q",
|
||||||
|
"\U0001d445": "R",
|
||||||
|
"\U0001d446": "S",
|
||||||
|
"\U0001d447": "T",
|
||||||
|
"\U0001d448": "U",
|
||||||
|
"\U0001d449": "V",
|
||||||
|
"\U0001d44a": "W",
|
||||||
|
"\U0001d44b": "X",
|
||||||
|
"\U0001d44c": "Y",
|
||||||
|
"\U0001d44d": "Z",
|
||||||
|
# Italic, Latin, lowercase
|
||||||
|
"\U0001d44e": "a",
|
||||||
|
"\U0001d44f": "b",
|
||||||
|
"\U0001d450": "c",
|
||||||
|
"\U0001d451": "d",
|
||||||
|
"\U0001d452": "e",
|
||||||
|
"\U0001d453": "f",
|
||||||
|
"\U0001d454": "g",
|
||||||
|
"\U0001d456": "i",
|
||||||
|
"\U0001d457": "j",
|
||||||
|
"\U0001d458": "k",
|
||||||
|
"\U0001d459": "l",
|
||||||
|
"\U0001d45a": "m",
|
||||||
|
"\U0001d45b": "n",
|
||||||
|
"\U0001d45c": "o",
|
||||||
|
"\U0001d45d": "p",
|
||||||
|
"\U0001d45e": "q",
|
||||||
|
"\U0001d45f": "r",
|
||||||
|
"\U0001d460": "s",
|
||||||
|
"\U0001d461": "t",
|
||||||
|
"\U0001d462": "u",
|
||||||
|
"\U0001d463": "v",
|
||||||
|
"\U0001d464": "w",
|
||||||
|
"\U0001d465": "x",
|
||||||
|
"\U0001d466": "y",
|
||||||
|
"\U0001d467": "z",
|
||||||
|
}
|
||||||
|
|
||||||
|
FUNC = {
|
||||||
|
"sin": "\\sin({fe})",
|
||||||
|
"cos": "\\cos({fe})",
|
||||||
|
"tan": "\\tan({fe})",
|
||||||
|
"arcsin": "\\arcsin({fe})",
|
||||||
|
"arccos": "\\arccos({fe})",
|
||||||
|
"arctan": "\\arctan({fe})",
|
||||||
|
"arccot": "\\arccot({fe})",
|
||||||
|
"sinh": "\\sinh({fe})",
|
||||||
|
"cosh": "\\cosh({fe})",
|
||||||
|
"tanh": "\\tanh({fe})",
|
||||||
|
"coth": "\\coth({fe})",
|
||||||
|
"sec": "\\sec({fe})",
|
||||||
|
"csc": "\\csc({fe})",
|
||||||
|
}
|
||||||
|
|
||||||
|
FUNC_PLACE = "{fe}"
|
||||||
|
|
||||||
|
BRK = "\\\\"
|
||||||
|
|
||||||
|
CHR_DEFAULT = {
|
||||||
|
"ACC_VAL": "\\hat{{{0}}}",
|
||||||
|
}
|
||||||
|
|
||||||
|
POS = {
|
||||||
|
"top": "\\overline{{{0}}}", # not sure
|
||||||
|
"bot": "\\underline{{{0}}}",
|
||||||
|
}
|
||||||
|
|
||||||
|
POS_DEFAULT = {
|
||||||
|
"BAR_VAL": "\\overline{{{0}}}",
|
||||||
|
}
|
||||||
|
|
||||||
|
SUB = "_{{{0}}}"
|
||||||
|
|
||||||
|
SUP = "^{{{0}}}"
|
||||||
|
|
||||||
|
F = {
|
||||||
|
"bar": "\\frac{{{num}}}{{{den}}}",
|
||||||
|
"skw": r"^{{{num}}}/_{{{den}}}",
|
||||||
|
"noBar": "\\genfrac{{}}{{}}{{0pt}}{{}}{{{num}}}{{{den}}}",
|
||||||
|
"lin": "{{{num}}}/{{{den}}}",
|
||||||
|
}
|
||||||
|
F_DEFAULT = "\\frac{{{num}}}{{{den}}}"
|
||||||
|
|
||||||
|
D = "\\left{left}{text}\\right{right}"
|
||||||
|
|
||||||
|
D_DEFAULT = {
|
||||||
|
"left": "(",
|
||||||
|
"right": ")",
|
||||||
|
"null": ".",
|
||||||
|
}
|
||||||
|
|
||||||
|
RAD = "\\sqrt[{deg}]{{{text}}}"
|
||||||
|
RAD_DEFAULT = "\\sqrt{{{text}}}"
|
||||||
|
ARR = "{text}"
|
||||||
|
|
||||||
|
LIM_FUNC = {
|
||||||
|
"lim": "\\lim_{{{lim}}}",
|
||||||
|
"max": "\\max_{{{lim}}}",
|
||||||
|
"min": "\\min_{{{lim}}}",
|
||||||
|
}
|
||||||
|
|
||||||
|
LIM_TO = ("\\rightarrow", "\\to")
|
||||||
|
|
||||||
|
LIM_UPP = "\\overset{{{lim}}}{{{text}}}"
|
||||||
|
|
||||||
|
M = "\\begin{{matrix}}{text}\\end{{matrix}}"
|
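The entries in `latex_dict.py` are `str.format` templates: a doubled brace (`{{` or `}}`) becomes a literal LaTeX brace, while `{0}`, `{num}`, `{den}`, `{deg}` and `{text}` are placeholders. A minimal sketch of how such templates expand, with a few of them copied from the file; the helpers `render_examples` and `is_valid_accent_template` are illustrative names, not part of the module:

```python
# Templates copied from latex_dict.py; doubled braces are literal braces.
SUB = "_{{{0}}}"
SUP = "^{{{0}}}"
F_DEFAULT = "\\frac{{{num}}}{{{den}}}"
RAD = "\\sqrt[{deg}]{{{text}}}"
M = "\\begin{{matrix}}{text}\\end{{matrix}}"


def render_examples():
    """Expand a handful of templates with sample arguments."""
    return {
        "sub": "x" + SUB.format("i"),
        "sup": "x" + SUP.format("2"),
        "frac": F_DEFAULT.format(num="a", den="b"),
        "rad": RAD.format(deg="3", text="x"),
        "matrix": M.format(text="a&b\\\\c&d"),
    }


def is_valid_accent_template(template):
    """Return True if a positional template such as "\\hat{{{0}}}" formats cleanly."""
    try:
        template.format("x")
    except (ValueError, IndexError, KeyError):
        return False
    return True
```

Because single braces are significant to `str.format`, a template with one unbalanced trailing `}` raises `ValueError` at conversion time rather than producing malformed LaTeX, so a check along the lines of `is_valid_accent_template` over all positional entries is a cheap sanity test.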
453
docling/backend/docx/latex/omml.py
Normal file
@ -0,0 +1,453 @@
|
|||||||
|
"""
|
||||||
|
Office Math Markup Language (OMML)
|
||||||
|
|
||||||
|
Adapted from https://github.com/xiilei/dwml/blob/master/dwml/omml.py
|
||||||
|
On 23/01/2025
|
||||||
|
"""
|
||||||
|
|
||||||
|
import lxml.etree as ET
|
||||||
|
from pylatexenc.latexencode import UnicodeToLatexEncoder
|
||||||
|
|
||||||
|
from docling.backend.docx.latex.latex_dict import (
|
||||||
|
ALN,
|
||||||
|
ARR,
|
||||||
|
BACKSLASH,
|
||||||
|
BLANK,
|
||||||
|
BRK,
|
||||||
|
CHARS,
|
||||||
|
CHR,
|
||||||
|
CHR_BO,
|
||||||
|
CHR_DEFAULT,
|
||||||
|
D_DEFAULT,
|
||||||
|
F_DEFAULT,
|
||||||
|
FUNC,
|
||||||
|
FUNC_PLACE,
|
||||||
|
LIM_FUNC,
|
||||||
|
LIM_TO,
|
||||||
|
LIM_UPP,
|
||||||
|
POS,
|
||||||
|
POS_DEFAULT,
|
||||||
|
RAD,
|
||||||
|
RAD_DEFAULT,
|
||||||
|
SUB,
|
||||||
|
SUP,
|
||||||
|
D,
|
||||||
|
F,
|
||||||
|
M,
|
||||||
|
T,
|
||||||
|
)
|
||||||
|
|
||||||
|
OMML_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/math}"
|
||||||
|
|
||||||
|
|
||||||
|
def load(stream):
|
||||||
|
tree = ET.parse(stream)
|
||||||
|
for omath in tree.findall(OMML_NS + "oMath"):
|
||||||
|
yield oMath2Latex(omath)
|
||||||
|
|
||||||
|
|
||||||
|
def load_string(string):
|
||||||
|
root = ET.fromstring(string)
|
||||||
|
for omath in root.findall(OMML_NS + "oMath"):
|
||||||
|
yield oMath2Latex(omath)
|
||||||
|
|
||||||
|
|
||||||
|
def escape_latex(strs):
|
||||||
|
last = None
|
||||||
|
new_chr = []
|
||||||
|
strs = strs.replace(r"\\", "\\")
|
||||||
|
for c in strs:
|
||||||
|
if (c in CHARS) and (last != BACKSLASH):
|
||||||
|
new_chr.append(BACKSLASH + c)
|
||||||
|
else:
|
||||||
|
new_chr.append(c)
|
||||||
|
last = c
|
||||||
|
return BLANK.join(new_chr)
|
||||||
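`escape_latex` prefixes each LaTeX special character with a backslash unless the preceding character is already a backslash. A self-contained sketch of the same logic with indentation restored (the page rendering above has lost it); `CHARS` and `BACKSLASH` are copied from `latex_dict.py`, and the parameter name `text` is an illustrative rename:

```python
CHARS = ("{", "}", "_", "^", "#", "&", "$", "%", "~")
BACKSLASH = "\\"


def escape_latex(text):
    """Backslash-escape LaTeX special characters that are not already escaped."""
    last = None
    out = []
    # Collapse a doubled backslash into a single one before scanning.
    text = text.replace(r"\\", "\\")
    for c in text:
        if c in CHARS and last != BACKSLASH:
            out.append(BACKSLASH + c)
        else:
            out.append(c)
        last = c
    return "".join(out)
```

In `do_d` this is applied to delimiter characters, so that a brace delimiter ends up as `\left\{` rather than an unescaped `{`.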
|
|
||||||
|
|
||||||
|
def get_val(key, default=None, store=CHR):
|
||||||
|
if key is not None:
|
||||||
|
return key if not store else store.get(key, key)
|
||||||
|
else:
|
||||||
|
return default
|
||||||
|
|
||||||
|
|
||||||
|
class Tag2Method(object):
|
||||||
|
|
||||||
|
def call_method(self, elm, stag=None):
|
||||||
|
getmethod = self.tag2meth.get
|
||||||
|
if stag is None:
|
||||||
|
stag = elm.tag.replace(OMML_NS, "")
|
||||||
|
method = getmethod(stag)
|
||||||
|
if method:
|
||||||
|
return method(self, elm)
|
||||||
|
else:
|
||||||
|
return None
|
||||||
|
|
||||||
|
def process_children_list(self, elm, include=None):
|
||||||
|
"""
|
||||||
|
process children of the elm,return iterable
|
||||||
|
"""
|
||||||
|
for _e in list(elm):
|
||||||
|
if OMML_NS not in _e.tag:
|
||||||
|
continue
|
||||||
|
stag = _e.tag.replace(OMML_NS, "")
|
||||||
|
if include and (stag not in include):
|
||||||
|
continue
|
||||||
|
t = self.call_method(_e, stag=stag)
|
||||||
|
if t is None:
|
||||||
|
t = self.process_unknow(_e, stag)
|
||||||
|
if t is None:
|
||||||
|
continue
|
||||||
|
yield (stag, t, _e)
|
||||||
|
|
||||||
|
def process_children_dict(self, elm, include=None):
|
||||||
|
"""
|
||||||
|
process children of the elm,return dict
|
||||||
|
"""
|
||||||
|
latex_chars = dict()
|
||||||
|
for stag, t, e in self.process_children_list(elm, include):
|
||||||
|
latex_chars[stag] = t
|
||||||
|
return latex_chars
|
||||||
|
|
||||||
|
def process_children(self, elm, include=None):
|
||||||
|
"""
|
||||||
|
process children of the elm,return string
|
||||||
|
"""
|
||||||
|
return BLANK.join(
|
||||||
|
(
|
||||||
|
t if not isinstance(t, Tag2Method) else str(t)
|
||||||
|
for stag, t, e in self.process_children_list(elm, include)
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
def process_unknow(self, elm, stag):
|
||||||
|
return None
|
||||||
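`Tag2Method` dispatches on the element's tag name: the namespace prefix is stripped, the short tag is looked up in the class-level `tag2meth` dict, and the stored function is called with `self` passed explicitly. A much reduced sketch of that pattern, assuming a hypothetical `MiniConverter` class that handles only runs and superscripts and uses the standard library's `xml.etree` instead of `lxml`:

```python
import xml.etree.ElementTree as ET

OMML_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/math}"


class MiniConverter:
    """Hypothetical, much reduced stand-in for the converter classes."""

    # Container tags whose children are simply processed in order.
    passthrough = ("sSup", "e")

    def do_r(self, elm):
        # A run: concatenate the text of its <m:t> children.
        return "".join(t.text or "" for t in elm.findall(OMML_NS + "t"))

    def do_sup(self, elm):
        return "^{" + self.process_children(elm) + "}"

    # Plain functions are stored here, so they are called as method(self, elm).
    tag2meth = {"r": do_r, "sup": do_sup}

    def process_children(self, elm):
        parts = []
        for child in elm:
            stag = child.tag.replace(OMML_NS, "")
            method = self.tag2meth.get(stag)
            if method is not None:
                parts.append(method(self, child))
            elif stag in self.passthrough:
                parts.append(self.process_children(child))
            # Any other tag is skipped in this sketch.
        return "".join(parts)


def convert(xml_text):
    """Convert the children of a single <m:oMath> element to LaTeX."""
    return MiniConverter().process_children(ET.fromstring(xml_text))
```

The dict has to be defined after the functions it refers to, which is why `tag2meth` appears at the end of the class body in `Pr` as well.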
|
|
||||||
|
|
||||||
|
class Pr(Tag2Method):
|
||||||
|
|
||||||
|
text = ""
|
||||||
|
|
||||||
|
__val_tags = ("chr", "pos", "begChr", "endChr", "type")
|
||||||
|
|
||||||
|
__innerdict = None # can't use the __dict__
|
||||||
|
|
||||||
|
""" common properties of element"""
|
||||||
|
|
||||||
|
def __init__(self, elm):
|
||||||
|
self.__innerdict = {}
|
||||||
|
self.text = self.process_children(elm)
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
return self.text
|
||||||
|
|
||||||
|
def __unicode__(self):
|
||||||
|
return self.__str__()
|
||||||
|
|
||||||
|
def __getattr__(self, name):
|
||||||
|
return self.__innerdict.get(name, None)
|
||||||
|
|
||||||
|
def do_brk(self, elm):
|
||||||
|
self.__innerdict["brk"] = BRK
|
||||||
|
return BRK
|
||||||
|
|
||||||
|
def do_common(self, elm):
|
||||||
|
stag = elm.tag.replace(OMML_NS, "")
|
||||||
|
if stag in self.__val_tags:
|
||||||
|
t = elm.get("{0}val".format(OMML_NS))
|
||||||
|
self.__innerdict[stag] = t
|
||||||
|
return None
|
||||||
|
|
||||||
|
tag2meth = {
|
||||||
|
"brk": do_brk,
|
||||||
|
"chr": do_common,
|
||||||
|
"pos": do_common,
|
||||||
|
"begChr": do_common,
|
||||||
|
"endChr": do_common,
|
||||||
|
"type": do_common,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
class oMath2Latex(Tag2Method):
|
||||||
|
"""
|
||||||
|
Convert oMath element of omml to latex
|
||||||
|
"""
|
||||||
|
|
||||||
|
_t_dict = T
|
||||||
|
|
||||||
|
__direct_tags = ("box", "sSub", "sSup", "sSubSup", "num", "den", "deg", "e")
|
||||||
|
u = UnicodeToLatexEncoder(
|
||||||
|
replacement_latex_protection="braces-all",
|
||||||
|
unknown_char_policy="keep",
|
||||||
|
unknown_char_warning=False,
|
||||||
|
)
|
||||||
|
|
||||||
|
def __init__(self, element):
|
||||||
|
self._latex = self.process_children(element)
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
return self.latex.replace("  ", " ")
|
||||||
|
|
||||||
|
def __unicode__(self):
|
||||||
|
return self.__str__()
|
||||||
|
|
||||||
|
def process_unknow(self, elm, stag):
|
||||||
|
if stag in self.__direct_tags:
|
||||||
|
return self.process_children(elm)
|
||||||
|
elif stag[-2:] == "Pr":
|
||||||
|
return Pr(elm)
|
||||||
|
else:
|
||||||
|
return None
|
||||||
|
|
||||||
|
@property
|
||||||
|
def latex(self):
|
||||||
|
return self._latex
|
||||||
|
|
||||||
|
def do_acc(self, elm):
|
||||||
|
"""
|
||||||
|
the accent function
|
||||||
|
"""
|
||||||
|
c_dict = self.process_children_dict(elm)
|
||||||
|
latex_s = get_val(
|
||||||
|
c_dict["accPr"].chr, default=CHR_DEFAULT.get("ACC_VAL"), store=CHR
|
||||||
|
)
|
||||||
|
return latex_s.format(c_dict["e"])
|
||||||
|
|
||||||
|
def do_bar(self, elm):
|
||||||
|
"""
|
||||||
|
the bar function
|
||||||
|
"""
|
||||||
|
c_dict = self.process_children_dict(elm)
|
||||||
|
pr = c_dict["barPr"]
|
||||||
|
latex_s = get_val(pr.pos, default=POS_DEFAULT.get("BAR_VAL"), store=POS)
|
||||||
|
return pr.text + latex_s.format(c_dict["e"])
|
||||||
|
|
||||||
|
def do_d(self, elm):
|
||||||
|
"""
|
||||||
|
the delimiter object
|
||||||
|
"""
|
||||||
|
c_dict = self.process_children_dict(elm)
|
||||||
|
pr = c_dict["dPr"]
|
||||||
|
null = D_DEFAULT.get("null")
|
||||||
|
|
||||||
|
s_val = get_val(pr.begChr, default=D_DEFAULT.get("left"), store=T)
|
||||||
|
e_val = get_val(pr.endChr, default=D_DEFAULT.get("right"), store=T)
|
||||||
|
delim = pr.text + D.format(
|
||||||
|
left=null if not s_val else escape_latex(s_val),
|
||||||
|
text=c_dict["e"],
|
||||||
|
right=null if not e_val else escape_latex(e_val),
|
||||||
|
)
|
||||||
|
return delim
|
||||||
|
|
||||||
|
def do_spre(self, elm):
|
||||||
|
"""
|
||||||
|
the Pre-Sub-Superscript object -- not supported yet
|
||||||
|
"""
|
||||||
|
pass
|
||||||
|
|
||||||
|
def do_sub(self, elm):
|
||||||
|
text = self.process_children(elm)
|
||||||
|
return SUB.format(text)
|
||||||
|
|
||||||
|
def do_sup(self, elm):
|
||||||
|
text = self.process_children(elm)
|
||||||
|
return SUP.format(text)
|
||||||
|
|
||||||
|
def do_f(self, elm):
|
||||||
|
"""
|
||||||
|
the fraction object
|
||||||
|
"""
|
||||||
|
c_dict = self.process_children_dict(elm)
|
||||||
|
pr = c_dict["fPr"]
|
||||||
|
latex_s = get_val(pr.type, default=F_DEFAULT, store=F)
|
||||||
|
return pr.text + latex_s.format(num=c_dict.get("num"), den=c_dict.get("den"))
|
||||||
|
|
||||||
|
def do_func(self, elm):
|
||||||
|
"""
|
||||||
|
the Function-Apply object (Examples:sin cos)
|
||||||
|
"""
|
||||||
|
c_dict = self.process_children_dict(elm)
|
||||||
|
func_name = c_dict.get("fName")
|
||||||
|
return func_name.replace(FUNC_PLACE, c_dict.get("e"))
|
||||||
|
|
||||||
|
def do_fname(self, elm):
|
||||||
|
"""
|
||||||
|
the func name
|
||||||
|
"""
|
||||||
|
latex_chars = []
|
||||||
|
for stag, t, e in self.process_children_list(elm):
|
||||||
|
if stag == "r":
|
||||||
|
if FUNC.get(t):
|
||||||
|
latex_chars.append(FUNC[t])
|
||||||
|
else:
|
||||||
|
raise NotSupport("Not support func %s" % t)
|
||||||
|
else:
|
||||||
|
latex_chars.append(t)
|
||||||
|
t = BLANK.join(latex_chars)
|
||||||
|
return t if FUNC_PLACE in t else t + FUNC_PLACE # do_func will replace this
|
||||||
|
|
||||||
|
def do_groupchr(self, elm):
|
||||||
|
"""
|
||||||
|
the Group-Character object
|
||||||
|
"""
|
||||||
|
c_dict = self.process_children_dict(elm)
|
||||||
|
pr = c_dict["groupChrPr"]
|
||||||
|
latex_s = get_val(pr.chr)
|
||||||
|
return pr.text + latex_s.format(c_dict["e"])
|
||||||
|
|
||||||
|
def do_rad(self, elm):
|
||||||
|
"""
|
||||||
|
the radical object
|
||||||
|
"""
|
||||||
|
c_dict = self.process_children_dict(elm)
|
||||||
|
text = c_dict.get("e")
|
||||||
|
deg_text = c_dict.get("deg")
|
||||||
|
if deg_text:
|
||||||
|
return RAD.format(deg=deg_text, text=text)
|
||||||
|
else:
|
||||||
|
return RAD_DEFAULT.format(text=text)
|
||||||
|
|
||||||
|
def do_eqarr(self, elm):
|
||||||
|
"""
|
||||||
|
the Array object
|
||||||
|
"""
|
||||||
|
return ARR.format(
|
||||||
|
text=BRK.join(
|
||||||
|
[t for stag, t, e in self.process_children_list(elm, include=("e",))]
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
def do_limlow(self, elm):
|
||||||
|
"""
|
||||||
|
the Lower-Limit object
|
||||||
|
"""
|
||||||
|
t_dict = self.process_children_dict(elm, include=("e", "lim"))
|
||||||
|
latex_s = LIM_FUNC.get(t_dict["e"])
|
||||||
|
if not latex_s:
|
||||||
|
raise NotSupport("Not support lim %s" % t_dict["e"])
|
||||||
|
else:
|
||||||
|
return latex_s.format(lim=t_dict.get("lim"))
|
||||||
|
|
||||||
|
def do_limupp(self, elm):
|
||||||
|
"""
|
||||||
|
the Upper-Limit object
|
||||||
|
"""
|
||||||
|
t_dict = self.process_children_dict(elm, include=("e", "lim"))
|
||||||
|
return LIM_UPP.format(lim=t_dict.get("lim"), text=t_dict.get("e"))
|
||||||
|
|
||||||
|
def do_lim(self, elm):
|
||||||
|
"""
|
||||||
|
the lower limit of the limLow object and the upper limit of the limUpp object
|
||||||
|
"""
|
||||||
|
return self.process_children(elm).replace(LIM_TO[0], LIM_TO[1])
|
||||||
|
|
||||||
|
def do_m(self, elm):
|
||||||
|
"""
|
||||||
|
the Matrix object
|
||||||
|
"""
|
||||||
|
rows = []
|
||||||
|
for stag, t, e in self.process_children_list(elm):
|
||||||
|
if stag == "mPr":
|
||||||
|
pass
|
||||||
|
elif stag == "mr":
|
||||||
|
rows.append(t)
|
||||||
|
return M.format(text=BRK.join(rows))
|
||||||
|
|
||||||
|
def do_mr(self, elm):
|
||||||
|
"""
|
||||||
|
a single row of the matrix m
|
||||||
|
"""
|
||||||
|
return ALN.join(
|
||||||
|
[t for stag, t, e in self.process_children_list(elm, include=("e",))]
|
||||||
|
)
|
||||||
|
|
||||||
|
def do_nary(self, elm):
|
||||||
|
"""
|
||||||
|
the n-ary object
|
||||||
|
"""
|
||||||
|
res = []
|
||||||
|
bo = ""
|
||||||
|
for stag, t, e in self.process_children_list(elm):
|
||||||
|
if stag == "naryPr":
|
||||||
|
bo = get_val(t.chr, store=CHR_BO)
|
||||||
|
else:
|
||||||
|
res.append(t)
|
||||||
|
return bo + BLANK.join(res)
|
||||||
|
|
||||||
|
def process_unicode(self, s):
|
||||||
|
# s = s if isinstance(s,unicode) else unicode(s,'utf-8')
|
||||||
|
# print(s, self._t_dict.get(s, s), unicode_to_latex(s))
|
||||||
|
# _str.append( self._t_dict.get(s, s) )
|
||||||
|
|
||||||
|
out_latex_str = self.u.unicode_to_latex(s)
|
||||||
|
|
||||||
|
# print(s, out_latex_str)
|
||||||
|
|
||||||
|
if (
|
||||||
|
s.startswith("{") is False
|
||||||
|
and out_latex_str.startswith("{")
|
||||||
|
and s.endswith("}") is False
|
||||||
|
and out_latex_str.endswith("}")
|
||||||
|
):
|
||||||
|
out_latex_str = f" {out_latex_str[1:-1]} "
|
||||||
|
|
||||||
|
# print(s, out_latex_str)
|
||||||
|
|
||||||
|
if "ensuremath" in out_latex_str:
|
||||||
|
out_latex_str = out_latex_str.replace("\\ensuremath{", " ")
|
||||||
|
out_latex_str = out_latex_str.replace("}", " ")
|
||||||
|
|
||||||
|
# print(s, out_latex_str)
|
||||||
|
|
||||||
|
if out_latex_str.strip().startswith("\\text"):
|
||||||
|
out_latex_str = f" \\text{{{out_latex_str}}} "
|
||||||
|
|
||||||
|
# print(s, out_latex_str)
|
||||||
|
|
||||||
|
return out_latex_str
|
||||||
|
|
||||||
|
def do_r(self, elm):
|
||||||
|
"""
|
||||||
|
Get text from 'r' element and try to convert it to LaTeX symbols
|
||||||
|
@todo text style support , (sty)
|
||||||
|
@todo \text (latex pure text support)
|
||||||
|
"""
|
||||||
|
_str = []
|
||||||
|
_base_str = []
|
||||||
|
for s in elm.findtext("./{0}t".format(OMML_NS)):
|
||||||
|
out_latex_str = self.process_unicode(s)
|
||||||
|
_str.append(out_latex_str)
|
||||||
|
_base_str.append(s)
|
||||||
|
|
||||||
|
proc_str = escape_latex(BLANK.join(_str))
|
||||||
|
base_proc_str = BLANK.join(_base_str)
|
||||||
|
|
||||||
|
if "{" not in base_proc_str and "\\{" in proc_str:
|
||||||
|
proc_str = proc_str.replace("\\{", "{")
|
||||||
|
|
||||||
|
if "}" not in base_proc_str and "\\}" in proc_str:
|
||||||
|
proc_str = proc_str.replace("\\}", "}")
|
||||||
|
|
||||||
|
return proc_str
|
||||||
|
|
||||||
|
tag2meth = {
|
||||||
|
"acc": do_acc,
|
||||||
|
"r": do_r,
|
||||||
|
"bar": do_bar,
|
||||||
|
"sub": do_sub,
|
||||||
|
"sup": do_sup,
|
||||||
|
"f": do_f,
|
||||||
|
"func": do_func,
|
||||||
|
"fName": do_fname,
|
||||||
|
"groupChr": do_groupchr,
|
||||||
|
"d": do_d,
|
||||||
|
"rad": do_rad,
|
||||||
|
"eqArr": do_eqarr,
|
||||||
|
"limLow": do_limlow,
|
||||||
|
"limUpp": do_limupp,
|
||||||
|
"lim": do_lim,
|
||||||
|
"m": do_m,
|
||||||
|
"mr": do_mr,
|
||||||
|
"nary": do_nary,
|
||||||
|
}
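Review note: the `tag2meth` table above maps each OMML local tag name to a handler method. A minimal sketch of that dispatch pattern follows, assuming a reduced converter with only two handlers. `MiniConverter`, its `convert` method, and the `\sqrt` templates are hypothetical stand-ins for illustration; the real `RAD` and `RAD_DEFAULT` constants are not shown in this diff.

```python
import xml.etree.ElementTree as ET

# Namespace of Office Math Markup Language (OMML) elements.
OMML_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/math}"


class MiniConverter:
    """Hypothetical, reduced converter that dispatches on the local tag name."""

    def convert(self, elm):
        # Strip the namespace to get the local tag name, e.g. "rad" or "r".
        stag = elm.tag.replace(OMML_NS, "")
        method = self.tag2meth.get(stag)
        if method is None:
            # No handler: concatenate whatever the children produce.
            return "".join(self.convert(child) for child in elm)
        # Table entries are plain functions, so self is passed explicitly.
        return method(self, elm)

    def do_r(self, elm):
        # A run: return its text unchanged (no symbol mapping in this sketch).
        return elm.findtext(f"./{OMML_NS}t") or ""

    def do_rad(self, elm):
        # A radical: optional degree plus the radicand.
        deg = elm.find(f"./{OMML_NS}deg")
        base = elm.find(f"./{OMML_NS}e")
        deg_text = self.convert(deg) if deg is not None else ""
        text = self.convert(base) if base is not None else ""
        if deg_text:
            return f"\\sqrt[{deg_text}]{{{text}}}"  # assumed template
        return f"\\sqrt{{{text}}}"  # assumed template

    # Built inside the class body, so the values are unbound functions.
    tag2meth = {"r": do_r, "rad": do_rad}


xml = (
    '<m:rad xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math">'
    "<m:deg><m:r><m:t>3</m:t></m:r></m:deg>"
    "<m:e><m:r><m:t>x</m:t></m:r></m:e>"
    "</m:rad>"
)
result = MiniConverter().convert(ET.fromstring(xml))
print(result)
```

Keeping the handlers in a dictionary means a new OMML element can be supported by adding one method and one table entry, without touching the traversal code.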
|
@ -134,7 +134,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
self.analyze_tag(cast(Tag, element), doc)
|
self.analyze_tag(cast(Tag, element), doc)
|
||||||
except Exception as exc_child:
|
except Exception as exc_child:
|
||||||
_log.error(
|
_log.error(
|
||||||
f"Error processing child from tag{tag.name}: {exc_child}"
|
f"Error processing child from tag {tag.name}: {repr(exc_child)}"
|
||||||
)
|
)
|
||||||
raise exc_child
|
raise exc_child
|
||||||
elif isinstance(element, NavigableString) and not isinstance(
|
elif isinstance(element, NavigableString) and not isinstance(
|
||||||
@ -347,11 +347,11 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
content_layer=self.content_layer,
|
content_layer=self.content_layer,
|
||||||
)
|
)
|
||||||
self.level += 1
|
self.level += 1
|
||||||
|
|
||||||
self.walk(element, doc)
|
self.walk(element, doc)
|
||||||
|
|
||||||
self.parents[self.level + 1] = None
|
self.parents[self.level + 1] = None
|
||||||
self.level -= 1
|
self.level -= 1
|
||||||
|
else:
|
||||||
|
self.walk(element, doc)
|
||||||
|
|
||||||
elif element.text.strip():
|
elif element.text.strip():
|
||||||
text = element.text.strip()
|
text = element.text.strip()
|
||||||
@ -457,7 +457,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
end_row_offset_idx=row_idx + row_span,
|
end_row_offset_idx=row_idx + row_span,
|
||||||
start_col_offset_idx=col_idx,
|
start_col_offset_idx=col_idx,
|
||||||
end_col_offset_idx=col_idx + col_span,
|
end_col_offset_idx=col_idx + col_span,
|
||||||
col_header=col_header,
|
column_header=col_header,
|
||||||
row_header=((not col_header) and html_cell.name == "th"),
|
row_header=((not col_header) and html_cell.name == "th"),
|
||||||
)
|
)
|
||||||
data.table_cells.append(table_cell)
|
data.table_cells.append(table_cell)
|
||||||
|
@ -136,7 +136,7 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
end_row_offset_idx=trow_ind + row_span,
|
end_row_offset_idx=trow_ind + row_span,
|
||||||
start_col_offset_idx=tcol_ind,
|
start_col_offset_idx=tcol_ind,
|
||||||
end_col_offset_idx=tcol_ind + col_span,
|
end_col_offset_idx=tcol_ind + col_span,
|
||||||
col_header=False,
|
column_header=trow_ind == 0,
|
||||||
row_header=False,
|
row_header=False,
|
||||||
)
|
)
|
||||||
tcells.append(icell)
|
tcells.append(icell)
|
||||||
|
@ -164,7 +164,7 @@ class MsExcelDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
end_row_offset_idx=excel_cell.row + excel_cell.row_span,
|
end_row_offset_idx=excel_cell.row + excel_cell.row_span,
|
||||||
start_col_offset_idx=excel_cell.col,
|
start_col_offset_idx=excel_cell.col,
|
||||||
end_col_offset_idx=excel_cell.col + excel_cell.col_span,
|
end_col_offset_idx=excel_cell.col + excel_cell.col_span,
|
||||||
col_header=False,
|
column_header=excel_cell.row == 0,
|
||||||
row_header=False,
|
row_header=False,
|
||||||
)
|
)
|
||||||
table_data.table_cells.append(cell)
|
table_data.table_cells.append(cell)
|
||||||
@ -173,7 +173,7 @@ class MsExcelDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
|
|
||||||
return doc
|
return doc
|
||||||
|
|
||||||
def _find_data_tables(self, sheet: Worksheet):
|
def _find_data_tables(self, sheet: Worksheet) -> List[ExcelTable]:
|
||||||
"""
|
"""
|
||||||
Find all compact rectangular data tables in a sheet.
|
Find all compact rectangular data tables in a sheet.
|
||||||
"""
|
"""
|
||||||
@ -340,47 +340,4 @@ class MsExcelDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
except:
|
except:
|
||||||
_log.error("could not extract the image from excel sheets")
|
_log.error("could not extract the image from excel sheets")
|
||||||
|
|
||||||
"""
|
|
||||||
for idx, chart in enumerate(sheet._charts): # type: ignore
|
|
||||||
try:
|
|
||||||
chart_path = f"chart_{idx + 1}.png"
|
|
||||||
_log.info(
|
|
||||||
f"Chart found, but dynamic rendering is required for: {chart_path}"
|
|
||||||
)
|
|
||||||
|
|
||||||
_log.info(f"Chart {idx + 1}:")
|
|
||||||
|
|
||||||
# Chart type
|
|
||||||
# _log.info(f"Type: {type(chart).__name__}")
|
|
||||||
print(f"Type: {type(chart).__name__}")
|
|
||||||
|
|
||||||
# Extract series data
|
|
||||||
for series_idx, series in enumerate(chart.series):
|
|
||||||
#_log.info(f"Series {series_idx + 1}:")
|
|
||||||
print(f"Series {series_idx + 1} type: {type(series).__name__}")
|
|
||||||
#print(f"x-values: {series.xVal}")
|
|
||||||
#print(f"y-values: {series.yVal}")
|
|
||||||
|
|
||||||
print(f"xval type: {type(series.xVal).__name__}")
|
|
||||||
|
|
||||||
xvals = []
|
|
||||||
for _ in series.xVal.numLit.pt:
|
|
||||||
print(f"xval type: {type(_).__name__}")
|
|
||||||
if hasattr(_, 'v'):
|
|
||||||
xvals.append(_.v)
|
|
||||||
|
|
||||||
print(f"x-values: {xvals}")
|
|
||||||
|
|
||||||
yvals = []
|
|
||||||
for _ in series.yVal:
|
|
||||||
if hasattr(_, 'v'):
|
|
||||||
yvals.append(_.v)
|
|
||||||
|
|
||||||
print(f"y-values: {yvals}")
|
|
||||||
|
|
||||||
except Exception as exc:
|
|
||||||
print(exc)
|
|
||||||
continue
|
|
||||||
"""
|
|
||||||
|
|
||||||
return doc
|
return doc
|
||||||
|
@ -346,7 +346,7 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
|
|||||||
end_row_offset_idx=row_idx + row_span,
|
end_row_offset_idx=row_idx + row_span,
|
||||||
start_col_offset_idx=col_idx,
|
start_col_offset_idx=col_idx,
|
||||||
end_col_offset_idx=col_idx + col_span,
|
end_col_offset_idx=col_idx + col_span,
|
||||||
col_header=False,
|
column_header=row_idx == 0,
|
||||||
row_header=False,
|
row_header=False,
|
||||||
)
|
)
|
||||||
if len(cell.text.strip()) > 0:
|
if len(cell.text.strip()) > 0:
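Review note: the HTML, Markdown, Excel, and PowerPoint hunks above all switch the keyword from `col_header` to `column_header`, and the last three now flag the first row as a header instead of passing `False`. A small sketch of that rule is shown below, assuming a hypothetical `Cell` dataclass rather than the docling-core table cell type.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical stand-in for a table cell; not the docling-core type."""

    text: str
    start_row_offset_idx: int
    start_col_offset_idx: int
    column_header: bool = False
    row_header: bool = False


def build_cells(rows):
    """Flatten a list of rows into cells, flagging row 0 as column headers."""
    cells = []
    for row_idx, row in enumerate(rows):
        for col_idx, text in enumerate(row):
            cells.append(
                Cell(
                    text=text,
                    start_row_offset_idx=row_idx,
                    start_col_offset_idx=col_idx,
                    # Same rule as the hunks above: only the first row is a header.
                    column_header=row_idx == 0,
                )
            )
    return cells


cells = build_cells([["name", "qty"], ["apple", "3"]])
header_texts = [c.text for c in cells if c.column_header]
print(header_texts)
```

This is a heuristic: formats such as Markdown or spreadsheets do not always mark header rows explicitly, so treating row 0 as the header is a reasonable default but not a guarantee.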
|
||||||
|
@ -26,6 +26,7 @@ from PIL import Image, UnidentifiedImageError
|
|||||||
from typing_extensions import override
|
from typing_extensions import override
|
||||||
|
|
||||||
from docling.backend.abstract_backend import DeclarativeDocumentBackend
|
from docling.backend.abstract_backend import DeclarativeDocumentBackend
|
||||||
|
from docling.backend.docx.latex.omml import oMath2Latex
|
||||||
from docling.datamodel.base_models import InputFormat
|
from docling.datamodel.base_models import InputFormat
|
||||||
from docling.datamodel.document import InputDocument
|
from docling.datamodel.document import InputDocument
|
||||||
|
|
||||||
@ -260,6 +261,25 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
else:
|
else:
|
||||||
return label, None
|
return label, None
|
||||||
|
|
||||||
|
def handle_equations_in_text(self, element, text):
|
||||||
|
only_texts = []
|
||||||
|
only_equations = []
|
||||||
|
texts_and_equations = []
|
||||||
|
for subt in element.iter():
|
||||||
|
tag_name = etree.QName(subt).localname
|
||||||
|
if tag_name == "t" and "math" not in subt.tag:
|
||||||
|
only_texts.append(subt.text)
|
||||||
|
texts_and_equations.append(subt.text)
|
||||||
|
elif "oMath" in subt.tag and "oMathPara" not in subt.tag:
|
||||||
|
latex_equation = str(oMath2Latex(subt))
|
||||||
|
only_equations.append(latex_equation)
|
||||||
|
texts_and_equations.append(latex_equation)
|
||||||
|
|
||||||
|
if "".join(only_texts) != text:
|
||||||
|
return text, []
|
||||||
|
|
||||||
|
return "".join(texts_and_equations), only_equations
|
||||||
|
|
||||||
def handle_text_elements(
|
def handle_text_elements(
|
||||||
self,
|
self,
|
||||||
element: BaseOxmlElement,
|
element: BaseOxmlElement,
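Review note: `handle_equations_in_text` walks the paragraph XML, collecting plain `w:t` text and converted `oMath` nodes in document order, and falls back to the paragraph text when the two views disagree. The sketch below reproduces that idea with the standard library only. `fake_omath_to_latex` is a hypothetical stand-in for `oMath2Latex`, and the function returns a pair on both paths because the caller unpacks two values.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def fake_omath_to_latex(omath):
    """Hypothetical stand-in for the converter: joins the math run texts."""
    return "".join(t.text or "" for t in omath.iter(f"{{{M_NS}}}t"))


def texts_and_equations(paragraph, plain_text):
    """Return (merged_text, equations); always a 2-tuple."""
    only_texts, equations, merged = [], [], []
    for node in paragraph.iter():
        if node.tag == f"{{{W_NS}}}t":
            # Ordinary text run (math runs use the m: namespace instead).
            only_texts.append(node.text or "")
            merged.append(node.text or "")
        elif node.tag == f"{{{M_NS}}}oMath":
            latex = fake_omath_to_latex(node)
            equations.append(latex)
            merged.append(latex)
    if "".join(only_texts) != plain_text:
        # The walk disagrees with the paragraph text: fall back to it.
        return plain_text, []
    return "".join(merged), equations


xml = (
    f'<w:p xmlns:w="{W_NS}" xmlns:m="{M_NS}">'
    "<w:r><w:t>Area: </w:t></w:r>"
    "<m:oMath><m:r><m:t>x^2</m:t></m:r></m:oMath>"
    "</w:p>"
)
merged, eqs = texts_and_equations(ET.fromstring(xml), "Area: ")
print(merged, eqs)
```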
|
||||||
@ -268,9 +288,12 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
) -> None:
|
) -> None:
|
||||||
paragraph = Paragraph(element, docx_obj)
|
paragraph = Paragraph(element, docx_obj)
|
||||||
|
|
||||||
if paragraph.text is None:
|
raw_text = paragraph.text
|
||||||
|
text, equations = self.handle_equations_in_text(element=element, text=raw_text)
|
||||||
|
|
||||||
|
if text is None:
|
||||||
return
|
return
|
||||||
text = paragraph.text.strip()
|
text = text.strip()
|
||||||
|
|
||||||
# Common styles for bullet and numbered lists.
|
# Common styles for bullet and numbered lists.
|
||||||
# "List Bullet", "List Number", "List Paragraph"
|
# "List Bullet", "List Number", "List Paragraph"
|
||||||
@ -323,6 +346,45 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
elif "Heading" in p_style_id:
|
elif "Heading" in p_style_id:
|
||||||
self.add_header(doc, p_level, text)
|
self.add_header(doc, p_level, text)
|
||||||
|
|
||||||
|
elif len(equations) > 0:
|
||||||
|
if (raw_text is None or len(raw_text) == 0) and len(text) > 0:
|
||||||
|
# Standalone equation
|
||||||
|
level = self.get_level()
|
||||||
|
doc.add_text(
|
||||||
|
label=DocItemLabel.FORMULA,
|
||||||
|
parent=self.parents[level - 1],
|
||||||
|
text=text,
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
# Inline equation
|
||||||
|
level = self.get_level()
|
||||||
|
inline_equation = doc.add_group(
|
||||||
|
label=GroupLabel.INLINE, parent=self.parents[level - 1]
|
||||||
|
)
|
||||||
|
text_tmp = text
|
||||||
|
for eq in equations:
|
||||||
|
if len(text_tmp) == 0:
|
||||||
|
break
|
||||||
|
pre_eq_text = text_tmp.split(eq, maxsplit=1)[0]
|
||||||
|
text_tmp = text_tmp.split(eq, maxsplit=1)[1]
|
||||||
|
if len(pre_eq_text) > 0:
|
||||||
|
doc.add_text(
|
||||||
|
label=DocItemLabel.PARAGRAPH,
|
||||||
|
parent=inline_equation,
|
||||||
|
text=pre_eq_text,
|
||||||
|
)
|
||||||
|
doc.add_text(
|
||||||
|
label=DocItemLabel.FORMULA,
|
||||||
|
parent=inline_equation,
|
||||||
|
text=eq,
|
||||||
|
)
|
||||||
|
if len(text_tmp) > 0:
|
||||||
|
doc.add_text(
|
||||||
|
label=DocItemLabel.PARAGRAPH,
|
||||||
|
parent=inline_equation,
|
||||||
|
text=text_tmp,
|
||||||
|
)
|
||||||
|
|
||||||
elif p_style_id in [
|
elif p_style_id in [
|
||||||
"Paragraph",
|
"Paragraph",
|
||||||
"Normal",
|
"Normal",
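Review note: the inline-equation branch above cuts the merged text around each equation string, emitting a paragraph item before each formula and a trailing paragraph item for the remainder. A sketch of that splitting step is given below. It is a variant, not the code in the hunk: it uses `str.partition` instead of indexing the result of `split(eq, maxsplit=1)`, so an equation that is missing from the remaining text is skipped rather than raising `IndexError`. The `split_inline` name and the segment tuples are hypothetical.

```python
def split_inline(text, equations):
    """Split text into ("text", s) and ("formula", s) segments, in order."""
    segments = []
    remaining = text
    for eq in equations:
        if not eq:
            # partition() rejects an empty separator, so skip empty equations.
            continue
        # partition returns (before, sep, after); sep is "" if eq is absent.
        before, sep, after = remaining.partition(eq)
        if not sep:
            # Equation not found in what is left: skip it instead of failing.
            continue
        if before:
            segments.append(("text", before))
        segments.append(("formula", eq))
        remaining = after
    if remaining:
        segments.append(("text", remaining))
    return segments


parts = split_inline("Let a^2+b^2 be c^2 here", ["a^2+b^2", "c^2"])
print(parts)
```

Whether a missing equation should be skipped or reported is a design choice; the sketch skips silently only to keep the example short.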
|
||||||
@ -539,7 +601,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
end_row_offset_idx=row.grid_cols_before + spanned_idx,
|
end_row_offset_idx=row.grid_cols_before + spanned_idx,
|
||||||
start_col_offset_idx=col_idx,
|
start_col_offset_idx=col_idx,
|
||||||
end_col_offset_idx=col_idx + cell.grid_span,
|
end_col_offset_idx=col_idx + cell.grid_span,
|
||||||
col_header=False,
|
column_header=row.grid_cols_before + row_idx == 0,
|
||||||
row_header=False,
|
row_header=False,
|
||||||
)
|
)
|
||||||
data.table_cells.append(table_cell)
|
data.table_cells.append(table_cell)
|
||||||
|
@ -121,7 +121,7 @@ def download(
|
|||||||
"Using the CLI:",
|
"Using the CLI:",
|
||||||
f"`docling --artifacts-path={output_dir} FILE`",
|
f"`docling --artifacts-path={output_dir} FILE`",
|
||||||
"\n",
|
"\n",
|
||||||
"Using Python: see the documentation at <https://ds4sd.github.io/docling/usage>.",
|
"Using Python: see the documentation at <https://docling-project.github.io/docling/usage>.",
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@ -27,7 +27,7 @@ class OcrMacModel(BaseOcrModel):
|
|||||||
"ocrmac is not correctly installed. "
|
"ocrmac is not correctly installed. "
|
||||||
"Please install it via `pip install ocrmac` to use this OCR engine. "
|
"Please install it via `pip install ocrmac` to use this OCR engine. "
|
||||||
"Alternatively, Docling has support for other OCR engines. See the documentation: "
|
"Alternatively, Docling has support for other OCR engines. See the documentation: "
|
||||||
"https://ds4sd.github.io/docling/installation/"
|
"https://docling-project.github.io/docling/installation/"
|
||||||
)
|
)
|
||||||
try:
|
try:
|
||||||
from ocrmac import ocrmac
|
from ocrmac import ocrmac
|
||||||
|
@ -32,14 +32,14 @@ class TesseractOcrModel(BaseOcrModel):
|
|||||||
"Note that tesserocr might have to be manually compiled for working with "
|
"Note that tesserocr might have to be manually compiled for working with "
|
||||||
"your Tesseract installation. The Docling documentation provides examples for it. "
|
"your Tesseract installation. The Docling documentation provides examples for it. "
|
||||||
"Alternatively, Docling has support for other OCR engines. See the documentation: "
|
"Alternatively, Docling has support for other OCR engines. See the documentation: "
|
||||||
"https://ds4sd.github.io/docling/installation/"
|
"https://docling-project.github.io/docling/installation/"
|
||||||
)
|
)
|
||||||
missing_langs_errmsg = (
|
missing_langs_errmsg = (
|
||||||
"tesserocr is not correctly configured. No language models have been detected. "
|
"tesserocr is not correctly configured. No language models have been detected. "
|
||||||
"Please ensure that the TESSDATA_PREFIX envvar points to tesseract languages dir. "
|
"Please ensure that the TESSDATA_PREFIX envvar points to tesseract languages dir. "
|
||||||
"You can find more information how to setup other OCR engines in Docling "
|
"You can find more information how to setup other OCR engines in Docling "
|
||||||
"documentation: "
|
"documentation: "
|
||||||
"https://ds4sd.github.io/docling/installation/"
|
"https://docling-project.github.io/docling/installation/"
|
||||||
)
|
)
|
||||||
|
|
||||||
try:
|
try:
|
||||||
|
@ -7,7 +7,7 @@ pydantic datatype, which can express several features common to documents, such
|
|||||||
* Layout information (i.e. bounding boxes) for all items, if available
|
* Layout information (i.e. bounding boxes) for all items, if available
|
||||||
* Provenance information
|
* Provenance information
|
||||||
|
|
||||||
The definition of the Pydantic types is implemented in the module `docling_core.types.doc`, more details in [source code definitions](https://github.com/DS4SD/docling-core/tree/main/docling_core/types/doc).
|
The definition of the Pydantic types is implemented in the module `docling_core.types.doc`, more details in [source code definitions](https://github.com/docling-project/docling-core/tree/main/docling_core/types/doc).
|
||||||
|
|
||||||
It also brings a set of document construction APIs to build up a `DoclingDocument` from scratch.
|
It also brings a set of document construction APIs to build up a `DoclingDocument` from scratch.
|
||||||
|
|
||||||
|
@ -4,7 +4,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/backend_xml_rag.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/backend_xml_rag.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -36,7 +36,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"This is an example of using [Docling](https://ds4sd.github.io/docling/) for converting structured data (XML) into a unified document\n",
|
"This is an example of using [Docling](https://docling-project.github.io/docling/) for converting structured data (XML) into a unified document\n",
|
||||||
"representation format, `DoclingDocument`, and leverage its rich structured content for RAG applications.\n",
|
"representation format, `DoclingDocument`, and leverage its rich structured content for RAG applications.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"Data used in this example consist of patents from the [United States Patent and Trademark Office (USPTO)](https://www.uspto.gov/) and medical\n",
|
"Data used in this example consist of patents from the [United States Patent and Trademark Office (USPTO)](https://www.uspto.gov/) and medical\n",
|
||||||
|
@ -103,7 +103,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"> 👉 **NOTE**: As you see above, using the `HybridChunker` can sometimes lead to a warning from the transformers library, however this is a \"false alarm\" — for details check [here](https://ds4sd.github.io/docling/faq/#hybridchunker-triggers-warning-token-indices-sequence-length-is-longer-than-the-specified-maximum-sequence-length-for-this-model)."
|
"> 👉 **NOTE**: As you see above, using the `HybridChunker` can sometimes lead to a warning from the transformers library, however this is a \"false alarm\" — for details check [here](https://docling-project.github.io/docling/faq/#hybridchunker-triggers-warning-token-indices-sequence-length-is-longer-than-the-specified-maximum-sequence-length-for-this-model)."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
|
@ -321,7 +321,7 @@
|
|||||||
],
|
],
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"kernelspec": {
|
"kernelspec": {
|
||||||
"display_name": "docling-aMWN2FRM-py3.12",
|
"display_name": "docling-hgXEfXco-py3.12",
|
||||||
"language": "python",
|
"language": "python",
|
||||||
"name": "python3"
|
"name": "python3"
|
||||||
},
|
},
|
||||||
|
@ -36,7 +36,7 @@
|
|||||||
"## A recipe 🧑🍳 🐥 💚\n",
|
"## A recipe 🧑🍳 🐥 💚\n",
|
||||||
"\n",
|
"\n",
|
||||||
"This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system using:\n",
|
"This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system using:\n",
|
||||||
"- [Docling](https://ds4sd.github.io/docling/) for document parsing and chunking\n",
|
"- [Docling](https://docling-project.github.io/docling/) for document parsing and chunking\n",
|
||||||
"- [Azure AI Search](https://azure.microsoft.com/products/ai-services/ai-search/?msockid=0109678bea39665431e37323ebff6723) for vector indexing and retrieval\n",
|
"- [Azure AI Search](https://azure.microsoft.com/products/ai-services/ai-search/?msockid=0109678bea39665431e37323ebff6723) for vector indexing and retrieval\n",
|
||||||
"- [Azure OpenAI](https://azure.microsoft.com/products/ai-services/openai-service?msockid=0109678bea39665431e37323ebff6723) for embeddings and chat completion\n",
|
"- [Azure OpenAI](https://azure.microsoft.com/products/ai-services/openai-service?msockid=0109678bea39665431e37323ebff6723) for embeddings and chat completion\n",
|
||||||
"\n",
|
"\n",
|
||||||
|
@ -4,7 +4,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_haystack.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_haystack.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -247,7 +247,7 @@
|
|||||||
"name": "stderr",
|
"name": "stderr",
|
||||||
"output_type": "stream",
|
"output_type": "stream",
|
||||||
"text": [
|
"text": [
|
||||||
"/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.12/site-packages/huggingface_hub/inference/_client.py:2232: FutureWarning: `stop_sequences` is a deprecated argument for `text_generation` task and will be removed in version '0.28.0'. Use `stop` instead.\n",
|
"/Users/pva/work/github.com/docling-project/docling/.venv/lib/python3.12/site-packages/huggingface_hub/inference/_client.py:2232: FutureWarning: `stop_sequences` is a deprecated argument for `text_generation` task and will be removed in version '0.28.0'. Use `stop` instead.\n",
|
||||||
" warnings.warn(\n"
|
" warnings.warn(\n"
|
||||||
]
|
]
|
||||||
}
|
}
|
||||||
|
@ -4,7 +4,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_langchain.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_langchain.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -168,7 +168,7 @@
|
|||||||
"source": [
|
"source": [
|
||||||
"> Note: a message saying `\"Token indices sequence length is longer than the specified\n",
|
"> Note: a message saying `\"Token indices sequence length is longer than the specified\n",
|
||||||
"maximum sequence length...\"` can be ignored in this case — details\n",
|
"maximum sequence length...\"` can be ignored in this case — details\n",
|
||||||
"[here](https://github.com/DS4SD/docling-core/issues/119#issuecomment-2577418826)."
|
"[here](https://github.com/docling-project/docling-core/issues/119#issuecomment-2577418826)."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
|
@ -4,7 +4,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_llamaindex.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_llamaindex.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
|
@ -4,7 +4,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"[](https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_weaviate.ipynb)"
|
"[](https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_weaviate.ipynb)"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -29,7 +29,7 @@
|
|||||||
"\n",
|
"\n",
|
||||||
"## A recipe 🧑🍳 🐥 💚\n",
|
"## A recipe 🧑🍳 🐥 💚\n",
|
||||||
"\n",
|
"\n",
|
||||||
"This is a code recipe that uses [Weaviate](https://weaviate.io/) to perform RAG over PDF documents parsed by [Docling](https://ds4sd.github.io/docling/).\n",
|
"This is a code recipe that uses [Weaviate](https://weaviate.io/) to perform RAG over PDF documents parsed by [Docling](https://docling-project.github.io/docling/).\n",
|
||||||
"\n",
|
"\n",
|
||||||
"In this notebook, we accomplish the following:\n",
|
"In this notebook, we accomplish the following:\n",
|
||||||
"* Parse the top machine learning papers on [arXiv](https://arxiv.org/) using Docling\n",
|
"* Parse the top machine learning papers on [arXiv](https://arxiv.org/) using Docling\n",
|
||||||
|
@ -4,7 +4,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/hybrid_rag_qdrant\n",
|
"<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/hybrid_rag_qdrant\n",
|
||||||
".ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
".ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
@ -109,7 +109,7 @@
|
|||||||
"name": "stderr",
|
"name": "stderr",
|
||||||
"output_type": "stream",
|
"output_type": "stream",
|
||||||
"text": [
|
"text": [
|
||||||
"/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.12/site-packages/huggingface_hub/utils/tqdm.py:155: UserWarning: Cannot enable progress bars: environment variable `HF_HUB_DISABLE_PROGRESS_BARS=1` is set and has priority.\n",
|
"/Users/pva/work/github.com/docling-project/docling/.venv/lib/python3.12/site-packages/huggingface_hub/utils/tqdm.py:155: UserWarning: Cannot enable progress bars: environment variable `HF_HUB_DISABLE_PROGRESS_BARS=1` is set and has priority.\n",
|
||||||
" warnings.warn(\n"
|
" warnings.warn(\n"
|
||||||
]
|
]
|
||||||
}
|
}
|
||||||
|
@ -1,6 +1,6 @@
|
|||||||
# FAQ
|
# FAQ
|
||||||
|
|
||||||
This is a collection of FAQ collected from the user questions on <https://github.com/DS4SD/docling/discussions>.
|
This is a collection of FAQ collected from the user questions on <https://github.com/docling-project/docling/discussions>.
|
||||||
|
|
||||||
|
|
||||||
??? question "Is Python 3.13 supported?"
|
??? question "Is Python 3.13 supported?"
|
||||||
@ -41,7 +41,7 @@ This is a collection of FAQ collected from the user questions on <https://github
|
|||||||
]
|
]
|
||||||
```
|
```
|
||||||
|
|
||||||
Source: Issue [#283](https://github.com/DS4SD/docling/issues/283#issuecomment-2465035868)
|
Source: Issue [#283](https://github.com/docling-project/docling/issues/283#issuecomment-2465035868)
|
||||||
|
|
||||||
|
|
||||||
??? question "Are text styles (bold, underline, etc) supported?"
|
??? question "Are text styles (bold, underline, etc) supported?"
|
||||||
@ -74,7 +74,7 @@ This is a collection of FAQ collected from the user questions on <https://github
|
|||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
Source: Issue [#326](https://github.com/DS4SD/docling/issues/326)
|
Source: Issue [#326](https://github.com/docling-project/docling/issues/326)
|
||||||
|
|
||||||
|
|
||||||
??? question " Which model weights are needed to run Docling?"
|
??? question " Which model weights are needed to run Docling?"
|
||||||
@ -84,7 +84,7 @@ This is a collection of FAQ collected from the user questions on <https://github
|
|||||||
|
|
||||||
For processing PDF documents, Docling requires the model weights from <https://huggingface.co/ds4sd/docling-models>.
|
For processing PDF documents, Docling requires the model weights from <https://huggingface.co/ds4sd/docling-models>.
|
||||||
|
|
||||||
When OCR is enabled, some engines also require model artifacts. For example EasyOCR, for which Docling has [special pipeline options](https://github.com/DS4SD/docling/blob/main/docling/datamodel/pipeline_options.py#L68) to control the runtime behavior.
|
When OCR is enabled, some engines also require model artifacts. For example EasyOCR, for which Docling has [special pipeline options](https://github.com/docling-project/docling/blob/main/docling/datamodel/pipeline_options.py#L68) to control the runtime behavior.
|
||||||
|
|
||||||
|
|
||||||
??? question "SSL error downloading model weights"
|
??? question "SSL error downloading model weights"
|
||||||
@ -174,6 +174,6 @@ This is a collection of FAQ collected from the user questions on <https://github
|
|||||||
print(f"Model max length: {tokenizer.model_max_length}")
|
print(f"Model max length: {tokenizer.model_max_length}")
|
||||||
```
|
```
|
||||||
|
|
||||||
Also see [docling#725](https://github.com/DS4SD/docling/issues/725).
|
Also see [docling#725](https://github.com/docling-project/docling/issues/725).
|
||||||
|
|
||||||
Source: Issue [docling-core#119](https://github.com/DS4SD/docling-core/issues/119)
|
Source: Issue [docling-core#119](https://github.com/docling-project/docling-core/issues/119)
|
||||||
|
@ -11,7 +11,7 @@
|
|||||||
[](https://pycqa.github.io/isort/)
|
[](https://pycqa.github.io/isort/)
|
||||||
[](https://pydantic.dev)
|
[](https://pydantic.dev)
|
||||||
[](https://github.com/pre-commit/pre-commit)
|
[](https://github.com/pre-commit/pre-commit)
|
||||||
[](https://opensource.org/licenses/MIT)
|
[](https://opensource.org/licenses/MIT)
|
||||||
[](https://pepy.tech/projects/docling)
|
[](https://pepy.tech/projects/docling)
|
||||||
|
|
||||||
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
|
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
|
||||||
|
@ -5,7 +5,7 @@ Docling is available as a converter in [Haystack](https://haystack.deepset.ai/):
|
|||||||
- 🧑🏽🍳 [Docling Haystack integration example][example]
|
- 🧑🏽🍳 [Docling Haystack integration example][example]
|
||||||
- 📦 [Docling Haystack integration PyPI][pypi]
|
- 📦 [Docling Haystack integration PyPI][pypi]
|
||||||
|
|
||||||
[github]: https://github.com/DS4SD/docling-haystack
|
[github]: https://github.com/docling-project/docling-haystack
|
||||||
[docs]: https://haystack.deepset.ai/integrations/docling
|
[docs]: https://haystack.deepset.ai/integrations/docling
|
||||||
[pypi]: https://pypi.org/project/docling-haystack
|
[pypi]: https://pypi.org/project/docling-haystack
|
||||||
[example]: ../examples/rag_haystack.ipynb
|
[example]: ../examples/rag_haystack.ipynb
|
||||||
|
@ -8,7 +8,7 @@ To get started, check out the [step-by-step guide in LangChain][guide].
|
|||||||
- 📦 [LangChain Docling integration PyPI][pypi]
|
- 📦 [LangChain Docling integration PyPI][pypi]
|
||||||
|
|
||||||
[docs]: https://python.langchain.com/docs/integrations/providers/docling/
|
[docs]: https://python.langchain.com/docs/integrations/providers/docling/
|
||||||
[github]: https://github.com/DS4SD/docling-langchain
|
[github]: https://github.com/docling-project/docling-langchain
|
||||||
[guide]: https://python.langchain.com/docs/integrations/document_loaders/docling/
|
[guide]: https://python.langchain.com/docs/integrations/document_loaders/docling/
|
||||||
[example]: ../examples/rag_langchain.ipynb
|
[example]: ../examples/rag_langchain.ipynb
|
||||||
[pypi]: https://pypi.org/project/langchain-docling/
|
[pypi]: https://pypi.org/project/langchain-docling/
|
||||||
|
@ -1,7 +1,7 @@
|
|||||||
site_name: Docling
|
site_name: Docling
|
||||||
site_url: https://ds4sd.github.io/docling/
|
site_url: https://docling-project.github.io/docling/
|
||||||
repo_name: DS4SD/docling
|
repo_name: docling-project/docling
|
||||||
repo_url: https://github.com/DS4SD/docling
|
repo_url: https://github.com/docling-project/docling
|
||||||
|
|
||||||
theme:
|
theme:
|
||||||
name: material
|
name: material
|
||||||
|
42
poetry.lock
generated
@ -946,8 +946,8 @@ tabulate = ">=0.9.0,<1.0.0"
|
|||||||
[package.source]
|
[package.source]
|
||||||
type = "git"
|
type = "git"
|
||||||
url = "https://github.com/DS4SD/docling-parse"
|
url = "https://github.com/DS4SD/docling-parse"
|
||||||
reference = "cau/api-move-to-docling-core"
|
reference = "main"
|
||||||
resolved_reference = "6d573965abf6e3492b5d41d5cfc52ccd77ab100c"
|
resolved_reference = "a655bc9d59c287661111f6f3d351d61f2239bd86"
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "docutils"
|
name = "docutils"
|
||||||
@ -1064,13 +1064,13 @@ devel = ["colorama", "json-spec", "jsonschema", "pylint", "pytest", "pytest-benc
|
|||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "filelock"
|
name = "filelock"
|
||||||
version = "3.17.0"
|
version = "3.18.0"
|
||||||
description = "A platform independent file lock."
|
description = "A platform independent file lock."
|
||||||
optional = false
|
optional = false
|
||||||
python-versions = ">=3.9"
|
python-versions = ">=3.9"
|
||||||
files = [
|
files = [
|
||||||
{file = "filelock-3.17.0-py3-none-any.whl", hash = "sha256:533dc2f7ba78dc2f0f531fc6c4940addf7b70a481e269a5a3b93be94ffbe8338"},
|
{file = "filelock-3.18.0-py3-none-any.whl", hash = "sha256:c401f4f8377c4464e6db25fff06205fd89bdd83b65eb0488ed1b160f780e21de"},
|
||||||
{file = "filelock-3.17.0.tar.gz", hash = "sha256:ee4e77401ef576ebb38cd7f13b9b28893194acc20a8e68e18730ba9c0e54660e"},
|
{file = "filelock-3.18.0.tar.gz", hash = "sha256:adbc88eabb99d2fec8c9c1b229b171f18afa655400173ddc653d5d01501fb9f2"},
|
||||||
]
|
]
|
||||||
|
|
||||||
[package.extras]
|
[package.extras]
|
||||||
@ -4413,20 +4413,20 @@ files = [
|
|||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "protobuf"
|
name = "protobuf"
|
||||||
version = "6.30.0"
|
version = "6.30.1"
|
||||||
description = ""
|
description = ""
|
||||||
optional = false
|
optional = false
|
||||||
python-versions = ">=3.9"
|
python-versions = ">=3.9"
|
||||||
files = [
|
files = [
|
||||||
{file = "protobuf-6.30.0-cp310-abi3-win32.whl", hash = "sha256:7337d76d8efe65ee09ee566b47b5914c517190196f414e5418fa236dfd1aed3e"},
|
{file = "protobuf-6.30.1-cp310-abi3-win32.whl", hash = "sha256:ba0706f948d0195f5cac504da156d88174e03218d9364ab40d903788c1903d7e"},
|
||||||
{file = "protobuf-6.30.0-cp310-abi3-win_amd64.whl", hash = "sha256:9b33d51cc95a7ec4f407004c8b744330b6911a37a782e2629c67e1e8ac41318f"},
|
{file = "protobuf-6.30.1-cp310-abi3-win_amd64.whl", hash = "sha256:ed484f9ddd47f0f1bf0648806cccdb4fe2fb6b19820f9b79a5adf5dcfd1b8c5f"},
|
||||||
{file = "protobuf-6.30.0-cp39-abi3-macosx_10_9_universal2.whl", hash = "sha256:52d4bb6fe76005860e1d0b8bfa126f5c97c19cc82704961f60718f50be16942d"},
|
{file = "protobuf-6.30.1-cp39-abi3-macosx_10_9_universal2.whl", hash = "sha256:aa4f7dfaed0d840b03d08d14bfdb41348feaee06a828a8c455698234135b4075"},
|
||||||
{file = "protobuf-6.30.0-cp39-abi3-manylinux2014_aarch64.whl", hash = "sha256:7940ab4dfd60d514b2e1d3161549ea7aed5be37d53bafde16001ac470a3e202b"},
|
{file = "protobuf-6.30.1-cp39-abi3-manylinux2014_aarch64.whl", hash = "sha256:47cd320b7db63e8c9ac35f5596ea1c1e61491d8a8eb6d8b45edc44760b53a4f6"},
|
||||||
{file = "protobuf-6.30.0-cp39-abi3-manylinux2014_x86_64.whl", hash = "sha256:d79bf6a202a536b192b7e8d295d7eece0c86fbd9b583d147faf8cfeff46bf598"},
|
{file = "protobuf-6.30.1-cp39-abi3-manylinux2014_x86_64.whl", hash = "sha256:e3083660225fa94748ac2e407f09a899e6a28bf9c0e70c75def8d15706bf85fc"},
|
||||||
{file = "protobuf-6.30.0-cp39-cp39-win32.whl", hash = "sha256:bb35ad251d222f03d6c4652c072dfee156be0ef9578373929c1a7ead2bd5492c"},
|
{file = "protobuf-6.30.1-cp39-cp39-win32.whl", hash = "sha256:554d7e61cce2aa4c63ca27328f757a9f3867bce8ec213bf09096a8d16bcdcb6a"},
|
||||||
{file = "protobuf-6.30.0-cp39-cp39-win_amd64.whl", hash = "sha256:501810e0eba1d327e783fde47cc767a563b0f1c292f1a3546d4f2b8c3612d4d0"},
|
{file = "protobuf-6.30.1-cp39-cp39-win_amd64.whl", hash = "sha256:b510f55ce60f84dc7febc619b47215b900466e3555ab8cb1ba42deb4496d6cc0"},
|
||||||
{file = "protobuf-6.30.0-py3-none-any.whl", hash = "sha256:e5ef216ea061b262b8994cb6b7d6637a4fb27b3fb4d8e216a6040c0b93bd10d7"},
|
{file = "protobuf-6.30.1-py3-none-any.whl", hash = "sha256:3c25e51e1359f1f5fa3b298faa6016e650d148f214db2e47671131b9063c53be"},
|
||||||
{file = "protobuf-6.30.0.tar.gz", hash = "sha256:852b675d276a7d028f660da075af1841c768618f76b90af771a8e2c29e6f5965"},
|
{file = "protobuf-6.30.1.tar.gz", hash = "sha256:535fb4e44d0236893d5cf1263a0f706f1160b689a7ab962e9da8a9ce4050b780"},
|
||||||
]
|
]
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
@ -4789,6 +4789,16 @@ files = [
|
|||||||
[package.extras]
|
[package.extras]
|
||||||
windows-terminal = ["colorama (>=0.4.6)"]
|
windows-terminal = ["colorama (>=0.4.6)"]
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "pylatexenc"
|
||||||
|
version = "2.10"
|
||||||
|
description = "Simple LaTeX parser providing latex-to-unicode and unicode-to-latex conversion"
|
||||||
|
optional = false
|
||||||
|
python-versions = "*"
|
||||||
|
files = [
|
||||||
|
{file = "pylatexenc-2.10.tar.gz", hash = "sha256:3dd8fd84eb46dc30bee1e23eaab8d8fb5a7f507347b23e5f38ad9675c84f40d3"},
|
||||||
|
]
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "pylint"
|
name = "pylint"
|
||||||
version = "2.17.7"
|
version = "2.17.7"
|
||||||
@ -7806,4 +7816,4 @@ vlm = ["accelerate", "transformers", "transformers"]
|
|||||||
[metadata]
|
[metadata]
|
||||||
lock-version = "2.0"
|
lock-version = "2.0"
|
||||||
python-versions = "^3.9"
|
python-versions = "^3.9"
|
||||||
content-hash = "da6afcbfeefb3a45560d4098c5a1345333fc833fd13e6408aacb06c6d18317f0"
|
content-hash = "86d3894f8f998af4b7f766ec5060f9f64d532d9b6611d4836271bc0fdfd796c7"
|
||||||
|
@ -2,13 +2,33 @@
|
|||||||
name = "docling"
|
name = "docling"
|
||||||
version = "2.26.0" # DO NOT EDIT, updated automatically
|
version = "2.26.0" # DO NOT EDIT, updated automatically
|
||||||
description = "SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications."
|
description = "SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications."
|
||||||
authors = ["Christoph Auer <cau@zurich.ibm.com>", "Michele Dolfi <dol@zurich.ibm.com>", "Maxim Lysak <mly@zurich.ibm.com>", "Nikos Livathinos <nli@zurich.ibm.com>", "Ahmed Nassar <ahn@zurich.ibm.com>", "Panos Vagenas <pva@zurich.ibm.com>", "Peter Staar <taa@zurich.ibm.com>"]
|
authors = [
|
||||||
|
"Christoph Auer <cau@zurich.ibm.com>",
|
||||||
|
"Michele Dolfi <dol@zurich.ibm.com>",
|
||||||
|
"Maxim Lysak <mly@zurich.ibm.com>",
|
||||||
|
"Nikos Livathinos <nli@zurich.ibm.com>",
|
||||||
|
"Ahmed Nassar <ahn@zurich.ibm.com>",
|
||||||
|
"Panos Vagenas <pva@zurich.ibm.com>",
|
||||||
|
"Peter Staar <taa@zurich.ibm.com>",
|
||||||
|
]
|
||||||
license = "MIT"
|
license = "MIT"
|
||||||
readme = "README.md"
|
readme = "README.md"
|
||||||
repository = "https://github.com/DS4SD/docling"
|
repository = "https://github.com/docling-project/docling"
|
||||||
homepage = "https://github.com/DS4SD/docling"
|
homepage = "https://github.com/docling-project/docling"
|
||||||
keywords= ["docling", "convert", "document", "pdf", "docx", "html", "markdown", "layout model", "segmentation", "table structure", "table former"]
|
keywords = [
|
||||||
classifiers = [
|
"docling",
|
||||||
|
"convert",
|
||||||
|
"document",
|
||||||
|
"pdf",
|
||||||
|
"docx",
|
||||||
|
"html",
|
||||||
|
"markdown",
|
||||||
|
"layout model",
|
||||||
|
"segmentation",
|
||||||
|
"table structure",
|
||||||
|
"table former",
|
||||||
|
]
|
||||||
|
classifiers = [
|
||||||
"License :: OSI Approved :: MIT License",
|
"License :: OSI Approved :: MIT License",
|
||||||
"Operating System :: MacOS :: MacOS X",
|
"Operating System :: MacOS :: MacOS X",
|
||||||
"Operating System :: POSIX :: Linux",
|
"Operating System :: POSIX :: Linux",
|
||||||
@ -16,9 +36,9 @@ keywords= ["docling", "convert", "document", "pdf", "docx", "html", "markdown",
|
|||||||
"Intended Audience :: Developers",
|
"Intended Audience :: Developers",
|
||||||
"Intended Audience :: Science/Research",
|
"Intended Audience :: Science/Research",
|
||||||
"Topic :: Scientific/Engineering :: Artificial Intelligence",
|
"Topic :: Scientific/Engineering :: Artificial Intelligence",
|
||||||
"Programming Language :: Python :: 3"
|
"Programming Language :: Python :: 3",
|
||||||
]
|
]
|
||||||
packages = [{include = "docling"}]
|
packages = [{ include = "docling" }]
|
||||||
|
|
||||||
[tool.poetry.dependencies]
|
[tool.poetry.dependencies]
|
||||||
######################
|
######################
|
||||||
@ -28,7 +48,7 @@ python = "^3.9"
|
|||||||
pydantic = "^2.0.0"
|
pydantic = "^2.0.0"
|
||||||
docling-core = {extras = ["chunking"], version = "^2.23.0"}
|
docling-core = {extras = ["chunking"], version = "^2.23.0"}
|
||||||
docling-ibm-models = "^3.4.0"
|
docling-ibm-models = "^3.4.0"
|
||||||
docling-parse = {git = "https://github.com/DS4SD/docling-parse", rev = "cau/api-move-to-docling-core"}
|
docling-parse = {git = "https://github.com/DS4SD/docling-parse", rev = "main"}
|
||||||
filetype = "^1.2.0"
|
filetype = "^1.2.0"
|
||||||
pypdfium2 = "^4.30.0"
|
pypdfium2 = "^4.30.0"
|
||||||
pydantic-settings = "^2.3.0"
|
pydantic-settings = "^2.3.0"
|
||||||
@ -40,7 +60,7 @@ certifi = ">=2024.7.4"
|
|||||||
rtree = "^1.3.0"
|
rtree = "^1.3.0"
|
||||||
scipy = [
|
scipy = [
|
||||||
{ version = "^1.6.0", markers = "python_version >= '3.10'" },
|
{ version = "^1.6.0", markers = "python_version >= '3.10'" },
|
||||||
{ version = ">=1.6.0,<1.14.0", markers = "python_version < '3.10'" }
|
{ version = ">=1.6.0,<1.14.0", markers = "python_version < '3.10'" },
|
||||||
]
|
]
|
||||||
typer = "^0.12.5"
|
typer = "^0.12.5"
|
||||||
python-docx = "^1.1.2"
|
python-docx = "^1.1.2"
|
||||||
@ -56,21 +76,22 @@ onnxruntime = [
|
|||||||
# 1.19.2 is the last version with python3.9 support,
|
# 1.19.2 is the last version with python3.9 support,
|
||||||
# see https://github.com/microsoft/onnxruntime/releases/tag/v1.20.0
|
# see https://github.com/microsoft/onnxruntime/releases/tag/v1.20.0
|
||||||
{ version = ">=1.7.0,<1.20.0", optional = true, markers = "python_version < '3.10'" },
|
{ version = ">=1.7.0,<1.20.0", optional = true, markers = "python_version < '3.10'" },
|
||||||
{ version = "^1.7.0", optional = true, markers = "python_version >= '3.10'" }
|
{ version = "^1.7.0", optional = true, markers = "python_version >= '3.10'" },
|
||||||
]
|
]
|
||||||
|
|
||||||
transformers = [
|
transformers = [
|
||||||
{markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^4.46.0", optional = true },
|
{ markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^4.46.0", optional = true },
|
||||||
{markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~4.42.0", optional = true }
|
{ markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~4.42.0", optional = true },
|
||||||
]
|
]
|
||||||
accelerate = [
|
accelerate = [
|
||||||
{markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^1.2.1", optional = true },
|
{ markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^1.2.1", optional = true },
|
||||||
]
|
]
|
||||||
pillow = ">=10.0.0,<12.0.0"
|
pillow = ">=10.0.0,<12.0.0"
|
||||||
tqdm = "^4.65.0"
|
tqdm = "^4.65.0"
|
||||||
|
pylatexenc = "^2.10"
|
||||||
|
|
||||||
[tool.poetry.group.dev.dependencies]
|
[tool.poetry.group.dev.dependencies]
|
||||||
black = {extras = ["jupyter"], version = "^24.4.2"}
|
black = { extras = ["jupyter"], version = "^24.4.2" }
|
||||||
pytest = "^7.2.2"
|
pytest = "^7.2.2"
|
||||||
pre-commit = "^3.7.1"
|
pre-commit = "^3.7.1"
|
||||||
mypy = "^1.10.1"
|
mypy = "^1.10.1"
|
||||||
@ -93,7 +114,7 @@ types-tqdm = "^4.67.0.20241221"
|
|||||||
mkdocs-material = "^9.5.40"
|
mkdocs-material = "^9.5.40"
|
||||||
mkdocs-jupyter = "^0.25.0"
|
mkdocs-jupyter = "^0.25.0"
|
||||||
mkdocs-click = "^0.8.1"
|
mkdocs-click = "^0.8.1"
|
||||||
mkdocstrings = {extras = ["python"], version = "^0.27.0"}
|
mkdocstrings = { extras = ["python"], version = "^0.27.0" }
|
||||||
griffe-pydantic = "^1.1.0"
|
griffe-pydantic = "^1.1.0"
|
||||||
|
|
||||||
[tool.poetry.group.examples.dependencies]
|
[tool.poetry.group.examples.dependencies]
|
||||||
@ -117,12 +138,12 @@ optional = true
|
|||||||
|
|
||||||
[tool.poetry.group.mac_intel.dependencies]
|
[tool.poetry.group.mac_intel.dependencies]
|
||||||
torch = [
|
torch = [
|
||||||
{markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^2.2.2"},
|
{ markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^2.2.2" },
|
||||||
{markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~2.2.2"}
|
{ markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~2.2.2" },
|
||||||
]
|
]
|
||||||
torchvision = [
|
torchvision = [
|
||||||
{markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^0"},
|
{ markers = "sys_platform != 'darwin' or platform_machine != 'x86_64'", version = "^0" },
|
||||||
{markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~0.17.2"}
|
{ markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'", version = "~0.17.2" },
|
||||||
]
|
]
|
||||||
|
|
||||||
[tool.poetry.extras]
|
[tool.poetry.extras]
|
||||||
@ -147,7 +168,7 @@ include = '\.pyi?$'
|
|||||||
[tool.isort]
|
[tool.isort]
|
||||||
profile = "black"
|
profile = "black"
|
||||||
line_length = 88
|
line_length = 88
|
||||||
py_version=39
|
py_version = 39
|
||||||
|
|
||||||
[tool.mypy]
|
[tool.mypy]
|
||||||
pretty = true
|
pretty = true
|
||||||
@ -170,6 +191,7 @@ module = [
|
|||||||
"lxml.*",
|
"lxml.*",
|
||||||
"huggingface_hub.*",
|
"huggingface_hub.*",
|
||||||
"transformers.*",
|
"transformers.*",
|
||||||
|
"pylatexenc.*",
|
||||||
]
|
]
|
||||||
ignore_missing_imports = true
|
ignore_missing_imports = true
|
||||||
|
|
||||||
|
BIN
tests/data/docx/equations.docx
Normal file
Binary file not shown.
@ -51,7 +51,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "1",
|
"text": "1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -63,7 +63,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "2",
|
"text": "2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -75,7 +75,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "3",
|
"text": "3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -87,7 +87,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "4",
|
"text": "4",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -296,7 +296,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "1",
|
"text": "1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -308,7 +308,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "2",
|
"text": "2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -320,7 +320,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "3",
|
"text": "3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -332,7 +332,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "4",
|
"text": "4",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
|
@ -51,7 +51,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Index",
|
"text": "Index",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -63,7 +63,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Customer Id",
|
"text": "Customer Id",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -75,7 +75,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "First Name",
|
"text": "First Name",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -87,7 +87,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "Last Name",
|
"text": "Last Name",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -99,7 +99,7 @@
|
|||||||
"start_col_offset_idx": 4,
|
"start_col_offset_idx": 4,
|
||||||
"end_col_offset_idx": 5,
|
"end_col_offset_idx": 5,
|
||||||
"text": "Company",
|
"text": "Company",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -111,7 +111,7 @@
|
|||||||
"start_col_offset_idx": 5,
|
"start_col_offset_idx": 5,
|
||||||
"end_col_offset_idx": 6,
|
"end_col_offset_idx": 6,
|
||||||
"text": "City",
|
"text": "City",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -123,7 +123,7 @@
|
|||||||
"start_col_offset_idx": 6,
|
"start_col_offset_idx": 6,
|
||||||
"end_col_offset_idx": 7,
|
"end_col_offset_idx": 7,
|
||||||
"text": "Country",
|
"text": "Country",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -135,7 +135,7 @@
|
|||||||
"start_col_offset_idx": 7,
|
"start_col_offset_idx": 7,
|
||||||
"end_col_offset_idx": 8,
|
"end_col_offset_idx": 8,
|
||||||
"text": "Phone 1",
|
"text": "Phone 1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -147,7 +147,7 @@
|
|||||||
"start_col_offset_idx": 8,
|
"start_col_offset_idx": 8,
|
||||||
"end_col_offset_idx": 9,
|
"end_col_offset_idx": 9,
|
||||||
"text": "Phone 2",
|
"text": "Phone 2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -159,7 +159,7 @@
|
|||||||
"start_col_offset_idx": 9,
|
"start_col_offset_idx": 9,
|
||||||
"end_col_offset_idx": 10,
|
"end_col_offset_idx": 10,
|
||||||
"text": "Email",
|
"text": "Email",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -171,7 +171,7 @@
|
|||||||
"start_col_offset_idx": 10,
|
"start_col_offset_idx": 10,
|
||||||
"end_col_offset_idx": 11,
|
"end_col_offset_idx": 11,
|
||||||
"text": "Subscription Date",
|
"text": "Subscription Date",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -183,7 +183,7 @@
|
|||||||
"start_col_offset_idx": 11,
|
"start_col_offset_idx": 11,
|
||||||
"end_col_offset_idx": 12,
|
"end_col_offset_idx": 12,
|
||||||
"text": "Website",
|
"text": "Website",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -920,7 +920,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Index",
|
"text": "Index",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -932,7 +932,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Customer Id",
|
"text": "Customer Id",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -944,7 +944,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "First Name",
|
"text": "First Name",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -956,7 +956,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "Last Name",
|
"text": "Last Name",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -968,7 +968,7 @@
|
|||||||
"start_col_offset_idx": 4,
|
"start_col_offset_idx": 4,
|
||||||
"end_col_offset_idx": 5,
|
"end_col_offset_idx": 5,
|
||||||
"text": "Company",
|
"text": "Company",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -980,7 +980,7 @@
|
|||||||
"start_col_offset_idx": 5,
|
"start_col_offset_idx": 5,
|
||||||
"end_col_offset_idx": 6,
|
"end_col_offset_idx": 6,
|
||||||
"text": "City",
|
"text": "City",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -992,7 +992,7 @@
|
|||||||
"start_col_offset_idx": 6,
|
"start_col_offset_idx": 6,
|
||||||
"end_col_offset_idx": 7,
|
"end_col_offset_idx": 7,
|
||||||
"text": "Country",
|
"text": "Country",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1004,7 +1004,7 @@
|
|||||||
"start_col_offset_idx": 7,
|
"start_col_offset_idx": 7,
|
||||||
"end_col_offset_idx": 8,
|
"end_col_offset_idx": 8,
|
||||||
"text": "Phone 1",
|
"text": "Phone 1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1016,7 +1016,7 @@
|
|||||||
"start_col_offset_idx": 8,
|
"start_col_offset_idx": 8,
|
||||||
"end_col_offset_idx": 9,
|
"end_col_offset_idx": 9,
|
||||||
"text": "Phone 2",
|
"text": "Phone 2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1028,7 +1028,7 @@
|
|||||||
"start_col_offset_idx": 9,
|
"start_col_offset_idx": 9,
|
||||||
"end_col_offset_idx": 10,
|
"end_col_offset_idx": 10,
|
||||||
"text": "Email",
|
"text": "Email",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1040,7 +1040,7 @@
|
|||||||
"start_col_offset_idx": 10,
|
"start_col_offset_idx": 10,
|
||||||
"end_col_offset_idx": 11,
|
"end_col_offset_idx": 11,
|
||||||
"text": "Subscription Date",
|
"text": "Subscription Date",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1052,7 +1052,7 @@
|
|||||||
"start_col_offset_idx": 11,
|
"start_col_offset_idx": 11,
|
||||||
"end_col_offset_idx": 12,
|
"end_col_offset_idx": 12,
|
||||||
"text": "Website",
|
"text": "Website",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
|
@ -51,7 +51,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "1",
|
"text": "1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -63,7 +63,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "2",
|
"text": "2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -75,7 +75,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "3",
|
"text": "3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -284,7 +284,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "1",
|
"text": "1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -296,7 +296,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "2",
|
"text": "2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -308,7 +308,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "3",
|
"text": "3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
|
@ -51,7 +51,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Index",
|
"text": "Index",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -63,7 +63,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Customer Id",
|
"text": "Customer Id",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -75,7 +75,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "First Name",
|
"text": "First Name",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -87,7 +87,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "Last Name",
|
"text": "Last Name",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -99,7 +99,7 @@
|
|||||||
"start_col_offset_idx": 4,
|
"start_col_offset_idx": 4,
|
||||||
"end_col_offset_idx": 5,
|
"end_col_offset_idx": 5,
|
||||||
"text": "Company",
|
"text": "Company",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -111,7 +111,7 @@
|
|||||||
"start_col_offset_idx": 5,
|
"start_col_offset_idx": 5,
|
||||||
"end_col_offset_idx": 6,
|
"end_col_offset_idx": 6,
|
||||||
"text": "City",
|
"text": "City",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -123,7 +123,7 @@
|
|||||||
"start_col_offset_idx": 6,
|
"start_col_offset_idx": 6,
|
||||||
"end_col_offset_idx": 7,
|
"end_col_offset_idx": 7,
|
||||||
"text": "Country",
|
"text": "Country",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -135,7 +135,7 @@
|
|||||||
"start_col_offset_idx": 7,
|
"start_col_offset_idx": 7,
|
||||||
"end_col_offset_idx": 8,
|
"end_col_offset_idx": 8,
|
||||||
"text": "Phone 1",
|
"text": "Phone 1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -147,7 +147,7 @@
|
|||||||
"start_col_offset_idx": 8,
|
"start_col_offset_idx": 8,
|
||||||
"end_col_offset_idx": 9,
|
"end_col_offset_idx": 9,
|
||||||
"text": "Phone 2",
|
"text": "Phone 2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -159,7 +159,7 @@
|
|||||||
"start_col_offset_idx": 9,
|
"start_col_offset_idx": 9,
|
||||||
"end_col_offset_idx": 10,
|
"end_col_offset_idx": 10,
|
||||||
"text": "Email",
|
"text": "Email",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -171,7 +171,7 @@
|
|||||||
"start_col_offset_idx": 10,
|
"start_col_offset_idx": 10,
|
||||||
"end_col_offset_idx": 11,
|
"end_col_offset_idx": 11,
|
||||||
"text": "Subscription Date",
|
"text": "Subscription Date",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -183,7 +183,7 @@
|
|||||||
"start_col_offset_idx": 11,
|
"start_col_offset_idx": 11,
|
||||||
"end_col_offset_idx": 12,
|
"end_col_offset_idx": 12,
|
||||||
"text": "Website",
|
"text": "Website",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -920,7 +920,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Index",
|
"text": "Index",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -932,7 +932,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Customer Id",
|
"text": "Customer Id",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -944,7 +944,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "First Name",
|
"text": "First Name",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -956,7 +956,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "Last Name",
|
"text": "Last Name",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -968,7 +968,7 @@
|
|||||||
"start_col_offset_idx": 4,
|
"start_col_offset_idx": 4,
|
||||||
"end_col_offset_idx": 5,
|
"end_col_offset_idx": 5,
|
||||||
"text": "Company",
|
"text": "Company",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -980,7 +980,7 @@
|
|||||||
"start_col_offset_idx": 5,
|
"start_col_offset_idx": 5,
|
||||||
"end_col_offset_idx": 6,
|
"end_col_offset_idx": 6,
|
||||||
"text": "City",
|
"text": "City",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -992,7 +992,7 @@
|
|||||||
"start_col_offset_idx": 6,
|
"start_col_offset_idx": 6,
|
||||||
"end_col_offset_idx": 7,
|
"end_col_offset_idx": 7,
|
||||||
"text": "Country",
|
"text": "Country",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1004,7 +1004,7 @@
|
|||||||
"start_col_offset_idx": 7,
|
"start_col_offset_idx": 7,
|
||||||
"end_col_offset_idx": 8,
|
"end_col_offset_idx": 8,
|
||||||
"text": "Phone 1",
|
"text": "Phone 1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1016,7 +1016,7 @@
|
|||||||
"start_col_offset_idx": 8,
|
"start_col_offset_idx": 8,
|
||||||
"end_col_offset_idx": 9,
|
"end_col_offset_idx": 9,
|
||||||
"text": "Phone 2",
|
"text": "Phone 2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1028,7 +1028,7 @@
|
|||||||
"start_col_offset_idx": 9,
|
"start_col_offset_idx": 9,
|
||||||
"end_col_offset_idx": 10,
|
"end_col_offset_idx": 10,
|
||||||
"text": "Email",
|
"text": "Email",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1040,7 +1040,7 @@
|
|||||||
"start_col_offset_idx": 10,
|
"start_col_offset_idx": 10,
|
||||||
"end_col_offset_idx": 11,
|
"end_col_offset_idx": 11,
|
||||||
"text": "Subscription Date",
|
"text": "Subscription Date",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1052,7 +1052,7 @@
|
|||||||
"start_col_offset_idx": 11,
|
"start_col_offset_idx": 11,
|
||||||
"end_col_offset_idx": 12,
|
"end_col_offset_idx": 12,
|
||||||
"text": "Website",
|
"text": "Website",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
|
@ -51,7 +51,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Index",
|
"text": "Index",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -63,7 +63,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Customer Id",
|
"text": "Customer Id",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -75,7 +75,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "First Name",
|
"text": "First Name",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -87,7 +87,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "Last Name",
|
"text": "Last Name",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -99,7 +99,7 @@
|
|||||||
"start_col_offset_idx": 4,
|
"start_col_offset_idx": 4,
|
||||||
"end_col_offset_idx": 5,
|
"end_col_offset_idx": 5,
|
||||||
"text": "Company",
|
"text": "Company",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -111,7 +111,7 @@
|
|||||||
"start_col_offset_idx": 5,
|
"start_col_offset_idx": 5,
|
||||||
"end_col_offset_idx": 6,
|
"end_col_offset_idx": 6,
|
||||||
"text": "City",
|
"text": "City",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -123,7 +123,7 @@
|
|||||||
"start_col_offset_idx": 6,
|
"start_col_offset_idx": 6,
|
||||||
"end_col_offset_idx": 7,
|
"end_col_offset_idx": 7,
|
||||||
"text": "Country",
|
"text": "Country",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -135,7 +135,7 @@
|
|||||||
"start_col_offset_idx": 7,
|
"start_col_offset_idx": 7,
|
||||||
"end_col_offset_idx": 8,
|
"end_col_offset_idx": 8,
|
||||||
"text": "Phone 1",
|
"text": "Phone 1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -147,7 +147,7 @@
|
|||||||
"start_col_offset_idx": 8,
|
"start_col_offset_idx": 8,
|
||||||
"end_col_offset_idx": 9,
|
"end_col_offset_idx": 9,
|
||||||
"text": "Phone 2",
|
"text": "Phone 2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -159,7 +159,7 @@
|
|||||||
"start_col_offset_idx": 9,
|
"start_col_offset_idx": 9,
|
||||||
"end_col_offset_idx": 10,
|
"end_col_offset_idx": 10,
|
||||||
"text": "Email",
|
"text": "Email",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -171,7 +171,7 @@
|
|||||||
"start_col_offset_idx": 10,
|
"start_col_offset_idx": 10,
|
||||||
"end_col_offset_idx": 11,
|
"end_col_offset_idx": 11,
|
||||||
"text": "Subscription Date",
|
"text": "Subscription Date",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -183,7 +183,7 @@
|
|||||||
"start_col_offset_idx": 11,
|
"start_col_offset_idx": 11,
|
||||||
"end_col_offset_idx": 12,
|
"end_col_offset_idx": 12,
|
||||||
"text": "Website",
|
"text": "Website",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -920,7 +920,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Index",
|
"text": "Index",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -932,7 +932,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Customer Id",
|
"text": "Customer Id",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -944,7 +944,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "First Name",
|
"text": "First Name",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -956,7 +956,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "Last Name",
|
"text": "Last Name",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -968,7 +968,7 @@
|
|||||||
"start_col_offset_idx": 4,
|
"start_col_offset_idx": 4,
|
||||||
"end_col_offset_idx": 5,
|
"end_col_offset_idx": 5,
|
||||||
"text": "Company",
|
"text": "Company",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -980,7 +980,7 @@
|
|||||||
"start_col_offset_idx": 5,
|
"start_col_offset_idx": 5,
|
||||||
"end_col_offset_idx": 6,
|
"end_col_offset_idx": 6,
|
||||||
"text": "City",
|
"text": "City",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -992,7 +992,7 @@
|
|||||||
"start_col_offset_idx": 6,
|
"start_col_offset_idx": 6,
|
||||||
"end_col_offset_idx": 7,
|
"end_col_offset_idx": 7,
|
||||||
"text": "Country",
|
"text": "Country",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1004,7 +1004,7 @@
|
|||||||
"start_col_offset_idx": 7,
|
"start_col_offset_idx": 7,
|
||||||
"end_col_offset_idx": 8,
|
"end_col_offset_idx": 8,
|
||||||
"text": "Phone 1",
|
"text": "Phone 1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1016,7 +1016,7 @@
|
|||||||
"start_col_offset_idx": 8,
|
"start_col_offset_idx": 8,
|
||||||
"end_col_offset_idx": 9,
|
"end_col_offset_idx": 9,
|
||||||
"text": "Phone 2",
|
"text": "Phone 2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1028,7 +1028,7 @@
|
|||||||
"start_col_offset_idx": 9,
|
"start_col_offset_idx": 9,
|
||||||
"end_col_offset_idx": 10,
|
"end_col_offset_idx": 10,
|
||||||
"text": "Email",
|
"text": "Email",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1040,7 +1040,7 @@
|
|||||||
"start_col_offset_idx": 10,
|
"start_col_offset_idx": 10,
|
||||||
"end_col_offset_idx": 11,
|
"end_col_offset_idx": 11,
|
||||||
"text": "Subscription Date",
|
"text": "Subscription Date",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1052,7 +1052,7 @@
|
|||||||
"start_col_offset_idx": 11,
|
"start_col_offset_idx": 11,
|
||||||
"end_col_offset_idx": 12,
|
"end_col_offset_idx": 12,
|
||||||
"text": "Website",
|
"text": "Website",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
|
@ -51,7 +51,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Index",
|
"text": "Index",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -63,7 +63,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Customer Id",
|
"text": "Customer Id",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -75,7 +75,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "First Name",
|
"text": "First Name",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -87,7 +87,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "Last Name",
|
"text": "Last Name",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -99,7 +99,7 @@
|
|||||||
"start_col_offset_idx": 4,
|
"start_col_offset_idx": 4,
|
||||||
"end_col_offset_idx": 5,
|
"end_col_offset_idx": 5,
|
||||||
"text": "Company",
|
"text": "Company",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -111,7 +111,7 @@
|
|||||||
"start_col_offset_idx": 5,
|
"start_col_offset_idx": 5,
|
||||||
"end_col_offset_idx": 6,
|
"end_col_offset_idx": 6,
|
||||||
"text": "City",
|
"text": "City",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -123,7 +123,7 @@
|
|||||||
"start_col_offset_idx": 6,
|
"start_col_offset_idx": 6,
|
||||||
"end_col_offset_idx": 7,
|
"end_col_offset_idx": 7,
|
||||||
"text": "Country",
|
"text": "Country",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -135,7 +135,7 @@
|
|||||||
"start_col_offset_idx": 7,
|
"start_col_offset_idx": 7,
|
||||||
"end_col_offset_idx": 8,
|
"end_col_offset_idx": 8,
|
||||||
"text": "Phone 1",
|
"text": "Phone 1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -147,7 +147,7 @@
|
|||||||
"start_col_offset_idx": 8,
|
"start_col_offset_idx": 8,
|
||||||
"end_col_offset_idx": 9,
|
"end_col_offset_idx": 9,
|
||||||
"text": "Phone 2",
|
"text": "Phone 2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -159,7 +159,7 @@
|
|||||||
"start_col_offset_idx": 9,
|
"start_col_offset_idx": 9,
|
||||||
"end_col_offset_idx": 10,
|
"end_col_offset_idx": 10,
|
||||||
"text": "Email",
|
"text": "Email",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -171,7 +171,7 @@
|
|||||||
"start_col_offset_idx": 10,
|
"start_col_offset_idx": 10,
|
||||||
"end_col_offset_idx": 11,
|
"end_col_offset_idx": 11,
|
||||||
"text": "Subscription Date",
|
"text": "Subscription Date",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -183,7 +183,7 @@
|
|||||||
"start_col_offset_idx": 11,
|
"start_col_offset_idx": 11,
|
||||||
"end_col_offset_idx": 12,
|
"end_col_offset_idx": 12,
|
||||||
"text": "Website",
|
"text": "Website",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -920,7 +920,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Index",
|
"text": "Index",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -932,7 +932,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Customer Id",
|
"text": "Customer Id",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -944,7 +944,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "First Name",
|
"text": "First Name",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -956,7 +956,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "Last Name",
|
"text": "Last Name",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -968,7 +968,7 @@
|
|||||||
"start_col_offset_idx": 4,
|
"start_col_offset_idx": 4,
|
||||||
"end_col_offset_idx": 5,
|
"end_col_offset_idx": 5,
|
||||||
"text": "Company",
|
"text": "Company",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -980,7 +980,7 @@
|
|||||||
"start_col_offset_idx": 5,
|
"start_col_offset_idx": 5,
|
||||||
"end_col_offset_idx": 6,
|
"end_col_offset_idx": 6,
|
||||||
"text": "City",
|
"text": "City",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -992,7 +992,7 @@
|
|||||||
"start_col_offset_idx": 6,
|
"start_col_offset_idx": 6,
|
||||||
"end_col_offset_idx": 7,
|
"end_col_offset_idx": 7,
|
||||||
"text": "Country",
|
"text": "Country",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1004,7 +1004,7 @@
|
|||||||
"start_col_offset_idx": 7,
|
"start_col_offset_idx": 7,
|
||||||
"end_col_offset_idx": 8,
|
"end_col_offset_idx": 8,
|
||||||
"text": "Phone 1",
|
"text": "Phone 1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1016,7 +1016,7 @@
|
|||||||
"start_col_offset_idx": 8,
|
"start_col_offset_idx": 8,
|
||||||
"end_col_offset_idx": 9,
|
"end_col_offset_idx": 9,
|
||||||
"text": "Phone 2",
|
"text": "Phone 2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1028,7 +1028,7 @@
|
|||||||
"start_col_offset_idx": 9,
|
"start_col_offset_idx": 9,
|
||||||
"end_col_offset_idx": 10,
|
"end_col_offset_idx": 10,
|
||||||
"text": "Email",
|
"text": "Email",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1040,7 +1040,7 @@
|
|||||||
"start_col_offset_idx": 10,
|
"start_col_offset_idx": 10,
|
||||||
"end_col_offset_idx": 11,
|
"end_col_offset_idx": 11,
|
||||||
"text": "Subscription Date",
|
"text": "Subscription Date",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1052,7 +1052,7 @@
|
|||||||
"start_col_offset_idx": 11,
|
"start_col_offset_idx": 11,
|
||||||
"end_col_offset_idx": 12,
|
"end_col_offset_idx": 12,
|
||||||
"text": "Website",
|
"text": "Website",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
|
@ -51,7 +51,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "1",
|
"text": "1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -63,7 +63,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "2",
|
"text": "2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -75,7 +75,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "3",
|
"text": "3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -87,7 +87,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "4",
|
"text": "4",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -284,7 +284,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "1",
|
"text": "1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -296,7 +296,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "2",
|
"text": "2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -308,7 +308,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "3",
|
"text": "3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -320,7 +320,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "4",
|
"text": "4",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
|
@ -51,7 +51,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "1",
|
"text": "1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -63,7 +63,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "2",
|
"text": "2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -75,7 +75,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "3",
|
"text": "3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -87,7 +87,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "4",
|
"text": "4",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -308,7 +308,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "1",
|
"text": "1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -320,7 +320,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "2",
|
"text": "2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -332,7 +332,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "3",
|
"text": "3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -344,7 +344,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "4",
|
"text": "4",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
|
40
tests/data/groundtruth/docling_v2/equations.docx.itxt
Normal file
@ -0,0 +1,40 @@
|
|||||||
|
item-0 at level 0: unspecified: group _root_
|
||||||
|
item-1 at level 1: inline: group group
|
||||||
|
item-2 at level 2: paragraph: This is a word document and this is an inline equation:
|
||||||
|
item-3 at level 2: formula: A= \pi r^{2}
|
||||||
|
item-4 at level 2: paragraph: . If instead, I want an equation by line, I can do this:
|
||||||
|
item-5 at level 1: paragraph:
|
||||||
|
item-6 at level 1: formula: a^{2}+b^{2}=c^{2} \text{ \texttimes } 23
|
||||||
|
item-7 at level 1: paragraph: And that is an equation by itself. Cheers!
|
||||||
|
item-8 at level 1: paragraph:
|
||||||
|
item-9 at level 1: paragraph: This is another equation:
|
||||||
|
item-10 at level 1: formula: f\left(x\right)=a_{0}+\sum_{n=1} ... })+b_{n}\sin(\frac{n \pi x}{L})\right)
|
||||||
|
item-11 at level 1: paragraph:
|
||||||
|
item-12 at level 1: paragraph: This is text. This is text. This ... s is text. This is text. This is text.
|
||||||
|
item-13 at level 1: paragraph:
|
||||||
|
item-14 at level 1: paragraph:
|
||||||
|
item-15 at level 1: inline: group group
|
||||||
|
item-16 at level 2: paragraph: This is a word document and this is an inline equation:
|
||||||
|
item-17 at level 2: formula: A= \pi r^{2}
|
||||||
|
item-18 at level 2: paragraph: . If instead, I want an equation by line, I can do this:
|
||||||
|
item-19 at level 1: paragraph:
|
||||||
|
item-20 at level 1: formula: \left(x+a\right)^{n}=\sum_{k=0}^ ... ac{}{}{0pt}{}{n}{k}\right)x^{k}a^{n-k}
|
||||||
|
item-21 at level 1: paragraph:
|
||||||
|
item-22 at level 1: paragraph: And that is an equation by itself. Cheers!
|
||||||
|
item-23 at level 1: paragraph:
|
||||||
|
item-24 at level 1: paragraph: This is another equation:
|
||||||
|
item-25 at level 1: paragraph:
|
||||||
|
item-26 at level 1: formula: \left(1+x\right)^{n}=1+\frac{nx} ... ght)x^{2}}{2!}+ \text{ \textellipsis }
|
||||||
|
item-27 at level 1: paragraph:
|
||||||
|
item-28 at level 1: paragraph: This is text. This is text. This ... s is text. This is text. This is text.
|
||||||
|
item-29 at level 1: paragraph:
|
||||||
|
item-30 at level 1: paragraph:
|
||||||
|
item-31 at level 1: inline: group group
|
||||||
|
item-32 at level 2: paragraph: This is a word document and this is an inline equation:
|
||||||
|
item-33 at level 2: formula: A= \pi r^{2}
|
||||||
|
item-34 at level 2: paragraph: . If instead, I want an equation by line, I can do this:
|
||||||
|
item-35 at level 1: paragraph:
|
||||||
|
item-36 at level 1: formula: e^{x}=1+\frac{x}{1!}+\frac{x^{2} ... xtellipsis } , - \infty < x < \infty
|
||||||
|
item-37 at level 1: paragraph:
|
||||||
|
item-38 at level 1: paragraph: And that is an equation by itself. Cheers!
|
||||||
|
item-39 at level 1: paragraph:
|
616
tests/data/groundtruth/docling_v2/equations.docx.json
Normal file
@ -0,0 +1,616 @@
|
|||||||
|
{
|
||||||
|
"schema_name": "DoclingDocument",
|
||||||
|
"version": "1.2.0",
|
||||||
|
"name": "equations",
|
||||||
|
"origin": {
|
||||||
|
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||||
|
"binary_hash": 11121138535595486899,
|
||||||
|
"filename": "equations.docx"
|
||||||
|
},
|
||||||
|
"furniture": {
|
||||||
|
"self_ref": "#/furniture",
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "furniture",
|
||||||
|
"name": "_root_",
|
||||||
|
"label": "unspecified"
|
||||||
|
},
|
||||||
|
"body": {
|
||||||
|
"self_ref": "#/body",
|
||||||
|
"children": [
|
||||||
|
{
|
||||||
|
"$ref": "#/groups/0"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/3"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/4"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/5"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/6"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/7"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/8"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/9"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/10"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/11"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/12"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/groups/1"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/16"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/17"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/18"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/19"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/20"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/21"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/22"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/23"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/24"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/25"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/26"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/27"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/groups/2"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/31"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/32"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/33"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/34"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/35"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"content_layer": "body",
|
||||||
|
"name": "_root_",
|
||||||
|
"label": "unspecified"
|
||||||
|
},
|
||||||
|
"groups": [
|
||||||
|
{
|
||||||
|
"self_ref": "#/groups/0",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/0"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/1"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/2"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"content_layer": "body",
|
||||||
|
"name": "group",
|
||||||
|
"label": "inline"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/groups/1",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/13"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/14"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/15"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"content_layer": "body",
|
||||||
|
"name": "group",
|
||||||
|
"label": "inline"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/groups/2",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/28"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/29"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/30"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"content_layer": "body",
|
||||||
|
"name": "group",
|
||||||
|
"label": "inline"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"texts": [
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/0",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/0"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "This is a word document and this is an inline equation: ",
|
||||||
|
"text": "This is a word document and this is an inline equation: "
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/1",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/0"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "formula",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "A= \\pi r^{2} ",
|
||||||
|
"text": "A= \\pi r^{2} "
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/2",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/0"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": ". If instead, I want an equation by line, I can do this:",
|
||||||
|
"text": ". If instead, I want an equation by line, I can do this:"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/3",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/4",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "formula",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "a^{2}+b^{2}=c^{2} \\text{ \\texttimes } 23",
|
||||||
|
"text": "a^{2}+b^{2}=c^{2} \\text{ \\texttimes } 23"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/5",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "And that is an equation by itself. Cheers!",
|
||||||
|
"text": "And that is an equation by itself. Cheers!"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/6",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/7",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "This is another equation:",
|
||||||
|
"text": "This is another equation:"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/8",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "formula",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "f\\left(x\\right)=a_{0}+\\sum_{n=1}^{ \\infty }\\left(a_{n}\\cos(\\frac{n \\pi x}{L})+b_{n}\\sin(\\frac{n \\pi x}{L})\\right)",
|
||||||
|
"text": "f\\left(x\\right)=a_{0}+\\sum_{n=1}^{ \\infty }\\left(a_{n}\\cos(\\frac{n \\pi x}{L})+b_{n}\\sin(\\frac{n \\pi x}{L})\\right)"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/9",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/10",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text.",
|
||||||
|
"text": "This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text."
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/11",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/12",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/13",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/1"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "This is a word document and this is an inline equation: ",
|
||||||
|
"text": "This is a word document and this is an inline equation: "
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/14",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/1"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "formula",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "A= \\pi r^{2} ",
|
||||||
|
"text": "A= \\pi r^{2} "
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/15",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/1"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": ". If instead, I want an equation by line, I can do this:",
|
||||||
|
"text": ". If instead, I want an equation by line, I can do this:"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/16",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/17",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "formula",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "\\left(x+a\\right)^{n}=\\sum_{k=0}^{n}\\left(\\genfrac{}{}{0pt}{}{n}{k}\\right)x^{k}a^{n-k}",
|
||||||
|
"text": "\\left(x+a\\right)^{n}=\\sum_{k=0}^{n}\\left(\\genfrac{}{}{0pt}{}{n}{k}\\right)x^{k}a^{n-k}"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/18",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/19",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "And that is an equation by itself. Cheers!",
|
||||||
|
"text": "And that is an equation by itself. Cheers!"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/20",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/21",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "This is another equation:",
|
||||||
|
"text": "This is another equation:"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/22",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/23",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "formula",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "\\left(1+x\\right)^{n}=1+\\frac{nx}{1!}+\\frac{n\\left(n-1\\right)x^{2}}{2!}+ \\text{ \\textellipsis }",
|
||||||
|
"text": "\\left(1+x\\right)^{n}=1+\\frac{nx}{1!}+\\frac{n\\left(n-1\\right)x^{2}}{2!}+ \\text{ \\textellipsis }"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/24",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/25",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text.",
|
||||||
|
"text": "This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text."
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/26",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/27",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/28",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/2"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "This is a word document and this is an inline equation: ",
|
||||||
|
"text": "This is a word document and this is an inline equation: "
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/29",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/2"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "formula",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "A= \\pi r^{2} ",
|
||||||
|
"text": "A= \\pi r^{2} "
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/30",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/2"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": ". If instead, I want an equation by line, I can do this:",
|
||||||
|
"text": ". If instead, I want an equation by line, I can do this:"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/31",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/32",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "formula",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "e^{x}=1+\\frac{x}{1!}+\\frac{x^{2}}{2!}+\\frac{x^{3}}{3!}+ \\text{ \\textellipsis } , - \\infty < x < \\infty",
|
||||||
|
"text": "e^{x}=1+\\frac{x}{1!}+\\frac{x^{2}}{2!}+\\frac{x^{3}}{3!}+ \\text{ \\textellipsis } , - \\infty < x < \\infty"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/33",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/34",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "And that is an equation by itself. Cheers!",
|
||||||
|
"text": "And that is an equation by itself. Cheers!"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/35",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "paragraph",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "",
|
||||||
|
"text": ""
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"pictures": [],
|
||||||
|
"tables": [],
|
||||||
|
"key_value_items": [],
|
||||||
|
"form_items": [],
|
||||||
|
"pages": {}
|
||||||
|
}
|
29
tests/data/groundtruth/docling_v2/equations.docx.md
Normal file
@ -0,0 +1,29 @@
|
|||||||
|
This is a word document and this is an inline equation: $A= \pi r^{2} $ . If instead, I want an equation by line, I can do this:
|
||||||
|
|
||||||
|
$$a^{2}+b^{2}=c^{2} \text{ \texttimes } 23$$
|
||||||
|
|
||||||
|
And that is an equation by itself. Cheers!
|
||||||
|
|
||||||
|
This is another equation:
|
||||||
|
|
||||||
|
$$f\left(x\right)=a_{0}+\sum_{n=1}^{ \infty }\left(a_{n}\cos(\frac{n \pi x}{L})+b_{n}\sin(\frac{n \pi x}{L})\right)$$
|
||||||
|
|
||||||
|
This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text.
|
||||||
|
|
||||||
|
This is a word document and this is an inline equation: $A= \pi r^{2} $ . If instead, I want an equation by line, I can do this:
|
||||||
|
|
||||||
|
$$\left(x+a\right)^{n}=\sum_{k=0}^{n}\left(\genfrac{}{}{0pt}{}{n}{k}\right)x^{k}a^{n-k}$$
|
||||||
|
|
||||||
|
And that is an equation by itself. Cheers!
|
||||||
|
|
||||||
|
This is another equation:
|
||||||
|
|
||||||
|
$$\left(1+x\right)^{n}=1+\frac{nx}{1!}+\frac{n\left(n-1\right)x^{2}}{2!}+ \text{ \textellipsis }$$
|
||||||
|
|
||||||
|
This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text. This is text.
|
||||||
|
|
||||||
|
This is a word document and this is an inline equation: $A= \pi r^{2} $ . If instead, I want an equation by line, I can do this:
|
||||||
|
|
||||||
|
$$e^{x}=1+\frac{x}{1!}+\frac{x^{2}}{2!}+\frac{x^{3}}{3!}+ \text{ \textellipsis } , - \infty < x < \infty$$
|
||||||
|
|
||||||
|
And that is an equation by itself. Cheers!
|
@ -344,7 +344,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Header 1",
|
"text": "Header 1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -356,7 +356,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Header 2",
|
"text": "Header 2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -368,7 +368,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Header 3",
|
"text": "Header 3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -493,7 +493,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Header 1",
|
"text": "Header 1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -505,7 +505,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Header 2",
|
"text": "Header 2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -517,7 +517,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Header 3",
|
"text": "Header 3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
|
@ -68,7 +68,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Header 1",
|
"text": "Header 1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -80,7 +80,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Header 2 & 3 (colspan)",
|
"text": "Header 2 & 3 (colspan)",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -181,7 +181,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Header 1",
|
"text": "Header 1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -193,7 +193,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Header 2 & 3 (colspan)",
|
"text": "Header 2 & 3 (colspan)",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -205,7 +205,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Header 2 & 3 (colspan)",
|
"text": "Header 2 & 3 (colspan)",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
|
@ -68,7 +68,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Header 1",
|
"text": "Header 1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -80,7 +80,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Header 2 & 3 (colspan)",
|
"text": "Header 2 & 3 (colspan)",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -181,7 +181,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Header 1",
|
"text": "Header 1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -193,7 +193,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Header 2 & 3 (colspan)",
|
"text": "Header 2 & 3 (colspan)",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -205,7 +205,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Header 2 & 3 (colspan)",
|
"text": "Header 2 & 3 (colspan)",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
|
22
tests/data/groundtruth/docling_v2/example_07.html.itxt
Normal file
@ -0,0 +1,22 @@
|
|||||||
|
item-0 at level 0: unspecified: group _root_
|
||||||
|
item-1 at level 1: list: group list
|
||||||
|
item-2 at level 2: list_item: Asia
|
||||||
|
item-3 at level 3: list: group list
|
||||||
|
item-4 at level 4: list_item: China
|
||||||
|
item-5 at level 4: list_item: Japan
|
||||||
|
item-6 at level 4: list_item: Thailand
|
||||||
|
item-7 at level 2: list_item: Europe
|
||||||
|
item-8 at level 3: list: group list
|
||||||
|
item-9 at level 4: list_item: UK
|
||||||
|
item-10 at level 4: list_item: Germany
|
||||||
|
item-11 at level 4: list_item: Switzerland
|
||||||
|
item-12 at level 5: list: group list
|
||||||
|
item-13 at level 6: list: group list
|
||||||
|
item-14 at level 7: list_item: Bern
|
||||||
|
item-15 at level 7: list_item: Aargau
|
||||||
|
item-16 at level 4: list_item: Italy
|
||||||
|
item-17 at level 5: list: group list
|
||||||
|
item-18 at level 6: list: group list
|
||||||
|
item-19 at level 7: list_item: Piedmont
|
||||||
|
item-20 at level 7: list_item: Liguria
|
||||||
|
item-21 at level 2: list_item: Africa
|
374
tests/data/groundtruth/docling_v2/example_07.html.json
Normal file
@ -0,0 +1,374 @@
|
|||||||
|
{
|
||||||
|
"schema_name": "DoclingDocument",
|
||||||
|
"version": "1.2.0",
|
||||||
|
"name": "example_07",
|
||||||
|
"origin": {
|
||||||
|
"mimetype": "text/html",
|
||||||
|
"binary_hash": 623628706615267627,
|
||||||
|
"filename": "example_07.html"
|
||||||
|
},
|
||||||
|
"furniture": {
|
||||||
|
"self_ref": "#/furniture",
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "furniture",
|
||||||
|
"name": "_root_",
|
||||||
|
"label": "unspecified"
|
||||||
|
},
|
||||||
|
"body": {
|
||||||
|
"self_ref": "#/body",
|
||||||
|
"children": [
|
||||||
|
{
|
||||||
|
"$ref": "#/groups/0"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"content_layer": "body",
|
||||||
|
"name": "_root_",
|
||||||
|
"label": "unspecified"
|
||||||
|
},
|
||||||
|
"groups": [
|
||||||
|
{
|
||||||
|
"self_ref": "#/groups/0",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/body"
|
||||||
|
},
|
||||||
|
"children": [
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/0"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/4"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/13"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"content_layer": "body",
|
||||||
|
"name": "list",
|
||||||
|
"label": "list"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/groups/1",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/texts/0"
|
||||||
|
},
|
||||||
|
"children": [
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/1"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/2"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/3"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"content_layer": "body",
|
||||||
|
"name": "list",
|
||||||
|
"label": "list"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/groups/2",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/texts/4"
|
||||||
|
},
|
||||||
|
"children": [
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/5"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/6"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/7"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/10"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"content_layer": "body",
|
||||||
|
"name": "list",
|
||||||
|
"label": "list"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/groups/3",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/texts/7"
|
||||||
|
},
|
||||||
|
"children": [
|
||||||
|
{
|
||||||
|
"$ref": "#/groups/4"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"content_layer": "body",
|
||||||
|
"name": "list",
|
||||||
|
"label": "list"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/groups/4",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/3"
|
||||||
|
},
|
||||||
|
"children": [
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/8"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/9"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"content_layer": "body",
|
||||||
|
"name": "list",
|
||||||
|
"label": "list"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/groups/5",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/texts/10"
|
||||||
|
},
|
||||||
|
"children": [
|
||||||
|
{
|
||||||
|
"$ref": "#/groups/6"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"content_layer": "body",
|
||||||
|
"name": "list",
|
||||||
|
"label": "list"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/groups/6",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/5"
|
||||||
|
},
|
||||||
|
"children": [
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/11"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"$ref": "#/texts/12"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"content_layer": "body",
|
||||||
|
"name": "list",
|
||||||
|
"label": "list"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"texts": [
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/0",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/0"
|
||||||
|
},
|
||||||
|
"children": [
|
||||||
|
{
|
||||||
|
"$ref": "#/groups/1"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "list_item",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "Asia",
|
||||||
|
"text": "Asia",
|
||||||
|
"enumerated": false,
|
||||||
|
"marker": "-"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/1",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/1"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "list_item",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "China",
|
||||||
|
"text": "China",
|
||||||
|
"enumerated": false,
|
||||||
|
"marker": "-"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/2",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/1"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "list_item",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "Japan",
|
||||||
|
"text": "Japan",
|
||||||
|
"enumerated": false,
|
||||||
|
"marker": "-"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/3",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/1"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "list_item",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "Thailand",
|
||||||
|
"text": "Thailand",
|
||||||
|
"enumerated": false,
|
||||||
|
"marker": "-"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/4",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/0"
|
||||||
|
},
|
||||||
|
"children": [
|
||||||
|
{
|
||||||
|
"$ref": "#/groups/2"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "list_item",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "Europe",
|
||||||
|
"text": "Europe",
|
||||||
|
"enumerated": false,
|
||||||
|
"marker": "-"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/5",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/2"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "list_item",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "UK",
|
||||||
|
"text": "UK",
|
||||||
|
"enumerated": false,
|
||||||
|
"marker": "-"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/6",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/2"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "list_item",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "Germany",
|
||||||
|
"text": "Germany",
|
||||||
|
"enumerated": false,
|
||||||
|
"marker": "-"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/7",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/2"
|
||||||
|
},
|
||||||
|
"children": [
|
||||||
|
{
|
||||||
|
"$ref": "#/groups/3"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "list_item",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "Switzerland",
|
||||||
|
"text": "Switzerland",
|
||||||
|
"enumerated": false,
|
||||||
|
"marker": "-"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/8",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/4"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "list_item",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "Bern",
|
||||||
|
"text": "Bern",
|
||||||
|
"enumerated": false,
|
||||||
|
"marker": "-"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/9",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/4"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "list_item",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "Aargau",
|
||||||
|
"text": "Aargau",
|
||||||
|
"enumerated": false,
|
||||||
|
"marker": "-"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/10",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/2"
|
||||||
|
},
|
||||||
|
"children": [
|
||||||
|
{
|
||||||
|
"$ref": "#/groups/5"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "list_item",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "Italy",
|
||||||
|
"text": "Italy",
|
||||||
|
"enumerated": false,
|
||||||
|
"marker": "-"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/11",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/6"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "list_item",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "Piedmont",
|
||||||
|
"text": "Piedmont",
|
||||||
|
"enumerated": false,
|
||||||
|
"marker": "-"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/12",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/6"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "list_item",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "Liguria",
|
||||||
|
"text": "Liguria",
|
||||||
|
"enumerated": false,
|
||||||
|
"marker": "-"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"self_ref": "#/texts/13",
|
||||||
|
"parent": {
|
||||||
|
"$ref": "#/groups/0"
|
||||||
|
},
|
||||||
|
"children": [],
|
||||||
|
"content_layer": "body",
|
||||||
|
"label": "list_item",
|
||||||
|
"prov": [],
|
||||||
|
"orig": "Africa",
|
||||||
|
"text": "Africa",
|
||||||
|
"enumerated": false,
|
||||||
|
"marker": "-"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"pictures": [],
|
||||||
|
"tables": [],
|
||||||
|
"key_value_items": [],
|
||||||
|
"form_items": [],
|
||||||
|
"pages": {}
|
||||||
|
}
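The JSON groundtruth above stores the nested list as flat `groups` and `texts` arrays linked by `$ref` pointers, while the `.itxt` file lists the same items with an explicit nesting level. The following is a minimal sketch of how such pointers could be followed to reproduce the `item-N at level L` lines. It is not Docling's own export code: the helper names `resolve`, `iter_tree`, `describe`, and `dump` are hypothetical, and `SAMPLE` is a trimmed-down stand-in for the real file that keeps only the fields the sketch reads.

```python
def resolve(doc, ref):
    """Follow a reference such as '#/texts/4' or '#/body' into the document dict."""
    if not ref.startswith("#/"):
        raise ValueError(f"unsupported reference: {ref!r}")
    node = doc
    for part in ref[2:].split("/"):
        # List containers ('groups', 'texts') are indexed by integer position.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def iter_tree(doc, ref="#/body", level=0):
    """Yield (level, node) pairs in depth-first order, starting at the body root."""
    node = resolve(doc, ref)
    yield level, node
    for child in node.get("children", []):
        yield from iter_tree(doc, child["$ref"], level + 1)


def describe(node):
    """Text items show their text; groups (and the root) show their name."""
    if "text" in node:
        return f"{node['label']}: {node['text']}"
    return f"{node['label']}: group {node['name']}"


def dump(doc):
    """Return lines in the 'item-N at level L: ...' style of the .itxt file."""
    return [
        f"item-{i} at level {level}: {describe(node)}"
        for i, (level, node) in enumerate(iter_tree(doc))
    ]


# Hypothetical, trimmed-down document with the same shape as the groundtruth.
SAMPLE = {
    "body": {
        "self_ref": "#/body",
        "children": [{"$ref": "#/groups/0"}],
        "name": "_root_",
        "label": "unspecified",
    },
    "groups": [
        {
            "self_ref": "#/groups/0",
            "children": [{"$ref": "#/texts/0"}, {"$ref": "#/texts/2"}],
            "name": "list",
            "label": "list",
        },
        {
            "self_ref": "#/groups/1",
            "children": [{"$ref": "#/texts/1"}],
            "name": "list",
            "label": "list",
        },
    ],
    "texts": [
        {
            "self_ref": "#/texts/0",
            "children": [{"$ref": "#/groups/1"}],
            "label": "list_item",
            "text": "Asia",
        },
        {"self_ref": "#/texts/1", "children": [], "label": "list_item", "text": "China"},
        {"self_ref": "#/texts/2", "children": [], "label": "list_item", "text": "Africa"},
    ],
}

if __name__ == "__main__":
    print("\n".join(dump(SAMPLE)))
```

Because a list item can itself own a child group (as `Asia` does here), each sub-list sits two levels below the list that contains its parent item, which is the same pattern the `.itxt` listing shows.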
|
14
tests/data/groundtruth/docling_v2/example_07.html.md
Normal file
@ -0,0 +1,14 @@
|
|||||||
|
- Asia
|
||||||
|
- China
|
||||||
|
- Japan
|
||||||
|
- Thailand
|
||||||
|
- Europe
|
||||||
|
- UK
|
||||||
|
- Germany
|
||||||
|
- Switzerland
|
||||||
|
- Bern
|
||||||
|
- Aargau
|
||||||
|
- Italy
|
||||||
|
- Piedmont
|
||||||
|
- Liguria
|
||||||
|
- Africa
|
@ -960,7 +960,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "Class1",
|
"text": "Class1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -972,7 +972,7 @@
|
|||||||
"start_col_offset_idx": 4,
|
"start_col_offset_idx": 4,
|
||||||
"end_col_offset_idx": 7,
|
"end_col_offset_idx": 7,
|
||||||
"text": "Class2",
|
"text": "Class2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1385,7 +1385,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "Class1",
|
"text": "Class1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1397,7 +1397,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "Class1",
|
"text": "Class1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1409,7 +1409,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "Class1",
|
"text": "Class1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1421,7 +1421,7 @@
|
|||||||
"start_col_offset_idx": 4,
|
"start_col_offset_idx": 4,
|
||||||
"end_col_offset_idx": 7,
|
"end_col_offset_idx": 7,
|
||||||
"text": "Class2",
|
"text": "Class2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1433,7 +1433,7 @@
|
|||||||
"start_col_offset_idx": 4,
|
"start_col_offset_idx": 4,
|
||||||
"end_col_offset_idx": 7,
|
"end_col_offset_idx": 7,
|
||||||
"text": "Class2",
|
"text": "Class2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1445,7 +1445,7 @@
|
|||||||
"start_col_offset_idx": 4,
|
"start_col_offset_idx": 4,
|
||||||
"end_col_offset_idx": 7,
|
"end_col_offset_idx": 7,
|
||||||
"text": "Class2",
|
"text": "Class2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
|
@ -176,7 +176,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Tab1",
|
"text": "Tab1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -188,7 +188,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Tab2",
|
"text": "Tab2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -200,7 +200,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Tab3",
|
"text": "Tab3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -289,7 +289,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Tab1",
|
"text": "Tab1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -301,7 +301,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Tab2",
|
"text": "Tab2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -313,7 +313,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Tab3",
|
"text": "Tab3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
|
@ -136,7 +136,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "first ",
|
"text": "first ",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -148,7 +148,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "second ",
|
"text": "second ",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -160,7 +160,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "third",
|
"text": "third",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -393,7 +393,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "first ",
|
"text": "first ",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -405,7 +405,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "second ",
|
"text": "second ",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -417,7 +417,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "third",
|
"text": "third",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
@ -675,7 +675,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "col-1",
|
"text": "col-1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -687,7 +687,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "col-2",
|
"text": "col-2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -699,7 +699,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "col-3",
|
"text": "col-3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -711,7 +711,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "col-4",
|
"text": "col-4",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1112,7 +1112,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "col-1",
|
"text": "col-1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1124,7 +1124,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "col-2",
|
"text": "col-2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1136,7 +1136,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "col-3",
|
"text": "col-3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1148,7 +1148,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "col-4",
|
"text": "col-4",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
@ -1578,7 +1578,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "col-1",
|
"text": "col-1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1590,7 +1590,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "col-2",
|
"text": "col-2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1602,7 +1602,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "col-3",
|
"text": "col-3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1763,7 +1763,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "col-1",
|
"text": "col-1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1775,7 +1775,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "col-2",
|
"text": "col-2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1787,7 +1787,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "col-3",
|
"text": "col-3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
@ -1969,7 +1969,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "col-1",
|
"text": "col-1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1981,7 +1981,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "col-2",
|
"text": "col-2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1993,7 +1993,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "col-3",
|
"text": "col-3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -2154,7 +2154,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "col-1",
|
"text": "col-1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -2166,7 +2166,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "col-2",
|
"text": "col-2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -2178,7 +2178,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "col-3",
|
"text": "col-3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
@ -2360,7 +2360,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "first ",
|
"text": "first ",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -2372,7 +2372,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "header",
|
"text": "header",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -2545,7 +2545,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "first ",
|
"text": "first ",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -2557,7 +2557,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "header",
|
"text": "header",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -2569,7 +2569,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "header",
|
"text": "header",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
@ -2583,7 +2583,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "first ",
|
"text": "first ",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -2827,7 +2827,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "first (f)",
|
"text": "first (f)",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -2839,7 +2839,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "header (f)",
|
"text": "header (f)",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -3012,7 +3012,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "first (f)",
|
"text": "first (f)",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -3024,7 +3024,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "header (f)",
|
"text": "header (f)",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -3036,7 +3036,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "header (f)",
|
"text": "header (f)",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
@ -3050,7 +3050,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "first (f)",
|
"text": "first (f)",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
|
@ -7914,7 +7914,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Duck\n",
|
"text": "Duck\n",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -7950,7 +7950,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Scientific classification \n",
|
"text": "Scientific classification \n",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -8130,7 +8130,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Subfamilies\n",
|
"text": "Subfamilies\n",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -8159,7 +8159,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Duck\n",
|
"text": "Duck\n",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -8171,7 +8171,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Duck\n",
|
"text": "Duck\n",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
@ -8237,7 +8237,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Scientific classification \n",
|
"text": "Scientific classification \n",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -8249,7 +8249,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Scientific classification \n",
|
"text": "Scientific classification \n",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
@ -8445,7 +8445,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Subfamilies\n",
|
"text": "Subfamilies\n",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -8457,7 +8457,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Subfamilies\n",
|
"text": "Subfamilies\n",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
@ -8513,7 +8513,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Authority control databases ",
|
"text": "Authority control databases ",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -8578,7 +8578,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Authority control databases ",
|
"text": "Authority control databases ",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -8590,7 +8590,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Authority control databases ",
|
"text": "Authority control databases ",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
|
@ -490,7 +490,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "",
|
"text": "",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -502,7 +502,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Food",
|
"text": "Food",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -514,7 +514,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Calories per portion",
|
"text": "Calories per portion",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -639,7 +639,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "",
|
"text": "",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -651,7 +651,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Food",
|
"text": "Food",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -663,7 +663,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Calories per portion",
|
"text": "Calories per portion",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
|
@ -71,19 +71,19 @@
|
|||||||
</head>
|
</head>
|
||||||
<h2>Test with tables</h2>
|
<h2>Test with tables</h2>
|
||||||
<p>A uniform table</p>
|
<p>A uniform table</p>
|
||||||
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td></tr><tr><td>Cell 1.0</td><td>Cell 1.1</td><td>Cell 1.2</td></tr><tr><td>Cell 2.0</td><td>Cell 2.1</td><td>Cell 2.2</td></tr></tbody></table>
|
<table><tbody><tr><th>Header 0.0</th><th>Header 0.1</th><th>Header 0.2</th></tr><tr><td>Cell 1.0</td><td>Cell 1.1</td><td>Cell 1.2</td></tr><tr><td>Cell 2.0</td><td>Cell 2.1</td><td>Cell 2.2</td></tr></tbody></table>
|
||||||
<p></p>
|
<p></p>
|
||||||
<p>A non-uniform table with horizontal spans</p>
|
<p>A non-uniform table with horizontal spans</p>
|
||||||
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td></tr><tr><td>Cell 1.0</td><td colspan="2">Merged Cell 1.1 1.2</td></tr><tr><td>Cell 2.0</td><td colspan="2">Merged Cell 2.1 2.2</td></tr></tbody></table>
|
<table><tbody><tr><th>Header 0.0</th><th>Header 0.1</th><th>Header 0.2</th></tr><tr><td>Cell 1.0</td><td colspan="2">Merged Cell 1.1 1.2</td></tr><tr><td>Cell 2.0</td><td colspan="2">Merged Cell 2.1 2.2</td></tr></tbody></table>
|
||||||
<p></p>
|
<p></p>
|
||||||
<p>A non-uniform table with horizontal spans in inner columns</p>
|
<p>A non-uniform table with horizontal spans in inner columns</p>
|
||||||
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td><td>Header 0.3</td></tr><tr><td>Cell 1.0</td><td colspan="2">Merged Cell 1.1 1.2</td><td>Cell 1.3</td></tr><tr><td>Cell 2.0</td><td colspan="2">Merged Cell 2.1 2.2</td><td>Cell 2.3</td></tr></tbody></table>
|
<table><tbody><tr><th>Header 0.0</th><th>Header 0.1</th><th>Header 0.2</th><th>Header 0.3</th></tr><tr><td>Cell 1.0</td><td colspan="2">Merged Cell 1.1 1.2</td><td>Cell 1.3</td></tr><tr><td>Cell 2.0</td><td colspan="2">Merged Cell 2.1 2.2</td><td>Cell 2.3</td></tr></tbody></table>
|
||||||
<p></p>
|
<p></p>
|
||||||
<p>A non-uniform table with vertical spans</p>
|
<p>A non-uniform table with vertical spans</p>
|
||||||
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td></tr><tr><td>Cell 1.0</td><td rowspan="2">Merged Cell 1.1 2.1</td><td>Cell 1.2</td></tr><tr><td>Cell 2.0</td><td>Cell 2.2</td></tr><tr><td>Cell 3.0</td><td rowspan="2">Merged Cell 3.1 4.1</td><td>Cell 3.2</td></tr><tr><td>Cell 4.0</td><td>Cell 4.2</td></tr></tbody></table>
|
<table><tbody><tr><th>Header 0.0</th><th>Header 0.1</th><th>Header 0.2</th></tr><tr><td>Cell 1.0</td><td rowspan="2">Merged Cell 1.1 2.1</td><td>Cell 1.2</td></tr><tr><td>Cell 2.0</td><td>Cell 2.2</td></tr><tr><td>Cell 3.0</td><td rowspan="2">Merged Cell 3.1 4.1</td><td>Cell 3.2</td></tr><tr><td>Cell 4.0</td><td>Cell 4.2</td></tr></tbody></table>
|
||||||
<p></p>
|
<p></p>
|
||||||
<p>A non-uniform table with all kinds of spans and empty cells</p>
|
<p>A non-uniform table with all kinds of spans and empty cells</p>
|
||||||
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td><td></td><td></td></tr><tr><td>Cell 1.0</td><td rowspan="2">Merged Cell 1.1 2.1</td><td>Cell 1.2</td><td></td><td></td></tr><tr><td>Cell 2.0</td><td>Cell 2.2</td><td></td><td></td></tr><tr><td>Cell 3.0</td><td rowspan="2">Merged Cell 3.1 4.1</td><td>Cell 3.2</td><td rowspan="3"></td><td></td></tr><tr><td>Cell 4.0</td><td>Cell 4.2</td><td rowspan="2">Merged Cell 4.4 5.4</td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td></td><td></td></tr><tr><td colspan="5"></td></tr><tr><td></td><td></td><td></td><td></td><td>Cell 8.4</td></tr></tbody></table>
|
<table><tbody><tr><th>Header 0.0</th><th>Header 0.1</th><th>Header 0.2</th><th></th><th></th></tr><tr><td>Cell 1.0</td><td rowspan="2">Merged Cell 1.1 2.1</td><td>Cell 1.2</td><td></td><td></td></tr><tr><td>Cell 2.0</td><td>Cell 2.2</td><td></td><td></td></tr><tr><td>Cell 3.0</td><td rowspan="2">Merged Cell 3.1 4.1</td><td>Cell 3.2</td><td rowspan="3"></td><td></td></tr><tr><td>Cell 4.0</td><td>Cell 4.2</td><td rowspan="2">Merged Cell 4.4 5.4</td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td></td><td></td></tr><tr><td colspan="5"></td></tr><tr><td></td><td></td><td></td><td></td><td>Cell 8.4</td></tr></tbody></table>
|
||||||
<p></p>
|
<p></p>
|
||||||
<p></p>
|
<p></p>
|
||||||
</html>
|
</html>
|
@ -261,7 +261,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Header 0.0",
|
"text": "Header 0.0",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -273,7 +273,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Header 0.1",
|
"text": "Header 0.1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -285,7 +285,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Header 0.2",
|
"text": "Header 0.2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -374,7 +374,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Header 0.0",
|
"text": "Header 0.0",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -386,7 +386,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Header 0.1",
|
"text": "Header 0.1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -398,7 +398,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Header 0.2",
|
"text": "Header 0.2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
@ -504,7 +504,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Header 0.0",
|
"text": "Header 0.0",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -516,7 +516,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Header 0.1",
|
"text": "Header 0.1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -528,7 +528,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Header 0.2",
|
"text": "Header 0.2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -593,7 +593,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Header 0.0",
|
"text": "Header 0.0",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -605,7 +605,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Header 0.1",
|
"text": "Header 0.1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -617,7 +617,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Header 0.2",
|
"text": "Header 0.2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
@ -723,7 +723,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Header 0.0",
|
"text": "Header 0.0",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -735,7 +735,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Header 0.1",
|
"text": "Header 0.1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -747,7 +747,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Header 0.2",
|
"text": "Header 0.2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -759,7 +759,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "Header 0.3",
|
"text": "Header 0.3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -848,7 +848,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Header 0.0",
|
"text": "Header 0.0",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -860,7 +860,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Header 0.1",
|
"text": "Header 0.1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -872,7 +872,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Header 0.2",
|
"text": "Header 0.2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -884,7 +884,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "Header 0.3",
|
"text": "Header 0.3",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
@ -1014,7 +1014,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Header 0.0",
|
"text": "Header 0.0",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1026,7 +1026,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Header 0.1",
|
"text": "Header 0.1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1038,7 +1038,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Header 0.2",
|
"text": "Header 0.2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1175,7 +1175,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Header 0.0",
|
"text": "Header 0.0",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1187,7 +1187,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Header 0.1",
|
"text": "Header 0.1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1199,7 +1199,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Header 0.2",
|
"text": "Header 0.2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
@ -1381,7 +1381,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Header 0.0",
|
"text": "Header 0.0",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1393,7 +1393,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Header 0.1",
|
"text": "Header 0.1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1405,7 +1405,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Header 0.2",
|
"text": "Header 0.2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1417,7 +1417,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "",
|
"text": "",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1429,7 +1429,7 @@
|
|||||||
"start_col_offset_idx": 4,
|
"start_col_offset_idx": 4,
|
||||||
"end_col_offset_idx": 5,
|
"end_col_offset_idx": 5,
|
||||||
"text": "",
|
"text": "",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1818,7 +1818,7 @@
|
|||||||
"start_col_offset_idx": 0,
|
"start_col_offset_idx": 0,
|
||||||
"end_col_offset_idx": 1,
|
"end_col_offset_idx": 1,
|
||||||
"text": "Header 0.0",
|
"text": "Header 0.0",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1830,7 +1830,7 @@
|
|||||||
"start_col_offset_idx": 1,
|
"start_col_offset_idx": 1,
|
||||||
"end_col_offset_idx": 2,
|
"end_col_offset_idx": 2,
|
||||||
"text": "Header 0.1",
|
"text": "Header 0.1",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1842,7 +1842,7 @@
|
|||||||
"start_col_offset_idx": 2,
|
"start_col_offset_idx": 2,
|
||||||
"end_col_offset_idx": 3,
|
"end_col_offset_idx": 3,
|
||||||
"text": "Header 0.2",
|
"text": "Header 0.2",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1854,7 +1854,7 @@
|
|||||||
"start_col_offset_idx": 3,
|
"start_col_offset_idx": 3,
|
||||||
"end_col_offset_idx": 4,
|
"end_col_offset_idx": 4,
|
||||||
"text": "",
|
"text": "",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
},
|
},
|
||||||
@ -1866,7 +1866,7 @@
|
|||||||
"start_col_offset_idx": 4,
|
"start_col_offset_idx": 4,
|
||||||
"end_col_offset_idx": 5,
|
"end_col_offset_idx": 5,
|
||||||
"text": "",
|
"text": "",
|
||||||
"column_header": false,
|
"column_header": true,
|
||||||
"row_header": false,
|
"row_header": false,
|
||||||
"row_section": false
|
"row_section": false
|
||||||
}
|
}
|
||||||
|
40
tests/data/html/example_07.html
Normal file
@ -0,0 +1,40 @@
|
|||||||
|
<html>
|
||||||
|
<body>
|
||||||
|
<ul>
|
||||||
|
<li>Asia
|
||||||
|
<ul>
|
||||||
|
<li>China</li>
|
||||||
|
<li>Japan</li>
|
||||||
|
<li>Thailand</li>
|
||||||
|
</ul>
|
||||||
|
</li>
|
||||||
|
<li>Europe
|
||||||
|
<ul>
|
||||||
|
<li>UK</li>
|
||||||
|
<li>Germany</li>
|
||||||
|
<li>Switzerland
|
||||||
|
<ul>
|
||||||
|
<li style="list-style-type: none;">
|
||||||
|
<ul>
|
||||||
|
<li>Bern</li>
|
||||||
|
<li>Aargau</li>
|
||||||
|
</ul>
|
||||||
|
</li>
|
||||||
|
</ul>
|
||||||
|
</li>
|
||||||
|
<li>Italy
|
||||||
|
<ul>
|
||||||
|
<li style="list-style-type: none;">
|
||||||
|
<ul>
|
||||||
|
<li>Piedmont</li>
|
||||||
|
<li>Liguria</li>
|
||||||
|
</ul>
|
||||||
|
</li>
|
||||||
|
</ul>
|
||||||
|
</li>
|
||||||
|
</ul>
|
||||||
|
</li>
|
||||||
|
<li>Africa</li>
|
||||||
|
</ul>
|
||||||
|
</body>
|
||||||
|
</html>
|
@ -59,7 +59,11 @@ def test_e2e_valid_csv_conversions():
|
|||||||
pred_itxt, str(gt_path) + ".itxt"
|
pred_itxt, str(gt_path) + ".itxt"
|
||||||
), "export to indented-text"
|
), "export to indented-text"
|
||||||
|
|
||||||
assert verify_document(doc, str(gt_path) + ".json"), "export to json"
|
assert verify_document(
|
||||||
|
pred_doc=doc,
|
||||||
|
gtfile=str(gt_path) + ".json",
|
||||||
|
generate=GENERATE,
|
||||||
|
), "export to json"
|
||||||
|
|
||||||
|
|
||||||
def test_e2e_invalid_csv_conversions():
|
def test_e2e_invalid_csv_conversions():
|
||||||
|
@ -91,4 +91,8 @@ def test_e2e_docx_conversions():
|
|||||||
|
|
||||||
if docx_path.name == "word_tables.docx":
|
if docx_path.name == "word_tables.docx":
|
||||||
pred_html: str = doc.export_to_html()
|
pred_html: str = doc.export_to_html()
|
||||||
assert verify_export(pred_html, str(gt_path) + ".html"), "export to html"
|
assert verify_export(
|
||||||
|
pred_text=pred_html,
|
||||||
|
gtfile=str(gt_path) + ".html",
|
||||||
|
generate=GENERATE,
|
||||||
|
), "export to html"
|
||||||
|
@ -179,7 +179,7 @@ def test_guess_format(tmp_path):
|
|||||||
# Non-Docling JSON
|
# Non-Docling JSON
|
||||||
# TODO: Docling JSON is currently the single supported JSON flavor and the pipeline
|
# TODO: Docling JSON is currently the single supported JSON flavor and the pipeline
|
||||||
# will try to validate *any* JSON (based on suffix/MIME) as Docling JSON; proper
|
# will try to validate *any* JSON (based on suffix/MIME) as Docling JSON; proper
|
||||||
# disambiguation seen as part of https://github.com/DS4SD/docling/issues/802
|
# disambiguation seen as part of https://github.com/docling-project/docling/issues/802
|
||||||
test_str = "{}"
|
test_str = "{}"
|
||||||
stream = DocumentStream(name="test.json", stream=BytesIO(f"{test_str}".encode()))
|
stream = DocumentStream(name="test.json", stream=BytesIO(f"{test_str}".encode()))
|
||||||
assert dci._guess_format(stream) == InputFormat.JSON_DOCLING
|
assert dci._guess_format(stream) == InputFormat.JSON_DOCLING
|
||||||
|