fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903)

fix: Support for RTL programmatic documents
fix(parser): detect and handle rotated pages
fix(parser): fix bug causing duplicated text
fix(formula): improve stopping criteria
chore: update lock file
fix: temporary constrain beautifulsoup


* switch to code formula model v1.0.1 and new test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* switch to code formula model v1.0.1 and new test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* cleaned up the data folder in the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* switch to code formula model v1.0.1 and new test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added three test-files for right-to-left

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fix black

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added new gt for test_e2e_conversion

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added new gt for test_e2e_conversion

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* Add code to expose text direction of cell

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* new test file

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix mypy reports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix example filepaths

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add test data results

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin wheel of latest docling-parse release

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use latest docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove debugging code

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix path to files in example

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Revert unwanted RTL additions

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix test data paths in examples

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
This commit is contained in:
Michele Dolfi
2025-02-07 08:43:31 +01:00
committed by GitHub
parent ed74fe2ec0
commit 9114ada7bc
91 changed files with 620 additions and 313 deletions

View File

@@ -9,7 +9,7 @@ from docling.datamodel.settings import DocumentLimits
def test_in_doc_from_valid_path():
test_doc_path = Path("./tests/data/2206.01062.pdf")
test_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
doc = _make_input_doc(test_doc_path)
assert doc.valid == True
@@ -24,7 +24,7 @@ def test_in_doc_from_invalid_path():
def test_in_doc_from_valid_buf():
buf = BytesIO(Path("./tests/data/2206.01062.pdf").open("rb").read())
buf = BytesIO(Path("./tests/data/pdf/2206.01062.pdf").open("rb").read())
stream = DocumentStream(name="my_doc.pdf", stream=buf)
doc = _make_input_doc_from_stream(stream)
@@ -41,7 +41,7 @@ def test_in_doc_from_invalid_buf():
def test_in_doc_with_page_range():
test_doc_path = Path("./tests/data/2206.01062.pdf")
test_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
limits = DocumentLimits()
limits.page_range = (1, 10)
@@ -81,10 +81,10 @@ def test_guess_format(tmp_path):
temp_dir.mkdir()
# Valid PDF
buf = BytesIO(Path("./tests/data/2206.01062.pdf").open("rb").read())
buf = BytesIO(Path("./tests/data/pdf/2206.01062.pdf").open("rb").read())
stream = DocumentStream(name="my_doc.pdf", stream=buf)
assert dci._guess_format(stream) == InputFormat.PDF
doc_path = Path("./tests/data/2206.01062.pdf")
doc_path = Path("./tests/data/pdf/2206.01062.pdf")
assert dci._guess_format(doc_path) == InputFormat.PDF
# Valid MS Office