Commit Graph

5 Commits

Author SHA1 Message Date
Cesar Berrospi Ramis
0cd81a8122
fix(docx): merged table cells not properly converted (#857)
* fix(docx): merged cells not properly converted

Fix conversion issue of merged cells in Word tables leading to repeated text.
Simplify Word table conversion code.
Add docx file with several table formats for regression tests.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: add type hinting to docx backend

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-02-03 10:20:03 +01:00
Maxim Lysak
2c037ae62e
fix: Fixed docx import with headers that are also lists (#842)
* Fix for docx when headers are also lists, now recorded as appropriate headers and subheaders, unit test included

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Update docling/backend/msword_backend.py

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>

* Update docling/backend/msword_backend.py

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-01-31 10:51:21 +01:00
Maxim Lysak
d0a1180478
fix: Fixes for wordx (#432)
* fixes for referencing drawing blip in wordx

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated lxml dependency version

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-26 14:44:43 +01:00
Maxim Lysak
fb8ba861e2
fix: Handling of single-cell tables in DOCX backend (#314)
* Handling of single-cell tables in DOCX backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* returned try-catch on tables handling

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* cleaned

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* proceed processing the content of single cell table as if its just part of the body

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added example of trickly 1 cell table docx

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-12 15:20:55 +01:00
Peter W. J. Staar
f542460af3
fix: fix duplicate title and heading + add e2e tests for html and docx (#186)
* add real e2e tests for html and docx

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the output of itxt

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the text

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the tests (2)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the examples (1)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the output of the test

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the tests, moved the ground-truth

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* moved the ground-truth data

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the html tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* restructure title fix (#187)

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-30 13:14:56 +01:00