Commit Graph

14 Commits

Author SHA1 Message Date
Cesar Berrospi Ramis
054c4a634d fix(docx): parse page headers and footers (#2599)
* fix(docx): parse page headers and footers

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): rename _add_header to _add_heading

To avoid confusion, rename the _add_header function to _add_heading,
since the function adds section headings.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): extend the page header and footer parsing to any content type

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): fix _add_header_footer function

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-10 16:10:12 +01:00
Cesar Berrospi Ramis
ef623ffcee fix(docx): slow table parsing (#2553)
* chore(docx): remove unnecessary import

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): simplify parsing of simple tables

Simplify the parsing of tables with just text (no rich cells).
Move nested function group_cell_elements out of _handle_tables for readability.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): reuse method for finding inline pictures

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): format strikethrough text

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(docx): use fixtures to avoid converting same file multiple times

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): remove unnecessary argument docx_obj in functions

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(docx): add test for rich table cells

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): small improvements in backend and its unit tests

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): parse superscript and subscript formatted text

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-06 05:25:53 +01:00
Rafael Teixeira de Lima
16829939cf feat(docx): Process drawingml objects in docx (#2453)
* Export of DrawingML figures into docling document

* Adding libreoffice env var and libreoffice to checks image

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* DCO Remediation Commit for Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

I, Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>, hereby add my Signed-off-by to this commit: 9518fffcad

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Enforcing apt get update

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Only display drawingml warning once per document

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* add util to test libreoffice and exclude files from test when not found

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* check libreoffice only once

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Only initialise converter if needed

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-15 10:58:08 +02:00
Rafael Teixeira de Lima
0b83609531 fix(docx): Adding plain latex equations to table cells (#1986)
* Adding plain latex equations to table cells

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Adding test files

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-07-24 11:02:24 +02:00
mkrssg
1350a8d3e5 fix(msword_backend): Identify text in the same line after an image #1425 (#1610)
* fix(msword_backend): Identify text in the same line after an image / image anchor #1425

Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>

* test: add test file and case for fix(msword_backend): Identify text in the same line after an image / image anchor #1425

Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>

* test: added groundtruth test files for fix(msword_backend): Identify text in the same line after an image / image anchor #1425

Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>

* fix: extraneous empty paragraphs for test files

Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>

---------

Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
Co-authored-by: Michael Krissgau <michael.krissgau@ibm.com>
2025-06-20 10:55:30 +02:00
Panos Vagenas
61d0d6c755 test: mark flaky test (#1698)
* test: cleanse Word test file

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* mark textbox file test as flaky

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* fix path usage

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-06-03 13:13:44 +02:00
AndrewTsai0406
12a0e64892 feat: add textbox content extraction in msword_backend (#1538)
* feat: add textbox content extraction in msword_backend

Signed-off-by: Andrew <tsai247365@gmail.com>

* feat: add textbox content extraction in msword_backend

Signed-off-by: Andrew <tsai247365@gmail.com>

* feat: add textbox content extraction in msword_backend

Signed-off-by: Andrew <tsai247365@gmail.com>

---------

Signed-off-by: Andrew <tsai247365@gmail.com>
2025-05-19 15:01:36 +02:00
Simon Jégou
bfcab3d677 feat(docx): add text formatting and hyperlink support (#630)
* feat: Enable markdown text formatting for docx

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix imports

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use Formatting

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle hyperlink

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle formatting properly for DocItemLabel.PARAGRAPH

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline group

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle bullet lists

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run black and mypy

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle header and footer

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline_fmt everywhere

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run precommit

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Address feedback

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix add_list_item

Signed-off-by: SimJeg <sjegou@nvidia.com>

* fix minor bugs, mark helper methods internal

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2025-04-03 15:11:50 +02:00
Rafael Teixeira de Lima
6eb718f849 feat: equations to latex in MSWord backend (with inline groups) (#1114)
* Equation groups

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix: Proper handling of orphan IDs in layout postprocessing (#1118)

* Fix the handling of orphan IDs in layout postprocessing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* chore: bump version to 2.25.2 [skip ci]

* docs: add description of DOCLING_ARTIFACTS_PATH env var (#1124)

add env var in docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix(CLI): fix help message for abort options (#1130)

fix help message

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* perf: New revision code formula model and document picture classifier (#1140)

* new version code formula model

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* new version document picture classifier

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* new code formula model

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* restored original code formula test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

---------

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* feat: Use new TableFormer model weights and default to accurate model version (#1100)

* feat: New tableformer model weights [WIP]

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Updated TF version

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated tests, after merging with Main, Switched to Accurate TF model by default

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* chore: bump version to 2.26.0 [skip ci]

* fix: Pass tests, update docling-core to 2.22.0 (#1150)

fix: update docling-core to 2.22.0

Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* Updating content hash

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-03-13 15:12:22 +01:00
Cesar Berrospi Ramis
0cd81a8122 fix(docx): merged table cells not properly converted (#857)
* fix(docx): merged cells not properly converted

Fix conversion issue of merged cells in Word tables leading to repeated text.
Simplify Word table conversion code.
Add docx file with several table formats for regression tests.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: add type hinting to docx backend

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-02-03 10:20:03 +01:00
Maxim Lysak
2c037ae62e fix: Fixed docx import with headers that are also lists (#842)
* Fix for docx when headers are also lists, now recorded as appropriate headers and subheaders, unit test included

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Update docling/backend/msword_backend.py

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>

* Update docling/backend/msword_backend.py

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-01-31 10:51:21 +01:00
Maxim Lysak
d0a1180478 fix: Fixes for wordx (#432)
* fixes for referencing drawing blip in wordx

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated lxml dependency version

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-26 14:44:43 +01:00
Maxim Lysak
fb8ba861e2 fix: Handling of single-cell tables in DOCX backend (#314)
* Handling of single-cell tables in DOCX backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* restored try-catch in table handling

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* cleaned

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* process the content of a single-cell table as if it's just part of the body

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added example of tricky 1-cell table docx

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-12 15:20:55 +01:00
Peter W. J. Staar
f542460af3 fix: fix duplicate title and heading + add e2e tests for html and docx (#186)
* add real e2e tests for html and docx

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the output of itxt

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the text

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the tests (2)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the examples (1)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the output of the test

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the tests, moved the ground-truth

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* moved the ground-truth data

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the html tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* restructure title fix (#187)

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-30 13:14:56 +01:00