* fix(docx): parse page headers and footers
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): rename _add_header with _add_heading
To avoid confusion, rename _add_header function name with _add_heading
since the function is about adding section headings.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): extend the page header and footer parsing to any content type
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): fix _add_header_footer function
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): remove unnecessary import
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): simplify parsing of simple tables
Simplify the parsing of tables with just text (no rich cells).
Move nested function group_cell_elements out of _handle_tables for readability.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): reuse method for finding inline pictures
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): format strikethrough text
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(docx): use fixtures to avoid converting same file multiple times
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): remove unnecessary argument docx_obj in functions
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(docx): add test for rich table cells
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): small improvements in backend and its unit tests
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): parse superscript and subscript formatted text
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* Export of DrawingML figures into docling document
* Adding libreoffice env var and libreoffice to checks image
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* DCO Remediation Commit for Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
I, Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>, hereby add my Signed-off-by to this commit: 9518fffcad
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Enforcing apt get update
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Only display drawingml warning once per document
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* add util to test libreoffice and exclude files from test when not found
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* check libreoffice only once
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Only initialise converter if needed
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
* fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
* test: add test file and case for fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
* test: added groundtruth test files for fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
* fix: extraneous empty paragraphs for test files
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
---------
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
Co-authored-by: Michael Krissgau <michael.krissgau@ibm.com>
* fix(docx): merged cells not properly converted
Fix conversion issue of merged cells in Word tables leading to repeated text.
Simplify Word table conversion code.
Add docx file with several table formats for regression tests.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore: add type hinting to docx backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* fixes for referencing drawing blip in wordx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Updated lxml dependency version
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
* Handling of single-cell tables in DOCX backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* returned try-catch on tables handling
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* cleaned
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* proceed processing the content of single cell table as if its just part of the body
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added example of trickly 1 cell table docx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>