fix(docx): slow table parsing (#2553)

* chore(docx): remove unnecessary import

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): simplify parsing of simple tables

Simplify the parsing of tables with just text (no rich cells).
Move nested function group_cell_elements out of _handle_tables for readability.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): reuse method for finding inline pictures

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): format strikethrough text

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(docx): use fixtures to avoid converting same file multiple times

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): remove unnecessary argument docx_obj in functions

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(docx): add test for rich table cells

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): small improvements in backend and its unit tests

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): parse superscript and subscript formatted text

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Cesar Berrospi Ramis
2025-11-06 05:25:53 +01:00
committed by GitHub
parent 0ba8d5d9e3
commit ef623ffcee
6 changed files with 3366 additions and 218 deletions
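
The commit message describes simplifying the parsing of tables whose cells hold only text, while still supporting rich cells. The idea can be sketched as below. This is a minimal illustration, not the backend's actual code: `Run`, `Cell`, `is_rich_cell`, `render_run` and `parse_cell` are hypothetical names, and the criteria for what counts as a "rich" cell are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Run:
    """Hypothetical stand-in for a formatted text run inside a cell."""

    text: str
    bold: bool = False
    italic: bool = False


@dataclass
class Cell:
    """Hypothetical stand-in for a table cell: paragraphs made of runs."""

    paragraphs: list[list[Run]] = field(default_factory=list)
    has_nested_table: bool = False
    has_picture: bool = False


def is_rich_cell(cell: Cell) -> bool:
    """Assumed rule: a cell is rich if plain text cannot represent it."""
    if cell.has_nested_table or cell.has_picture:
        return True
    if len(cell.paragraphs) > 1:
        return True
    return any(run.bold or run.italic for par in cell.paragraphs for run in par)


def render_run(run: Run) -> str:
    """Wrap a run in Markdown emphasis markers according to its flags."""
    text = run.text
    if run.italic:
        text = f"*{text}*"
    if run.bold:
        text = f"**{text}**"
    return text


def parse_cell(cell: Cell) -> str:
    """Take the cheap path for plain cells, the formatted path otherwise."""
    if not is_rich_cell(cell):
        # Fast path: concatenate the raw text, no per-run formatting work.
        return " ".join("".join(r.text for r in par) for par in cell.paragraphs)
    # Slow path: render every run with its formatting.
    return " ".join("".join(render_run(r) for r in par) for par in cell.paragraphs)


simple = Cell(paragraphs=[[Run("Only text")]])
rich = Cell(
    paragraphs=[
        [
            Run("Simple cell with "),
            Run("bold", bold=True),
            Run(" and "),
            Run("italic", italic=True),
            Run(" text"),
        ]
    ]
)
print(parse_cell(simple))
print(parse_cell(rich))
```

Deciding per cell keeps the common case (text-only tables) cheap, and only cells that need it pay for the formatted rendering.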

@@ -0,0 +1,25 @@
### Table with rich cells
| Column A | Column B |
|------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
| This is a list: - A First - A Second - A Third | This is a formatted list: - B **First** - B *Second* - B Third |
| First Paragraph Second Paragraph Third paragraph before a numbered list 1. Number one 2. Number two 3. Number three | This is simple text with **bold** , ~~strikethrough~~ and *italic* formatting with x 2 and H 2 O |
| This is a paragraph This is another paragraph | |
### Table with nested table
Before table
| Column A | Column B |
|----------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
| Simple cell upper left | Simple cell with **bold** and *italic* text |
| | A | B | C | |----------|--------|------------| | *Cell 1* | Cell 2 | **Cell 3** | | Rich cell A nested table | A | B | C | |------------|--------------|--------| | ~~Cell 1~~ | ***Cell 2*** | Cell 3 | |
After table with **bold** , underline , ~~strikethrough~~ , and *italic* formatting
### Table with pictures
| Column A | Column B |
|----------------------------------|----------------|
| Only text | <!-- image --> |
| Text and picture <!-- image --> | |