docling/tests/data/groundtruth/docling_v2/docx_rich_cells.docx.md
Cesar Berrospi Ramis ef623ffcee fix(docx): slow table parsing (#2553)
* chore(docx): remove unnecessary import

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): simplify parsing of simple tables

Simplify the parsing of tables with just text (no rich cells).
Move nested function group_cell_elements out of _handle_tables for readability.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): reuse method for finding inline pictures

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): format strikethrough text

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(docx): use fixtures to avoid converting same file multiple times

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): remove unnecessary argument docx_obj in functions

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(docx): add test for rich table cells

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): small improvements in backend and its unit tests

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): parse superscript and subscript formatted text

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-06 05:25:53 +01:00
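
The commit message says that parsing was simplified for tables containing only text (no rich cells). One way to read that is as a fast path: check once whether any cell carries rich content, and if none does, collect the cell text directly. The sketch below is a minimal illustration of that idea only. `Cell`, `is_simple_table`, `render_rich_cell`, and `parse_table` are hypothetical names, not docling's actual backend code.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """Hypothetical stand-in for a DOCX table cell (not a docling type)."""

    text: str
    # Rich content such as list items, pictures, or nested tables, if any.
    rich_children: list = field(default_factory=list)


def is_simple_table(rows):
    """Return True when no cell in the table carries rich content."""
    return all(not cell.rich_children for row in rows for cell in row)


def render_rich_cell(cell):
    """Slow path (assumed): join the cell text with its rich children."""
    return " ".join([cell.text, *map(str, cell.rich_children)]).strip()


def parse_table(rows):
    """Return the table as a grid of strings, one list per row."""
    if is_simple_table(rows):
        # Fast path: plain text only, so no per-cell element walk is needed.
        return [[cell.text for cell in row] for row in rows]
    # Slow path: every cell is rendered together with its rich content.
    return [[render_rich_cell(cell) for cell in row] for row in rows]
```

For example, `parse_table([[Cell("Column A"), Cell("Column B")]])` takes the fast path, while a table holding a cell such as `Cell("This is a list:", ["- A First"])` falls back to the per-cell rendering.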


Table with rich cells

| Column A | Column B |
| --- | --- |
| This is a list: - A First - A Second - A Third | This is a formatted list: - B First - B Second - B Third |
| First Paragraph Second Paragraph Third paragraph before a numbered list 1. Number one 2. Number two 3. Number three | This is simple text with bold , strikethrough and italic formatting with x 2 and H 2 O |
| This is a paragraph | This is another paragraph |

Table with nested table

Before table

| Column A | Column B |
| --- | --- |
| Simple cell upper left | Simple cell with bold and italic text |
A

After table with bold , underline , strikethrough , and italic formatting

Table with pictures

Column A Column B
Only text
Text and picture