mirror of https://github.com/DS4SD/docling.git
synced 2025-12-09 13:18:24 +00:00
fix(docx): slow table parsing (#2553)
* chore(docx): remove unnecessary import

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): simplify parsing of simple tables

Simplify the parsing of tables with just text (no rich cells).
Move nested function group_cell_elements out of _handle_tables for readability.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): reuse method for finding inline pictures

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): format strikethrough text

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(docx): use fixtures to avoid converting same file multiple times

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): remove unnecessary argument docx_obj in functions

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(docx): add test for rich table cells

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): small improvements in backend and its unit tests

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): parse superscript and subscript formatted text

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
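The commit message distinguishes tables "with just text" from tables with rich cells. A minimal sketch of how such a split could look on the raw WordprocessingML is given below. It is not the docling implementation: the names `is_rich_cell`, `plain_text` and `RICH_MARKERS`, and the criteria for what counts as "rich", are assumptions made for illustration. Only the `w:` element names come from the DOCX format itself.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by .docx files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical criteria: any of these descendants makes a cell "rich".
# Nested table, picture, list numbering, bold, italic, strikethrough,
# superscript/subscript. A real check would also honour w:val="0".
RICH_MARKERS = (
    "w:tbl", "w:drawing", "w:numPr",
    "w:b", "w:i", "w:strike", "w:vertAlign",
)


def is_rich_cell(tc: ET.Element) -> bool:
    """Return True if the cell (a w:tc element) needs the full parsing path."""
    if any(tc.find(f".//{tag}", NS) is not None for tag in RICH_MARKERS):
        return True
    # Assumption: several non-empty paragraphs also count as rich content.
    paragraphs = [
        p for p in tc.findall("w:p", NS) if "".join(p.itertext()).strip()
    ]
    return len(paragraphs) > 1


def plain_text(tc: ET.Element) -> str:
    """Fast path for text-only cells: concatenate all w:t text nodes."""
    return "".join(t.text or "" for t in tc.iterfind(".//w:t", NS)).strip()
```

With a split like this, a table whose cells are all plain can be filled from `plain_text` alone, and only rich cells pay for the element-by-element walk.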
committed by GitHub
parent 0ba8d5d9e3
commit ef623ffcee
25
tests/data/groundtruth/docling_v2/docx_rich_cells.docx.md
vendored
Normal file
@@ -0,0 +1,25 @@
### Table with rich cells

| Column A | Column B |
|------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
| This is a list: - A First - A Second - A Third | This is a formatted list: - B **First** - B *Second* - B Third |
| First Paragraph Second Paragraph Third paragraph before a numbered list 1. Number one 2. Number two 3. Number three | This is simple text with **bold** , ~~strikethrough~~ and *italic* formatting with x 2 and H 2 O |
| This is a paragraph This is another paragraph | |

### Table with nested table

Before table

| Column A | Column B |
|----------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
| Simple cell upper left | Simple cell with **bold** and *italic* text |
| | A | B | C | |----------|--------|------------| | *Cell 1* | Cell 2 | **Cell 3** | | Rich cell A nested table | A | B | C | |------------|--------------|--------| | ~~Cell 1~~ | ***Cell 2*** | Cell 3 | |

After table with **bold** , underline , ~~strikethrough~~ , and *italic* formatting

### Table with pictures

| Column A | Column B |
|----------------------------------|----------------|
| Only text | <!-- image --> |
| Text and picture <!-- image --> | |
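
The ground truth above shows bold, italic and strikethrough runs serialized with `**`, `*` and `~~` markers, and bold italic as `***`. The sketch below reproduces that mapping on a stand-in data type. The `Run` dataclass and both function names are hypothetical and are not part of docling or python-docx. Joining the stripped pieces with single spaces is an assumption, though it would account for the space before the comma in the expected output.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical stand-in for a formatted text run in a table cell."""
    text: str
    bold: bool = False
    italic: bool = False
    strike: bool = False


def run_to_markdown(run: Run) -> str:
    """Wrap the run text in Markdown markers according to its flags."""
    text = run.text
    if not text.strip():
        return text  # never wrap whitespace-only runs
    if run.bold:
        text = f"**{text}**"
    if run.italic:
        text = f"*{text}*"  # bold + italic yields ***text***
    if run.strike:
        text = f"~~{text}~~"
    return text


def cell_to_markdown(runs: list[Run]) -> str:
    """Assumed joining rule: strip each piece, join with single spaces."""
    pieces = (run_to_markdown(r).strip() for r in runs)
    return " ".join(p for p in pieces if p)
```

For example, `cell_to_markdown` applied to the runs `"This is simple text with "`, bold `"bold"`, `", "`, struck `"strikethrough"`, `" and "`, italic `"italic"` gives the same prefix as the second row of the first table.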