fix(docx): merged table cells not properly converted (#857)

* fix(docx): merged cells not properly converted

Fix conversion issue of merged cells in Word tables leading to repeated text.
Simplify Word table conversion code.
Add docx file with several table formats for regression tests.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: add type hinting to docx backend

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
This commit is contained in:
Cesar Berrospi Ramis
2025-02-03 10:20:03 +01:00
committed by GitHub
parent eff16b62cc
commit 0cd81a8122
8 changed files with 2713 additions and 122 deletions

Binary file not shown.

View File

@@ -0,0 +1,75 @@
<!DOCTYPE html>
<html lang="en">
<head>
<link rel="icon" type="image/png"
href="https://ds4sd.github.io/docling/assets/logo.png"/>
<meta charset="UTF-8">
<title>
Powered by Docling
</title>
<style>
html {
background-color: LightGray;
}
body {
margin: 0 auto;
width:800px;
padding: 30px;
background-color: White;
font-family: Arial, sans-serif;
box-shadow: 10px 10px 10px grey;
}
figure{
display: block;
width: 100%;
margin: 0px;
margin-top: 10px;
margin-bottom: 10px;
}
img {
display: block;
margin: auto;
margin-top: 10px;
margin-bottom: 10px;
max-width: 640px;
max-height: 640px;
}
table {
min-width:500px;
background-color: White;
border-collapse: collapse;
cell-padding: 5px;
margin: auto;
margin-top: 10px;
margin-bottom: 10px;
}
th, td {
border: 1px solid black;
padding: 8px;
}
th {
font-weight: bold;
}
table tr:nth-child(even) td{
background-color: LightGray;
}
</style>
</head>
<h2>Test with tables</h2>
<p>A uniform table</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td></tr><tr><td>Cell 1.0</td><td>Cell 1.1</td><td>Cell 1.2</td></tr><tr><td>Cell 2.0</td><td>Cell 2.1</td><td>Cell 2.2</td></tr></tbody></table>
<p></p>
<p>A non-uniform table with horizontal spans</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td></tr><tr><td>Cell 1.0</td><td colspan="2">Merged Cell 1.1 1.2</td></tr><tr><td>Cell 2.0</td><td colspan="2">Merged Cell 2.1 2.2</td></tr></tbody></table>
<p></p>
<p>A non-uniform table with horizontal spans in inner columns</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td><td>Header 0.3</td></tr><tr><td>Cell 1.0</td><td colspan="2">Merged Cell 1.1 1.2</td><td>Cell 1.3</td></tr><tr><td>Cell 2.0</td><td colspan="2">Merged Cell 2.1 2.2</td><td>Cell 2.3</td></tr></tbody></table>
<p></p>
<p>A non-uniform table with vertical spans</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td></tr><tr><td>Cell 1.0</td><td rowspan="2">Merged Cell 1.1 2.1</td><td>Cell 1.2</td></tr><tr><td>Cell 2.0</td><td>Cell 2.2</td></tr><tr><td>Cell 3.0</td><td rowspan="2">Merged Cell 3.1 4.1</td><td>Cell 3.2</td></tr><tr><td>Cell 4.0</td><td>Cell 4.2</td></tr></tbody></table>
<p></p>
<p>A non-uniform table with all kinds of spans and empty cells</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td><td></td><td></td></tr><tr><td>Cell 1.0</td><td rowspan="2">Merged Cell 1.1 2.1</td><td>Cell 1.2</td><td></td><td></td></tr><tr><td>Cell 2.0</td><td>Cell 2.2</td><td></td><td></td></tr><tr><td>Cell 3.0</td><td rowspan="2">Merged Cell 3.1 4.1</td><td>Cell 3.2</td><td rowspan="3"></td><td></td></tr><tr><td>Cell 4.0</td><td>Cell 4.2</td><td rowspan="2">Merged Cell 4.4 5.4</td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td></td><td></td></tr><tr><td colspan="5"></td></tr><tr><td></td><td></td><td></td><td></td><td>Cell 8.4</td></tr></tbody></table>
<p></p>
<p></p>
</html>

View File

@@ -0,0 +1,19 @@
item-0 at level 0: unspecified: group _root_
item-1 at level 1: section: group header-0
item-2 at level 2: section_header: Test with tables
item-3 at level 3: paragraph: A uniform table
item-4 at level 3: table with [3x3]
item-5 at level 3: paragraph:
item-6 at level 3: paragraph: A non-uniform table with horizontal spans
item-7 at level 3: table with [3x3]
item-8 at level 3: paragraph:
item-9 at level 3: paragraph: A non-uniform table with horizontal spans in inner columns
item-10 at level 3: table with [3x4]
item-11 at level 3: paragraph:
item-12 at level 3: paragraph: A non-uniform table with vertical spans
item-13 at level 3: table with [5x3]
item-14 at level 3: paragraph:
item-15 at level 3: paragraph: A non-uniform table with all kinds of spans and empty cells
item-16 at level 3: table with [9x5]
item-17 at level 3: paragraph:
item-18 at level 3: paragraph:

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,44 @@
## Test with tables
A uniform table
| Header 0.0 | Header 0.1 | Header 0.2 |
|--------------|--------------|--------------|
| Cell 1.0 | Cell 1.1 | Cell 1.2 |
| Cell 2.0 | Cell 2.1 | Cell 2.2 |
A non-uniform table with horizontal spans
| Header 0.0 | Header 0.1 | Header 0.2 |
|--------------|---------------------|---------------------|
| Cell 1.0 | Merged Cell 1.1 1.2 | Merged Cell 1.1 1.2 |
| Cell 2.0 | Merged Cell 2.1 2.2 | Merged Cell 2.1 2.2 |
A non-uniform table with horizontal spans in inner columns
| Header 0.0 | Header 0.1 | Header 0.2 | Header 0.3 |
|--------------|---------------------|---------------------|--------------|
| Cell 1.0 | Merged Cell 1.1 1.2 | Merged Cell 1.1 1.2 | Cell 1.3 |
| Cell 2.0 | Merged Cell 2.1 2.2 | Merged Cell 2.1 2.2 | Cell 2.3 |
A non-uniform table with vertical spans
| Header 0.0 | Header 0.1 | Header 0.2 |
|--------------|---------------------|--------------|
| Cell 1.0 | Merged Cell 1.1 2.1 | Cell 1.2 |
| Cell 2.0 | Merged Cell 1.1 2.1 | Cell 2.2 |
| Cell 3.0 | Merged Cell 3.1 4.1 | Cell 3.2 |
| Cell 4.0 | Merged Cell 3.1 4.1 | Cell 4.2 |
A non-uniform table with all kinds of spans and empty cells
| Header 0.0 | Header 0.1 | Header 0.2 | | |
|--------------|---------------------|--------------|----|---------------------|
| Cell 1.0 | Merged Cell 1.1 2.1 | Cell 1.2 | | |
| Cell 2.0 | Merged Cell 1.1 2.1 | Cell 2.2 | | |
| Cell 3.0 | Merged Cell 3.1 4.1 | Cell 3.2 | | |
| Cell 4.0 | Merged Cell 3.1 4.1 | Cell 4.2 | | Merged Cell 4.4 5.4 |
| | | | | Merged Cell 4.4 5.4 |
| | | | | |
| | | | | |
| | | | | Cell 8.4 |