mirror of
https://github.com/DS4SD/docling.git
synced 2025-12-08 20:58:11 +00:00
fix(docx): merged table cells not properly converted (#857)
* fix(docx): merged cells not properly converted

Fix conversion issue of merged cells in Word tables leading to repeated text.
Simplify Word table conversion code.
Add docx file with several table formats for regression tests.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: add type hinting to docx backend

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
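The "repeated text" symptom named in the commit message can be illustrated with a small sketch. It assumes a table reader that hands back the same cell object once for every grid column a merged cell covers; that behaviour, the dict-based cell layout, and the helper name `collapse_row` are assumptions made for illustration and are not taken from the docling backend code, which this page does not show.

```python
def collapse_row(cells):
    """Collapse consecutive references to the same cell object.

    `cells` is a list of dicts with a "text" key (hypothetical layout).
    Returns a list of (text, colspan) tuples, one per distinct cell.
    """
    collapsed = []
    previous = None
    for cell in cells:
        if collapsed and cell is previous:
            # Same object seen again: widen the span instead of repeating the text.
            text, span = collapsed[-1]
            collapsed[-1] = (text, span + 1)
        else:
            collapsed.append((cell["text"], 1))
        previous = cell
    return collapsed


# A row shaped like "Cell 1.0 | Merged Cell 1.1 1.2 (two columns wide)".
merged = {"text": "Merged Cell 1.1 1.2"}
row = [{"text": "Cell 1.0"}, merged, merged]
print(collapse_row(row))
```

Comparing by identity (`is`) rather than by text keeps two neighbouring cells that merely share the same content from being fused by mistake.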
committed by GitHub
parent eff16b62cc
commit 0cd81a8122
75
tests/data/groundtruth/docling_v2/word_tables.docx.html
Normal file
@@ -0,0 +1,75 @@
<!DOCTYPE html>
<html lang="en">
<head>
<link rel="icon" type="image/png"
href="https://ds4sd.github.io/docling/assets/logo.png"/>
<meta charset="UTF-8">
<title>
Powered by Docling
</title>
<style>
html {
background-color: LightGray;
}
body {
margin: 0 auto;
width:800px;
padding: 30px;
background-color: White;
font-family: Arial, sans-serif;
box-shadow: 10px 10px 10px grey;
}
figure{
display: block;
width: 100%;
margin: 0px;
margin-top: 10px;
margin-bottom: 10px;
}
img {
display: block;
margin: auto;
margin-top: 10px;
margin-bottom: 10px;
max-width: 640px;
max-height: 640px;
}
table {
min-width:500px;
background-color: White;
border-collapse: collapse;
cell-padding: 5px;
margin: auto;
margin-top: 10px;
margin-bottom: 10px;
}
th, td {
border: 1px solid black;
padding: 8px;
}
th {
font-weight: bold;
}
table tr:nth-child(even) td{
background-color: LightGray;
}
</style>
</head>
<h2>Test with tables</h2>
<p>A uniform table</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td></tr><tr><td>Cell 1.0</td><td>Cell 1.1</td><td>Cell 1.2</td></tr><tr><td>Cell 2.0</td><td>Cell 2.1</td><td>Cell 2.2</td></tr></tbody></table>
<p></p>
<p>A non-uniform table with horizontal spans</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td></tr><tr><td>Cell 1.0</td><td colspan="2">Merged Cell 1.1 1.2</td></tr><tr><td>Cell 2.0</td><td colspan="2">Merged Cell 2.1 2.2</td></tr></tbody></table>
<p></p>
<p>A non-uniform table with horizontal spans in inner columns</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td><td>Header 0.3</td></tr><tr><td>Cell 1.0</td><td colspan="2">Merged Cell 1.1 1.2</td><td>Cell 1.3</td></tr><tr><td>Cell 2.0</td><td colspan="2">Merged Cell 2.1 2.2</td><td>Cell 2.3</td></tr></tbody></table>
<p></p>
<p>A non-uniform table with vertical spans</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td></tr><tr><td>Cell 1.0</td><td rowspan="2">Merged Cell 1.1 2.1</td><td>Cell 1.2</td></tr><tr><td>Cell 2.0</td><td>Cell 2.2</td></tr><tr><td>Cell 3.0</td><td rowspan="2">Merged Cell 3.1 4.1</td><td>Cell 3.2</td></tr><tr><td>Cell 4.0</td><td>Cell 4.2</td></tr></tbody></table>
<p></p>
<p>A non-uniform table with all kinds of spans and empty cells</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td><td></td><td></td></tr><tr><td>Cell 1.0</td><td rowspan="2">Merged Cell 1.1 2.1</td><td>Cell 1.2</td><td></td><td></td></tr><tr><td>Cell 2.0</td><td>Cell 2.2</td><td></td><td></td></tr><tr><td>Cell 3.0</td><td rowspan="2">Merged Cell 3.1 4.1</td><td>Cell 3.2</td><td rowspan="3"></td><td></td></tr><tr><td>Cell 4.0</td><td>Cell 4.2</td><td rowspan="2">Merged Cell 4.4 5.4</td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td></td><td></td></tr><tr><td colspan="5"></td></tr><tr><td></td><td></td><td></td><td></td><td>Cell 8.4</td></tr></tbody></table>
<p></p>
<p></p>
</html>
19
tests/data/groundtruth/docling_v2/word_tables.docx.itxt
Normal file
@@ -0,0 +1,19 @@
item-0 at level 0: unspecified: group _root_
item-1 at level 1: section: group header-0
item-2 at level 2: section_header: Test with tables
item-3 at level 3: paragraph: A uniform table
item-4 at level 3: table with [3x3]
item-5 at level 3: paragraph:
item-6 at level 3: paragraph: A non-uniform table with horizontal spans
item-7 at level 3: table with [3x3]
item-8 at level 3: paragraph:
item-9 at level 3: paragraph: A non-uniform table with horizontal spans in inner columns
item-10 at level 3: table with [3x4]
item-11 at level 3: paragraph:
item-12 at level 3: paragraph: A non-uniform table with vertical spans
item-13 at level 3: table with [5x3]
item-14 at level 3: paragraph:
item-15 at level 3: paragraph: A non-uniform table with all kinds of spans and empty cells
item-16 at level 3: table with [9x5]
item-17 at level 3: paragraph:
item-18 at level 3: paragraph:
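The `[rows x cols]` sizes in the listing above can be cross-checked against the HTML ground truth, where merged cells appear as `colspan` and `rowspan` attributes. The following is a minimal sketch of such a check using only the standard library; the names `TableShape` and `grid_shape` are hypothetical and are not part of docling or its test suite.

```python
from html.parser import HTMLParser


class TableShape(HTMLParser):
    """Record (rowspan, colspan) for every <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            attributes = dict(attrs)
            rowspan = int(attributes.get("rowspan", 1))
            colspan = int(attributes.get("colspan", 1))
            self.rows[-1].append((rowspan, colspan))


def grid_shape(table_html):
    """Return (n_rows, n_cols) of the grid covered by a single HTML table."""
    parser = TableShape()
    parser.feed(table_html)
    occupied = set()
    for r, cells in enumerate(parser.rows):
        c = 0
        for rowspan, colspan in cells:
            # Skip grid slots already claimed by a rowspan from an earlier row.
            while (r, c) in occupied:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, c + dc))
            c += colspan
    n_rows = max(r for r, _ in occupied) + 1
    n_cols = max(c for _, c in occupied) + 1
    return n_rows, n_cols


# The "horizontal spans" table from the HTML ground truth.
horizontal = (
    "<table><tbody>"
    "<tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td></tr>"
    '<tr><td>Cell 1.0</td><td colspan="2">Merged Cell 1.1 1.2</td></tr>'
    '<tr><td>Cell 2.0</td><td colspan="2">Merged Cell 2.1 2.2</td></tr>'
    "</tbody></table>"
)
print(grid_shape(horizontal))  # (3, 3)
```

The sketch handles one table per call and assumes well-formed rows; nested tables would need extra bookkeeping.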
2356
tests/data/groundtruth/docling_v2/word_tables.docx.json
Normal file
File diff suppressed because it is too large
44
tests/data/groundtruth/docling_v2/word_tables.docx.md
Normal file
@@ -0,0 +1,44 @@
## Test with tables

A uniform table

| Header 0.0 | Header 0.1 | Header 0.2 |
|--------------|--------------|--------------|
| Cell 1.0 | Cell 1.1 | Cell 1.2 |
| Cell 2.0 | Cell 2.1 | Cell 2.2 |

A non-uniform table with horizontal spans

| Header 0.0 | Header 0.1 | Header 0.2 |
|--------------|---------------------|---------------------|
| Cell 1.0 | Merged Cell 1.1 1.2 | Merged Cell 1.1 1.2 |
| Cell 2.0 | Merged Cell 2.1 2.2 | Merged Cell 2.1 2.2 |

A non-uniform table with horizontal spans in inner columns

| Header 0.0 | Header 0.1 | Header 0.2 | Header 0.3 |
|--------------|---------------------|---------------------|--------------|
| Cell 1.0 | Merged Cell 1.1 1.2 | Merged Cell 1.1 1.2 | Cell 1.3 |
| Cell 2.0 | Merged Cell 2.1 2.2 | Merged Cell 2.1 2.2 | Cell 2.3 |

A non-uniform table with vertical spans

| Header 0.0 | Header 0.1 | Header 0.2 |
|--------------|---------------------|--------------|
| Cell 1.0 | Merged Cell 1.1 2.1 | Cell 1.2 |
| Cell 2.0 | Merged Cell 1.1 2.1 | Cell 2.2 |
| Cell 3.0 | Merged Cell 3.1 4.1 | Cell 3.2 |
| Cell 4.0 | Merged Cell 3.1 4.1 | Cell 4.2 |

A non-uniform table with all kinds of spans and empty cells

| Header 0.0 | Header 0.1 | Header 0.2 | | |
|--------------|---------------------|--------------|----|---------------------|
| Cell 1.0 | Merged Cell 1.1 2.1 | Cell 1.2 | | |
| Cell 2.0 | Merged Cell 1.1 2.1 | Cell 2.2 | | |
| Cell 3.0 | Merged Cell 3.1 4.1 | Cell 3.2 | | |
| Cell 4.0 | Merged Cell 3.1 4.1 | Cell 4.2 | | Merged Cell 4.4 5.4 |
| | | | | Merged Cell 4.4 5.4 |
| | | | | |
| | | | | |
| | | | | Cell 8.4 |
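Markdown pipe tables have no span syntax, so the Markdown ground truth above shows the text of a merged cell in every grid position it covers, whereas the HTML ground truth emits it once with `colspan` or `rowspan`. A minimal sketch of that expansion follows; the `(text, rowspan, colspan)` tuple layout and the function name `expand_spans` are assumptions for illustration, not the exporter docling uses.

```python
def expand_spans(rows):
    """Expand span-annotated cells into a rectangular grid of strings.

    `rows` is a list of rows; each row is a list of
    (text, rowspan, colspan) tuples in document order (hypothetical layout).
    """
    grid = {}
    for r, cells in enumerate(rows):
        c = 0
        for text, rowspan, colspan in cells:
            # Skip slots already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# First three rows of the "vertical spans" table.
rows = [
    [("Header 0.0", 1, 1), ("Header 0.1", 1, 1), ("Header 0.2", 1, 1)],
    [("Cell 1.0", 1, 1), ("Merged Cell 1.1 2.1", 2, 1), ("Cell 1.2", 1, 1)],
    [("Cell 2.0", 1, 1), ("Cell 2.2", 1, 1)],
]
for line in expand_spans(rows):
    print("| " + " | ".join(line) + " |")
```

Repeating the text keeps every Markdown row the same width, at the cost of losing the information that the positions belong to one cell.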