fix(docx): merged table cells not properly converted (#857)

* fix(docx): merged cells not properly converted

Fix the conversion of merged cells in Word tables, which caused text to be repeated.
Simplify Word table conversion code.
Add docx file with several table formats for regression tests.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: add type hinting to docx backend

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Cesar Berrospi Ramis
2025-02-03 10:20:03 +01:00
committed by GitHub
parent eff16b62cc
commit 0cd81a8122
8 changed files with 2713 additions and 122 deletions
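
The repeated-text symptom named in the commit message can be illustrated with a minimal sketch. It is not the code from this commit: it assumes a table reader that reports a merged cell once per grid position it covers, and the function name `unique_cells` and the grid layout are hypothetical. Each distinct cell is emitted once, together with its row and column span.

```python
from typing import Hashable, List, Sequence, Tuple

CellSpan = Tuple[Hashable, int, int, int, int]


def unique_cells(grid: Sequence[Sequence[Hashable]]) -> List[CellSpan]:
    """Return (cell_id, row, col, row_span, col_span) once per distinct cell.

    `grid` is a hypothetical row-major layout in which the identifier of a
    merged cell is repeated at every grid position that the cell covers.
    """
    seen = set()
    result: List[CellSpan] = []
    for r, row in enumerate(grid):
        for c, cell_id in enumerate(row):
            if cell_id in seen:
                # Already emitted from its top-left position: skip the repeat.
                continue
            seen.add(cell_id)

            # Count how far the same cell extends to the right.
            col_span = 1
            while c + col_span < len(row) and row[c + col_span] == cell_id:
                col_span += 1

            # Count how far the same cell extends downwards.
            row_span = 1
            while (
                r + row_span < len(grid)
                and c < len(grid[r + row_span])
                and grid[r + row_span][c] == cell_id
            ):
                row_span += 1

            result.append((cell_id, r, c, row_span, col_span))
    return result


# "A" spans two columns and "B" spans two rows in this made-up table.
example_grid = [
    ["A", "A", "B"],
    ["C", "D", "B"],
]
example_cells = unique_cells(example_grid)
```

Without the `seen` check, the text of "A" and "B" would be emitted twice, which is the kind of duplication the commit message describes.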


@@ -69,7 +69,6 @@ def verify_export(pred_text: str, gtfile: str):
with open(gtfile, "r") as fr:
    true_text = fr.read()
assert pred_text == true_text, "pred_itxt==true_itxt"
return pred_text == true_text
@@ -101,3 +100,7 @@ def test_e2e_docx_conversions():
pred_json: str = json.dumps(doc.export_to_dict(), indent=2)
assert verify_export(pred_json, str(gt_path) + ".json"), "export to json"
if docx_path.name == "word_tables.docx":
    pred_html: str = doc.export_to_html()
    assert verify_export(pred_html, str(gt_path) + ".html"), "export to html"
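
The ground-truth comparison used in the test above can be tried in isolation with a small sketch. It assumes a throwaway temporary file in place of the repository's real test data, and `matches_ground_truth` is a hypothetical stand-in, not the `verify_export` helper from the diff.

```python
import tempfile
from pathlib import Path


def matches_ground_truth(pred_text: str, gt_file: Path) -> bool:
    """Compare an export against a stored ground-truth file (hypothetical helper)."""
    true_text = gt_file.read_text(encoding="utf-8")
    return pred_text == true_text


# Use a temporary directory so the sketch does not depend on real test data.
with tempfile.TemporaryDirectory() as tmp:
    gt_path = Path(tmp) / "word_tables.docx"
    # Same naming scheme as in the test: the export suffix is appended to the path.
    html_gt = Path(str(gt_path) + ".html")
    html_gt.write_text("<table></table>", encoding="utf-8")

    same = matches_ground_truth("<table></table>", html_gt)
    different = matches_ground_truth("<table><tr></tr></table>", html_gt)
```

Returning a plain boolean leaves the assertion and its message to the calling test, as in `assert verify_export(...), "export to html"`.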