fix: Fixing images in the input Word files (#330)

* Fixing images identification in the input Word files

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Populating extracted image data into docling picture for wordx backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated tests

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* removed base64 dependency in msword_backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
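
Reviewer note: as background for the image extraction described above, a .docx file is a ZIP archive, and embedded images are normally stored under `word/media/`. The sketch below is not the code from this commit. It only illustrates, using the Python standard library, how raw image bytes can be read from such an archive. The names `list_docx_images` and `make_fake_docx` are hypothetical, and the "docx" built here is a minimal stand-in, not a valid Word document.

```python
import io
import zipfile


def list_docx_images(docx_bytes: bytes) -> dict[str, bytes]:
    """Return {archive path: raw bytes} for every file under word/media/.

    Hypothetical helper for illustration; not part of the docling backend.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return {
            name: archive.read(name)
            for name in archive.namelist()
            # Skip directory entries, keep only files in the media folder.
            if name.startswith("word/media/") and not name.endswith("/")
        }


def make_fake_docx() -> bytes:
    """Build a tiny ZIP that mimics the layout of a .docx (assumption:
    enough for demonstrating the lookup, not a valid Word document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<w:document/>")
        archive.writestr("word/media/image1.png", b"\x89PNG fake")
    return buffer.getvalue()


if __name__ == "__main__":
    for path, data in list_docx_images(make_fake_docx()).items():
        print(path, len(data), "bytes")
```

Working on the raw bytes directly means no base64 encode/decode step is needed to pass the image data along.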
2025-12-08 20:58:11 +00:00 · 2024-11-14 13:33:34 +01:00
parent bf2a85f1d4
commit 8533039b0c
4 changed files with 107 additions and 78 deletions
--- a/tests/data/groundtruth/docling_v2/word_sample.docx.itxt
+++ b/tests/data/groundtruth/docling_v2/word_sample.docx.itxt
@@ -2,7 +2,7 @@ item-0 at level 0: unspecified: group _root_
  item-1 at level 1: paragraph: Summer activities
  item-2 at level 1: title: Swimming in the lake
    item-3 at level 2: paragraph: Duck
-    item-4 at level 2: paragraph: 
+    item-4 at level 2: picture
    item-5 at level 2: paragraph: Figure 1: This is a cute duckling
    item-6 at level 2: section_header: Let’s swim!
      item-7 at level 3: paragraph: To get started with swimming, fi ...  down in a water and try not to drown:
--- a/tests/data/groundtruth/docling_v2/word_sample.docx.json
+++ b/tests/data/groundtruth/docling_v2/word_sample.docx.json
--- a/tests/data/groundtruth/docling_v2/word_sample.docx.md
+++ b/tests/data/groundtruth/docling_v2/word_sample.docx.md
@@ -4,6 +4,8 @@ Summer activities

 Duck

+<!-- image -->
+
 Figure 1: This is a cute duckling

 ## Let’s swim!