Update notebook to improve Markdown table rendering

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-07-27 04:24:45 +00:00 · 2025-05-08 20:38:19 +02:00
commit 1295c85985
parent 3031302208
1 changed file with 309 additions and 116 deletions
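
The change below routes the notebook's excerpt printing through a `rich` console with a fixed width of 210, so that wide Markdown table rows are not wrapped. A minimal sketch of that pattern follows. The `excerpt` helper and its error handling are hypothetical and not part of the notebook, which slices directly with `ser_text.find(start_cue)` and `ser_text.find(stop_cue)`. Unlike the notebook, the sketch looks for the stop cue only after the start cue and raises when a cue is missing, because `str.find` returns -1 in that case and the slice would then silently select the wrong text.

```python
def excerpt(text: str, start_cue: str, stop_cue: str) -> str:
    """Return text from start_cue (inclusive) up to stop_cue (exclusive).

    Hypothetical helper, not part of the notebook: it raises instead of
    slicing with the -1 that str.find returns for a missing cue.
    """
    start = text.find(start_cue)
    if start == -1:
        raise ValueError(f"start cue not found: {start_cue!r}")
    # Search for the stop cue only after the end of the start cue.
    stop = text.find(stop_cue, start + len(start_cue))
    if stop == -1:
        raise ValueError(f"stop cue not found after start cue: {stop_cue!r}")
    return text[start:stop]


def print_in_console(text: str, width: int = 210) -> None:
    """Print text inside a rich Panel on a console of the given width."""
    # Imported lazily so that excerpt() stays usable without rich installed.
    from rich.console import Console
    from rich.panel import Panel

    Console(width=width).print(Panel(text))


# Stand-in for the serialized document text (assumption for illustration).
sample = "preamble | Copyright © 2024 | table rows | Application of NLP to ESG | rest"
selected = excerpt(sample, "Copyright © 2024", "Application of NLP to ESG")
print(selected)
```

With `rich` installed, `print_in_console(selected)` would render the excerpt in a panel; the width is a parameter here only so that narrower or wider tables can be accommodated.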
--- a/docs/examples/serialization.ipynb
+++ b/docs/examples/serialization.ipynb
@@ -42,7 +42,7 @@
    }
   ],
   "source": [
-    "%pip install -qU pip docling docling-core~=2.29"
+    "%pip install -qU pip docling docling-core~=2.29 rich"
   ]
  },
  {
@@ -54,8 +54,24 @@
    "DOC_SOURCE = \"https://arxiv.org/pdf/2311.18481\"\n",
    "\n",
    "# we set some start-stop cues for defining an excerpt to print\n",
-    "start_cue_incl = \"Copyright © 2024\"\n",
+    "start_cue = \"Copyright © 2024\"\n",
-    "stop_cue_excl = \"Application of NLP to ESG\""
+    "stop_cue = \"Application of NLP to ESG\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from rich.console import Console\n",
    "from rich.panel import Panel\n",
    "\n",
    "console = Console(width=210)  # for preventing Markdown table wrapped rendering\n",
    "\n",
    "\n",
    "def print_in_console(text):\n",
    "    console.print(Panel(text))"
   ]
  },
  {
@@ -74,7 +90,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
@@ -99,29 +115,66 @@
   "source": [
    "We can now apply any `BaseDocSerializer` on the produced document.\n",
    "\n",
    "👉 Note that, to keep the shown output brief, we only print an excerpt.\n",
    "\n",
    "E.g. below we apply an `HTMLDocSerializer`:"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
-     "name": "stdout",
+     "data": {
-     "output_type": "stream",
+      "text/html": [
-     "text": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n",
-      "Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.</p>\n",
+       "│ Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.&lt;/p&gt;                                                                                          │\n",
-      "<table><tbody><tr><th>Report</th><th>Question</th><th>Answer</th></tr><tr><td>IBM 2022</td><td>How many hours were spent on employee learning in 2021?</td><td>22.5 million hours</td></tr><tr><td>IBM 2022</td><td>What was the rate of fatalities in 2021?</td><td>The rate of fatalities in 2021 was 0.0016.</td></tr><tr><td>IBM 2022</td><td>How many full audits were con- ducted in 2022 in India?</td><td>2</td></tr><tr><td>Starbucks 2022</td><td>What is the percentage of women in the Board of Directors?</td><td>25%</td></tr><tr><td>Starbucks 2022</td><td>What was the total energy con- sumption in 2021?</td><td>According to the table, the total energy consumption in 2021 was 2,491,543 MWh.</td></tr><tr><td>Starbucks 2022</td><td>How much packaging material was made from renewable mate- rials?</td><td>According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22.</td></tr></tbody></table>\n",
+       "│ &lt;table&gt;&lt;tbody&gt;&lt;tr&gt;&lt;th&gt;Report&lt;/th&gt;&lt;th&gt;Question&lt;/th&gt;&lt;th&gt;Answer&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;IBM 2022&lt;/td&gt;&lt;td&gt;How many hours were spent on employee learning in 2021?&lt;/td&gt;&lt;td&gt;22.5 million hours&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;IBM         │\n",
-      "<p>Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.</p>\n",
+       "│ 2022&lt;/td&gt;&lt;td&gt;What was the rate of fatalities in 2021?&lt;/td&gt;&lt;td&gt;The rate of fatalities in 2021 was 0.0016.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;IBM 2022&lt;/td&gt;&lt;td&gt;How many full audits were con- ducted in 2022 in                    │\n",
-      "<p>ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the response.</p>\n",
+       "│ India?&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Starbucks 2022&lt;/td&gt;&lt;td&gt;What is the percentage of women in the Board of Directors?&lt;/td&gt;&lt;td&gt;25%&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Starbucks 2022&lt;/td&gt;&lt;td&gt;What was the total energy con-         │\n",
-      "<h2>Related Work</h2>\n",
+       "│ sumption in 2021?&lt;/td&gt;&lt;td&gt;According to the table, the total energy consumption in 2021 was 2,491,543 MWh.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Starbucks 2022&lt;/td&gt;&lt;td&gt;How much packaging material was made from renewable mate-    │\n",
-      "<p>The DocQA integrates multiple AI technologies, namely:</p>\n",
+       "│ rials?&lt;/td&gt;&lt;td&gt;According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;                                                       │\n",
-      "<p>Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al. 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-</p>\n",
+       "│ &lt;p&gt;Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.&lt;/p&gt;                                                                                             │\n",
-      "<figure><figcaption>Figure 1: System architecture: Simplified sketch of document question-answering pipeline.</figcaption></figure>\n",
+       "│ &lt;p&gt;ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the   │\n",
-      "<p>based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).</p>\n",
+       "│ response.&lt;/p&gt;                                                                                                                                                                                                  │\n",
-      "<p>\n"
+       "│ &lt;h2&gt;Related Work&lt;/h2&gt;                                                                                                                                                                                          │\n",
       "│ &lt;p&gt;The DocQA integrates multiple AI technologies, namely:&lt;/p&gt;                                                                                                                                                  │\n",
       "│ &lt;p&gt;Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric     │\n",
       "│ layout analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et │\n",
       "│ al. 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based .  │\n",
       "│ Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-&lt;/p&gt;                        │\n",
       "│ &lt;figure&gt;&lt;figcaption&gt;Figure 1: System architecture: Simplified sketch of document question-answering pipeline.&lt;/figcaption&gt;&lt;/figure&gt;                                                                            │\n",
       "│ &lt;p&gt;based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).&lt;/p&gt;                     │\n",
       "│ &lt;p&gt;                                                                                                                                                                                                            │\n",
       "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
       "</pre>\n"
      ],
      "text/plain": [
       "╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n",
       "│ Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.</p>                                                                                          │\n",
       "│ <table><tbody><tr><th>Report</th><th>Question</th><th>Answer</th></tr><tr><td>IBM 2022</td><td>How many hours were spent on employee learning in 2021?</td><td>22.5 million hours</td></tr><tr><td>IBM         │\n",
       "│ 2022</td><td>What was the rate of fatalities in 2021?</td><td>The rate of fatalities in 2021 was 0.0016.</td></tr><tr><td>IBM 2022</td><td>How many full audits were con- ducted in 2022 in                    │\n",
       "│ India?</td><td>2</td></tr><tr><td>Starbucks 2022</td><td>What is the percentage of women in the Board of Directors?</td><td>25%</td></tr><tr><td>Starbucks 2022</td><td>What was the total energy con-         │\n",
       "│ sumption in 2021?</td><td>According to the table, the total energy consumption in 2021 was 2,491,543 MWh.</td></tr><tr><td>Starbucks 2022</td><td>How much packaging material was made from renewable mate-    │\n",
       "│ rials?</td><td>According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22.</td></tr></tbody></table>                                                       │\n",
       "│ <p>Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.</p>                                                                                             │\n",
       "│ <p>ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the   │\n",
       "│ response.</p>                                                                                                                                                                                                  │\n",
       "│ <h2>Related Work</h2>                                                                                                                                                                                          │\n",
       "│ <p>The DocQA integrates multiple AI technologies, namely:</p>                                                                                                                                                  │\n",
       "│ <p>Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric     │\n",
       "│ layout analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et │\n",
       "│ al. 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based .  │\n",
       "│ Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-</p>                        │\n",
       "│ <figure><figcaption>Figure 1: System architecture: Simplified sketch of document question-answering pipeline.</figcaption></figure>                                                                            │\n",
       "│ <p>based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).</p>                     │\n",
       "│ <p>                                                                                                                                                                                                            │\n",
       "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
@@ -131,7 +184,8 @@
    "ser_result = serializer.serialize()\n",
    "ser_text = ser_result.text\n",
    "\n",
-    "print(ser_text[ser_text.find(start_cue_incl) : ser_text.find(stop_cue_excl)])"
+    "# here we print only an excerpt to keep the output brief:\n",
+    "print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])"
   ]
  },
  {
@@ -143,42 +197,87 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
-     "name": "stdout",
+     "data": {
-     "output_type": "stream",
+      "text/html": [
-     "text": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n",
-      "Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.\n",
+       "│ Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.                                                                                              │\n",
-      "\n",
+       "│                                                                                                                                                                                                                │\n",
-      "| Report         | Question                                                         | Answer                                                                                                          |\n",
+       "│ | Report         | Question                                                         | Answer                                                                                                          |        │\n",
-      "|----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|\n",
+       "│ |----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|        │\n",
-      "| IBM 2022       | How many hours were spent on employee learning in 2021?          | 22.5 million hours                                                                                              |\n",
+       "│ | IBM 2022       | How many hours were spent on employee learning in 2021?          | 22.5 million hours                                                                                              |        │\n",
-      "| IBM 2022       | What was the rate of fatalities in 2021?                         | The rate of fatalities in 2021 was 0.0016.                                                                      |\n",
+       "│ | IBM 2022       | What was the rate of fatalities in 2021?                         | The rate of fatalities in 2021 was 0.0016.                                                                      |        │\n",
-      "| IBM 2022       | How many full audits were con- ducted in 2022 in India?          | 2                                                                                                               |\n",
+       "│ | IBM 2022       | How many full audits were con- ducted in 2022 in India?          | 2                                                                                                               |        │\n",
-      "| Starbucks 2022 | What is the percentage of women in the Board of Directors?       | 25%                                                                                                             |\n",
+       "│ | Starbucks 2022 | What is the percentage of women in the Board of Directors?       | 25%                                                                                                             |        │\n",
-      "| Starbucks 2022 | What was the total energy con- sumption in 2021?                 | According to the table, the total energy consumption in 2021 was 2,491,543 MWh.                                 |\n",
+       "│ | Starbucks 2022 | What was the total energy con- sumption in 2021?                 | According to the table, the total energy consumption in 2021 was 2,491,543 MWh.                                 |        │\n",
-      "| Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. |\n",
+       "│ | Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. |        │\n",
-      "\n",
+       "│                                                                                                                                                                                                                │\n",
-      "Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.\n",
+       "│ Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.                                                                                                    │\n",
-      "\n",
+       "│                                                                                                                                                                                                                │\n",
-      "ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the response.\n",
+       "│ ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the      │\n",
-      "\n",
+       "│ response.                                                                                                                                                                                                      │\n",
-      "## Related Work\n",
+       "│                                                                                                                                                                                                                │\n",
-      "\n",
+       "│ ## Related Work                                                                                                                                                                                                │\n",
-      "The DocQA integrates multiple AI technologies, namely:\n",
+       "│                                                                                                                                                                                                                │\n",
-      "\n",
+       "│ The DocQA integrates multiple AI technologies, namely:                                                                                                                                                         │\n",
-      "Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al. 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-\n",
+       "│                                                                                                                                                                                                                │\n",
-      "\n",
+       "│ Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout │\n",
-      "Figure 1: System architecture: Simplified sketch of document question-answering pipeline.\n",
+       "│ analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al.    │\n",
-      "\n",
+       "│ 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based .      │\n",
-      "<!-- image -->\n",
+       "│ Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-                            │\n",
-      "\n",
+       "│                                                                                                                                                                                                                │\n",
-      "based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).\n",
+       "│ Figure 1: System architecture: Simplified sketch of document question-answering pipeline.                                                                                                                      │\n",
-      "\n",
+       "│                                                                                                                                                                                                                │\n",
-      "\n"
+       "│ &lt;!-- image --&gt;                                                                                                                                                                                                 │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).                            │\n",
       "│                                                                                                                                                                                                                │\n",
       "│                                                                                                                                                                                                                │\n",
       "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
       "</pre>\n"
      ],
      "text/plain": [
       "╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n",
       "│ Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.                                                                                              │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ | Report         | Question                                                         | Answer                                                                                                          |        │\n",
       "│ |----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|        │\n",
       "│ | IBM 2022       | How many hours were spent on employee learning in 2021?          | 22.5 million hours                                                                                              |        │\n",
       "│ | IBM 2022       | What was the rate of fatalities in 2021?                         | The rate of fatalities in 2021 was 0.0016.                                                                      |        │\n",
       "│ | IBM 2022       | How many full audits were con- ducted in 2022 in India?          | 2                                                                                                               |        │\n",
       "│ | Starbucks 2022 | What is the percentage of women in the Board of Directors?       | 25%                                                                                                             |        │\n",
       "│ | Starbucks 2022 | What was the total energy con- sumption in 2021?                 | According to the table, the total energy consumption in 2021 was 2,491,543 MWh.                                 |        │\n",
       "│ | Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. |        │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.                                                                                                    │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the      │\n",
       "│ response.                                                                                                                                                                                                      │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ ## Related Work                                                                                                                                                                                                │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ The DocQA integrates multiple AI technologies, namely:                                                                                                                                                         │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout │\n",
       "│ analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al.    │\n",
       "│ 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based .      │\n",
       "│ Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-                            │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ Figure 1: System architecture: Simplified sketch of document question-answering pipeline.                                                                                                                      │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ <!-- image -->                                                                                                                                                                                                 │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).                            │\n",
       "│                                                                                                                                                                                                                │\n",
       "│                                                                                                                                                                                                                │\n",
       "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
@ -188,7 +287,7 @@
    "ser_result = serializer.serialize()\n",
    "ser_text = ser_result.text\n",
    "\n",
-    "print(ser_text[ser_text.find(start_cue_incl) : ser_text.find(stop_cue_excl)])"
+    "print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])"
   ]
  },
  {
@ -211,35 +310,81 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
-     "name": "stdout",
+     "data": {
-     "output_type": "stream",
+      "text/html": [
-     "text": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n",
-      "Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.\n",
+       "│ Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.                                                                                              │\n",
-      "\n",
+       "│                                                                                                                                                                                                                │\n",
-      "IBM 2022, Question = How many hours were spent on employee learning in 2021?. IBM 2022, Answer = 22.5 million hours. IBM 2022, Question = What was the rate of fatalities in 2021?. IBM 2022, Answer = The rate of fatalities in 2021 was 0.0016.. IBM 2022, Question = How many full audits were con- ducted in 2022 in India?. IBM 2022, Answer = 2. Starbucks 2022, Question = What is the percentage of women in the Board of Directors?. Starbucks 2022, Answer = 25%. Starbucks 2022, Question = What was the total energy con- sumption in 2021?. Starbucks 2022, Answer = According to the table, the total energy consumption in 2021 was 2,491,543 MWh.. Starbucks 2022, Question = How much packaging material was made from renewable mate- rials?. Starbucks 2022, Answer = According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22.\n",
+       "│ IBM 2022, Question = How many hours were spent on employee learning in 2021?. IBM 2022, Answer = 22.5 million hours. IBM 2022, Question = What was the rate of fatalities in 2021?. IBM 2022, Answer = The     │\n",
-      "\n",
+       "│ rate of fatalities in 2021 was 0.0016.. IBM 2022, Question = How many full audits were con- ducted in 2022 in India?. IBM 2022, Answer = 2. Starbucks 2022, Question = What is the percentage of women in the  │\n",
-      "Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.\n",
+       "│ Board of Directors?. Starbucks 2022, Answer = 25%. Starbucks 2022, Question = What was the total energy con- sumption in 2021?. Starbucks 2022, Answer = According to the table, the total energy consumption  │\n",
-      "\n",
+       "│ in 2021 was 2,491,543 MWh.. Starbucks 2022, Question = How much packaging material was made from renewable mate- rials?. Starbucks 2022, Answer = According to the given data, 31% of packaging materials were │\n",
-      "ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the response.\n",
+       "│ made from recycled or renewable materials in FY22.                                                                                                                                                             │\n",
-      "\n",
+       "│                                                                                                                                                                                                                │\n",
-      "## Related Work\n",
+       "│ Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.                                                                                                    │\n",
-      "\n",
+       "│                                                                                                                                                                                                                │\n",
-      "The DocQA integrates multiple AI technologies, namely:\n",
+       "│ ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the      │\n",
-      "\n",
+       "│ response.                                                                                                                                                                                                      │\n",
-      "Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al. 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-\n",
+       "│                                                                                                                                                                                                                │\n",
-      "\n",
+       "│ ## Related Work                                                                                                                                                                                                │\n",
-      "Figure 1: System architecture: Simplified sketch of document question-answering pipeline.\n",
+       "│                                                                                                                                                                                                                │\n",
-      "\n",
+       "│ The DocQA integrates multiple AI technologies, namely:                                                                                                                                                         │\n",
-      "<!-- demo picture placeholder -->\n",
+       "│                                                                                                                                                                                                                │\n",
-      "\n",
+       "│ Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout │\n",
-      "based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).\n",
+       "│ analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al.    │\n",
-      "\n",
+       "│ 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based .      │\n",
-      "\n"
+       "│ Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-                            │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ Figure 1: System architecture: Simplified sketch of document question-answering pipeline.                                                                                                                      │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ &lt;!-- demo picture placeholder --&gt;                                                                                                                                                                              │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).                            │\n",
       "│                                                                                                                                                                                                                │\n",
       "│                                                                                                                                                                                                                │\n",
       "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
       "</pre>\n"
      ],
      "text/plain": [
       "╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n",
       "│ Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.                                                                                              │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ IBM 2022, Question = How many hours were spent on employee learning in 2021?. IBM 2022, Answer = 22.5 million hours. IBM 2022, Question = What was the rate of fatalities in 2021?. IBM 2022, Answer = The     │\n",
       "│ rate of fatalities in 2021 was 0.0016.. IBM 2022, Question = How many full audits were con- ducted in 2022 in India?. IBM 2022, Answer = 2. Starbucks 2022, Question = What is the percentage of women in the  │\n",
       "│ Board of Directors?. Starbucks 2022, Answer = 25%. Starbucks 2022, Question = What was the total energy con- sumption in 2021?. Starbucks 2022, Answer = According to the table, the total energy consumption  │\n",
       "│ in 2021 was 2,491,543 MWh.. Starbucks 2022, Question = How much packaging material was made from renewable mate- rials?. Starbucks 2022, Answer = According to the given data, 31% of packaging materials were │\n",
       "│ made from recycled or renewable materials in FY22.                                                                                                                                                             │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.                                                                                                    │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the      │\n",
       "│ response.                                                                                                                                                                                                      │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ ## Related Work                                                                                                                                                                                                │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ The DocQA integrates multiple AI technologies, namely:                                                                                                                                                         │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout │\n",
       "│ analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al.    │\n",
       "│ 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based .      │\n",
       "│ Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-                            │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ Figure 1: System architecture: Simplified sketch of document question-answering pipeline.                                                                                                                      │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ <!-- demo picture placeholder -->                                                                                                                                                                              │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).                            │\n",
       "│                                                                                                                                                                                                                │\n",
       "│                                                                                                                                                                                                                │\n",
       "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
@ -257,7 +402,7 @@
    "ser_result = serializer.serialize()\n",
    "ser_text = ser_result.text\n",
    "\n",
-    "print(ser_text[ser_text.find(start_cue_incl) : ser_text.find(stop_cue_excl)])"
+    "print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])"
   ]
  },
  {
@ -283,7 +428,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
@ -328,7 +473,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
@ -395,41 +540,89 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
-     "name": "stdout",
+     "data": {
-     "output_type": "stream",
+      "text/html": [
-     "text": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n",
-      "Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.\n",
+       "│ Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.                                                                                              │\n",
-      "\n",
+       "│                                                                                                                                                                                                                │\n",
-      "| Report         | Question                                                         | Answer                                                                                                          |\n",
+       "│ | Report         | Question                                                         | Answer                                                                                                          |        │\n",
-      "|----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|\n",
+       "│ |----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|        │\n",
-      "| IBM 2022       | How many hours were spent on employee learning in 2021?          | 22.5 million hours                                                                                              |\n",
+       "│ | IBM 2022       | How many hours were spent on employee learning in 2021?          | 22.5 million hours                                                                                              |        │\n",
-      "| IBM 2022       | What was the rate of fatalities in 2021?                         | The rate of fatalities in 2021 was 0.0016.                                                                      |\n",
+       "│ | IBM 2022       | What was the rate of fatalities in 2021?                         | The rate of fatalities in 2021 was 0.0016.                                                                      |        │\n",
-      "| IBM 2022       | How many full audits were con- ducted in 2022 in India?          | 2                                                                                                               |\n",
+       "│ | IBM 2022       | How many full audits were con- ducted in 2022 in India?          | 2                                                                                                               |        │\n",
-      "| Starbucks 2022 | What is the percentage of women in the Board of Directors?       | 25%                                                                                                             |\n",
+       "│ | Starbucks 2022 | What is the percentage of women in the Board of Directors?       | 25%                                                                                                             |        │\n",
-      "| Starbucks 2022 | What was the total energy con- sumption in 2021?                 | According to the table, the total energy consumption in 2021 was 2,491,543 MWh.                                 |\n",
+       "│ | Starbucks 2022 | What was the total energy con- sumption in 2021?                 | According to the table, the total energy consumption in 2021 was 2,491,543 MWh.                                 |        │\n",
-      "| Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. |\n",
+       "│ | Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. |        │\n",
-      "\n",
+       "│                                                                                                                                                                                                                │\n",
-      "Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.\n",
+       "│ Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.                                                                                                    │\n",
-      "\n",
+       "│                                                                                                                                                                                                                │\n",
-      "ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the response.\n",
+       "│ ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the      │\n",
-      "\n",
+       "│ response.                                                                                                                                                                                                      │\n",
-      "## Related Work\n",
+       "│                                                                                                                                                                                                                │\n",
-      "\n",
+       "│ ## Related Work                                                                                                                                                                                                │\n",
-      "The DocQA integrates multiple AI technologies, namely:\n",
+       "│                                                                                                                                                                                                                │\n",
-      "\n",
+       "│ The DocQA integrates multiple AI technologies, namely:                                                                                                                                                         │\n",
-      "Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al. 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-\n",
+       "│                                                                                                                                                                                                                │\n",
-      "\n",
+       "│ Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout │\n",
-      "Figure 1: System architecture: Simplified sketch of document question-answering pipeline.\n",
+       "│ analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al.    │\n",
-      "<!-- Picture description: The image depicts a document conversion process. It is a sequence of steps that includes document conversion, information retrieval, and response generation. The document conversion step involves converting the document from a text format to a markdown format. The information retrieval step involves retrieving the document from a database or other source. The response generation step involves generating a response from the information retrieval step. -->\n",
+       "│ 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based .      │\n",
-      "\n",
+       "│ Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-                            │\n",
-      "based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).\n",
+       "│                                                                                                                                                                                                                │\n",
-      "\n",
+       "│ Figure 1: System architecture: Simplified sketch of document question-answering pipeline.                                                                                                                      │\n",
-      "\n"
+       "│ &lt;!-- Picture description: The image depicts a document conversion process. It is a sequence of steps that includes document conversion, information retrieval, and response generation. The document           │\n",
       "│ conversion step involves converting the document from a text format to a markdown format. The information retrieval step involves retrieving the document from a database or other source. The response        │\n",
       "│ generation step involves generating a response from the information retrieval step. --&gt;                                                                                                                        │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).                            │\n",
       "│                                                                                                                                                                                                                │\n",
       "│                                                                                                                                                                                                                │\n",
       "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
       "</pre>\n"
      ],
      "text/plain": [
       "╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n",
       "│ Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.                                                                                              │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ | Report         | Question                                                         | Answer                                                                                                          |        │\n",
       "│ |----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|        │\n",
       "│ | IBM 2022       | How many hours were spent on employee learning in 2021?          | 22.5 million hours                                                                                              |        │\n",
       "│ | IBM 2022       | What was the rate of fatalities in 2021?                         | The rate of fatalities in 2021 was 0.0016.                                                                      |        │\n",
       "│ | IBM 2022       | How many full audits were con- ducted in 2022 in India?          | 2                                                                                                               |        │\n",
       "│ | Starbucks 2022 | What is the percentage of women in the Board of Directors?       | 25%                                                                                                             |        │\n",
       "│ | Starbucks 2022 | What was the total energy con- sumption in 2021?                 | According to the table, the total energy consumption in 2021 was 2,491,543 MWh.                                 |        │\n",
       "│ | Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. |        │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.                                                                                                    │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the      │\n",
       "│ response.                                                                                                                                                                                                      │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ ## Related Work                                                                                                                                                                                                │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ The DocQA integrates multiple AI technologies, namely:                                                                                                                                                         │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout │\n",
       "│ analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al.    │\n",
       "│ 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based .      │\n",
       "│ Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-                            │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ Figure 1: System architecture: Simplified sketch of document question-answering pipeline.                                                                                                                      │\n",
       "│ <!-- Picture description: The image depicts a document conversion process. It is a sequence of steps that includes document conversion, information retrieval, and response generation. The document           │\n",
       "│ conversion step involves converting the document from a text format to a markdown format. The information retrieval step involves retrieving the document from a database or other source. The response        │\n",
       "│ generation step involves generating a response from the information retrieval step. -->                                                                                                                        │\n",
       "│                                                                                                                                                                                                                │\n",
       "│ based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).                            │\n",
       "│                                                                                                                                                                                                                │\n",
       "│                                                                                                                                                                                                                │\n",
       "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
@ -444,7 +637,7 @@
    "ser_result = serializer.serialize()\n",
    "ser_text = ser_result.text\n",
    "\n",
-    "print(ser_text[ser_text.find(start_cue_incl) : ser_text.find(stop_cue_excl)])"
+    "print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])"
   ]
  }
 ],