{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Simple PDF conversion and result inspection" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "# requirements for this example:\n", "%pip install -qq docling rich" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import json\n", "import warnings\n", "from pathlib import Path\n", "from tempfile import TemporaryDirectory\n", "\n", "import rich\n", "\n", "warnings.filterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic|torch\")\n", "warnings.filterwarnings(action=\"ignore\", category=FutureWarning, module=\"easyocr\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convert the PDF document" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "5abee618f233461abf6eaea7603cc1c1", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Fetching 7 files: 0%| | 0/7 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from docling.document_converter import DocumentConverter\n", "\n", "source = \"https://arxiv.org/pdf/2206.01062\" # DocLayNet paper\n", "converter = DocumentConverter()\n", "conv_res = converter.convert_single(source)\n", "doc = conv_res.output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspect the native doc representation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section we will inspect the document representation produced by the conversion." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "doc_dict = doc.model_dump()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a full inspection, run the cell below, providing you with a complete JSON file:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Your file is located at: /var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmpre5js_vg/doc.json\n" ] } ], "source": [ "file_path = Path(f\"{(json_tmp_dir := TemporaryDirectory()).name}\") / \"doc.json\"\n", "with open(file_path, \"w\") as f:\n", " json.dump(doc_dict, f, indent=4)\n", "print(f\"Your file is located at: {file_path}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For space considerations, below we only display a part between a start and an end position of the `main-text` key.\n", "\n", "👉 Notice the various layout metadata extracted: *types* (e.g. section headers, paragraphs), *page numbers*, *bounding boxes* etc. \n", "\n", "(You may also notice there is a *table reference*, but the actual content is not shown below, as it is outside `main-text`; it can be inspected in the full file above though.)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[\n", " {\n", " 'text': 'Baselines for Object Detection',\n", " 'type': 'subtitle-level-1',\n", " 'name': 'Section-header',\n", " 'font': None,\n", " 'prov': [\n", " {\n", " 'bbox': [317.1941223144531, 284.5037841796875, 466.8532409667969, 295.42913818359375],\n", " 'page': 6,\n", " 'span': [0, 30],\n", " '__ref_s3_data': None\n", " }\n", " ]\n", " },\n", " {\n", " 'text': 'In Table 2, we present baseline experiments (given in mAP) on Mask R-CNN [12], Faster R-CNN [11], \n", "and YOLOv5 [13]. Both training and evaluation were performed on RGB images with dimensions of 1025 × 1025 pixels. \n", "For training, we only used one annotation in case of redundantly annotated pages. As one can observe, the variation\n", "in mAP between the models is rather low, but overall between 6 and 10% lower than the mAP computed from the \n", "pairwise human annotations on triple-annotated pages. This gives a good indication that the DocLayNet dataset poses\n", "a worthwhile challenge for the research community to close the gap between human recognition and ML approaches. It \n", "is interesting to see that Mask R-CNN and Faster R-CNN produce very comparable mAP scores, indicating that \n", "pixel-based image segmentation derived from bounding-boxes does not help to obtain better predictions. On the other\n", "hand, the more recent Yolov5x model does very well and even out-performs humans on selected labels such as Text , \n", "Table and Picture . This is not entirely surprising, as Text , Table and Picture are abundant and the most visually\n", "distinctive in a document.',\n", " 'type': 'paragraph',\n", " 'name': 'Text',\n", " 'font': None,\n", " 'prov': [\n", " {\n", " 'bbox': [317.0144348144531, 85.2998275756836, 558.7822875976562, 280.8944396972656],\n", " 'page': 6,\n", " 'span': [0, 1146],\n", " '__ref_s3_data': None\n", " }\n", " ]\n", " },\n", " {\n", " 'text': 'DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis',\n", " 'type': 'page-header',\n", " 'name': 'Page-header',\n", " 'font': None,\n", " 'prov': [\n", " {\n", " 'bbox': [53.35094451904297, 722.9555053710938, 347.0172424316406, 732.038818359375],\n", " 'page': 7,\n", " 'span': [0, 71],\n", " '__ref_s3_data': None\n", " }\n", " ]\n", " },\n", " {\n", " 'text': 'KDD ’22, August 14-18, 2022, Washington, DC, USA',\n", " 'type': 'page-header',\n", " 'name': 'Page-header',\n", " 'font': None,\n", " 'prov': [\n", " {\n", " 'bbox': [365.1936950683594, 723.0802001953125, 558.7797241210938, 731.8773803710938],\n", " 'page': 7,\n", " 'span': [0, 48],\n", " '__ref_s3_data': None\n", " }\n", " ]\n", " },\n", " {\n", " 'text': 'Table 3: Performance of a Mask R-CNN R50 network in mAP@0.5-0.95 scores trained on DocLayNet with \n", "different class label sets. 
The reduced label sets were obtained by either down-mapping or dropping labels.',\n", " 'type': 'caption',\n", " 'name': 'Caption',\n", " 'font': None,\n", " 'prov': [\n", " {\n", " 'bbox': [52.8690299987793, 663.3739624023438, 295.6486511230469, 705.8510131835938],\n", " 'page': 7,\n", " 'span': [0, 205],\n", " '__ref_s3_data': None\n", " }\n", " ]\n", " },\n", " {'name': 'Table', 'type': 'table', '$ref': '#/tables/2'},\n", " {\n", " 'text': 'Learning Curve',\n", " 'type': 'subtitle-level-1',\n", " 'name': 'Section-header',\n", " 'font': None,\n", " 'prov': [\n", " {\n", " 'bbox': [53.446834564208984, 461.592041015625, 131.05624389648438, 472.6955871582031],\n", " 'page': 7,\n", " 'span': [0, 14],\n", " '__ref_s3_data': None\n", " }\n", " ]\n", " },\n", " {\n", " 'text': 'One of the fundamental questions related to any dataset is if it is \"large enough\". To answer this\n", "question for DocLayNet, we performed a data ablation study in which we evaluated a Mask R-CNN model trained on \n", "increasing fractions of the DocLayNet dataset. As can be seen in Figure 5, the mAP score rises sharply in the \n", "beginning and eventually levels out. To estimate the error-bar on the metrics, we ran the training five times on \n", "the entire data-set. This resulted in a 1% error-bar, depicted by the shaded area in Figure 5. In the inset of \n", "Figure 5, we show the exact same data-points, but with a logarithmic scale on the x-axis. As is expected, the mAP \n", "score increases linearly as a function of the data-size in the inset. The curve ultimately flattens out between the\n", "80% and 100% mark, with the 80% mark falling within the error-bars of the 100% mark. This provides a good \n", "indication that the model would not improve significantly by yet increasing the data size. 
Rather, it would \n", "probably benefit more from improved data consistency (as discussed in Section 3), data augmentation methods [23], \n", "or the addition of more document categories and styles.',\n", " 'type': 'paragraph',\n", " 'name': 'Text',\n", " 'font': None,\n", " 'prov': [\n", " {\n", " 'bbox': [52.78499984741211, 262.38037109375, 295.558349609375, 457.72955322265625],\n", " 'page': 7,\n", " 'span': [0, 1157],\n", " '__ref_s3_data': None\n", " }\n", " ]\n", " }\n", "]\n", "\n" ], "text/plain": [ "\u001b[1m[\u001b[0m\n", " \u001b[1m{\u001b[0m\n", " \u001b[32m'text'\u001b[0m: \u001b[32m'Baselines for Object Detection'\u001b[0m,\n", " \u001b[32m'type'\u001b[0m: \u001b[32m'subtitle-level-1'\u001b[0m,\n", " \u001b[32m'name'\u001b[0m: \u001b[32m'Section-header'\u001b[0m,\n", " \u001b[32m'font'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'prov'\u001b[0m: \u001b[1m[\u001b[0m\n", " \u001b[1m{\u001b[0m\n", " \u001b[32m'bbox'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1;36m317.1941223144531\u001b[0m, \u001b[1;36m284.5037841796875\u001b[0m, \u001b[1;36m466.8532409667969\u001b[0m, \u001b[1;36m295.42913818359375\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'page'\u001b[0m: \u001b[1;36m6\u001b[0m,\n", " \u001b[32m'span'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m30\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'__ref_s3_data'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m]\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[1m{\u001b[0m\n", " \u001b[32m'text'\u001b[0m: \u001b[32m'In Table 2, we present baseline experiments \u001b[0m\u001b[32m(\u001b[0m\u001b[32mgiven in mAP\u001b[0m\u001b[32m)\u001b[0m\u001b[32m on Mask R-CNN \u001b[0m\u001b[32m[\u001b[0m\u001b[32m12\u001b[0m\u001b[32m]\u001b[0m\u001b[32m, Faster R-CNN \u001b[0m\u001b[32m[\u001b[0m\u001b[32m11\u001b[0m\u001b[32m]\u001b[0m\u001b[32m, \u001b[0m\n", "\u001b[32mand YOLOv5 \u001b[0m\u001b[32m[\u001b[0m\u001b[32m13\u001b[0m\u001b[32m]\u001b[0m\u001b[32m. Both training and evaluation were performed on RGB images with dimensions of 1025 × 1025 pixels. \u001b[0m\n", "\u001b[32mFor training, we only used one annotation in case of redundantly annotated pages. As one can observe, the variation\u001b[0m\n", "\u001b[32min mAP between the models is rather low, but overall between 6 and 10% lower than the mAP computed from the \u001b[0m\n", "\u001b[32mpairwise human annotations on triple-annotated pages. This gives a good indication that the DocLayNet dataset poses\u001b[0m\n", "\u001b[32ma worthwhile challenge for the research community to close the gap between human recognition and ML approaches. It \u001b[0m\n", "\u001b[32mis interesting to see that Mask R-CNN and Faster R-CNN produce very comparable mAP scores, indicating that \u001b[0m\n", "\u001b[32mpixel-based image segmentation derived from bounding-boxes does not help to obtain better predictions. On the other\u001b[0m\n", "\u001b[32mhand, the more recent Yolov5x model does very well and even out-performs humans on selected labels such as Text , \u001b[0m\n", "\u001b[32mTable and Picture . 
This is not entirely surprising, as Text , Table and Picture are abundant and the most visually\u001b[0m\n", "\u001b[32mdistinctive in a document.'\u001b[0m,\n", " \u001b[32m'type'\u001b[0m: \u001b[32m'paragraph'\u001b[0m,\n", " \u001b[32m'name'\u001b[0m: \u001b[32m'Text'\u001b[0m,\n", " \u001b[32m'font'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'prov'\u001b[0m: \u001b[1m[\u001b[0m\n", " \u001b[1m{\u001b[0m\n", " \u001b[32m'bbox'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1;36m317.0144348144531\u001b[0m, \u001b[1;36m85.2998275756836\u001b[0m, \u001b[1;36m558.7822875976562\u001b[0m, \u001b[1;36m280.8944396972656\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'page'\u001b[0m: \u001b[1;36m6\u001b[0m,\n", " \u001b[32m'span'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1146\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'__ref_s3_data'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m]\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[1m{\u001b[0m\n", " \u001b[32m'text'\u001b[0m: \u001b[32m'DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis'\u001b[0m,\n", " \u001b[32m'type'\u001b[0m: \u001b[32m'page-header'\u001b[0m,\n", " \u001b[32m'name'\u001b[0m: \u001b[32m'Page-header'\u001b[0m,\n", " \u001b[32m'font'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'prov'\u001b[0m: \u001b[1m[\u001b[0m\n", " \u001b[1m{\u001b[0m\n", " \u001b[32m'bbox'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1;36m53.35094451904297\u001b[0m, \u001b[1;36m722.9555053710938\u001b[0m, \u001b[1;36m347.0172424316406\u001b[0m, \u001b[1;36m732.038818359375\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'page'\u001b[0m: \u001b[1;36m7\u001b[0m,\n", " \u001b[32m'span'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m71\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'__ref_s3_data'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m]\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[1m{\u001b[0m\n", " \u001b[32m'text'\u001b[0m: \u001b[32m'KDD ’22, August 14-18, 2022, Washington, DC, USA'\u001b[0m,\n", " \u001b[32m'type'\u001b[0m: \u001b[32m'page-header'\u001b[0m,\n", " \u001b[32m'name'\u001b[0m: \u001b[32m'Page-header'\u001b[0m,\n", " \u001b[32m'font'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'prov'\u001b[0m: \u001b[1m[\u001b[0m\n", " \u001b[1m{\u001b[0m\n", " \u001b[32m'bbox'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1;36m365.1936950683594\u001b[0m, \u001b[1;36m723.0802001953125\u001b[0m, \u001b[1;36m558.7797241210938\u001b[0m, \u001b[1;36m731.8773803710938\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'page'\u001b[0m: \u001b[1;36m7\u001b[0m,\n", " \u001b[32m'span'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m48\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'__ref_s3_data'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m]\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[1m{\u001b[0m\n", " \u001b[32m'text'\u001b[0m: \u001b[32m'Table 3: Performance of a Mask R-CNN R50 network in mAP@0.5-0.95 scores trained on DocLayNet with \u001b[0m\n", "\u001b[32mdifferent class label sets. 
The reduced label sets were obtained by either down-mapping or dropping labels.'\u001b[0m,\n", " \u001b[32m'type'\u001b[0m: \u001b[32m'caption'\u001b[0m,\n", " \u001b[32m'name'\u001b[0m: \u001b[32m'Caption'\u001b[0m,\n", " \u001b[32m'font'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'prov'\u001b[0m: \u001b[1m[\u001b[0m\n", " \u001b[1m{\u001b[0m\n", " \u001b[32m'bbox'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1;36m52.8690299987793\u001b[0m, \u001b[1;36m663.3739624023438\u001b[0m, \u001b[1;36m295.6486511230469\u001b[0m, \u001b[1;36m705.8510131835938\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'page'\u001b[0m: \u001b[1;36m7\u001b[0m,\n", " \u001b[32m'span'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m205\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'__ref_s3_data'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m]\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[1m{\u001b[0m\u001b[32m'name'\u001b[0m: \u001b[32m'Table'\u001b[0m, \u001b[32m'type'\u001b[0m: \u001b[32m'table'\u001b[0m, \u001b[32m'$ref'\u001b[0m: \u001b[32m'#/tables/2'\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[1m{\u001b[0m\n", " \u001b[32m'text'\u001b[0m: \u001b[32m'Learning Curve'\u001b[0m,\n", " \u001b[32m'type'\u001b[0m: \u001b[32m'subtitle-level-1'\u001b[0m,\n", " \u001b[32m'name'\u001b[0m: \u001b[32m'Section-header'\u001b[0m,\n", " \u001b[32m'font'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'prov'\u001b[0m: \u001b[1m[\u001b[0m\n", " \u001b[1m{\u001b[0m\n", " \u001b[32m'bbox'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1;36m53.446834564208984\u001b[0m, \u001b[1;36m461.592041015625\u001b[0m, \u001b[1;36m131.05624389648438\u001b[0m, \u001b[1;36m472.6955871582031\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'page'\u001b[0m: \u001b[1;36m7\u001b[0m,\n", " \u001b[32m'span'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m14\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'__ref_s3_data'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m]\u001b[0m\n", " \u001b[1m}\u001b[0m,\n", " \u001b[1m{\u001b[0m\n", " \u001b[32m'text'\u001b[0m: \u001b[32m'One of the fundamental questions related to any dataset is if it is \"large enough\". To answer this\u001b[0m\n", "\u001b[32mquestion for DocLayNet, we performed a data ablation study in which we evaluated a Mask R-CNN model trained on \u001b[0m\n", "\u001b[32mincreasing fractions of the DocLayNet dataset. As can be seen in Figure 5, the mAP score rises sharply in the \u001b[0m\n", "\u001b[32mbeginning and eventually levels out. To estimate the error-bar on the metrics, we ran the training five times on \u001b[0m\n", "\u001b[32mthe entire data-set. This resulted in a 1% error-bar, depicted by the shaded area in Figure 5. In the inset of \u001b[0m\n", "\u001b[32mFigure 5, we show the exact same data-points, but with a logarithmic scale on the x-axis. As is expected, the mAP \u001b[0m\n", "\u001b[32mscore increases linearly as a function of the data-size in the inset. The curve ultimately flattens out between the\u001b[0m\n", "\u001b[32m80% and 100% mark, with the 80% mark falling within the error-bars of the 100% mark. This provides a good \u001b[0m\n", "\u001b[32mindication that the model would not improve significantly by yet increasing the data size. 
Rather, it would \u001b[0m\n", "\u001b[32mprobably benefit more from improved data consistency \u001b[0m\u001b[32m(\u001b[0m\u001b[32mas discussed in Section 3\u001b[0m\u001b[32m)\u001b[0m\u001b[32m, data augmentation methods \u001b[0m\u001b[32m[\u001b[0m\u001b[32m23\u001b[0m\u001b[32m]\u001b[0m\u001b[32m, \u001b[0m\n", "\u001b[32mor the addition of more document categories and styles.'\u001b[0m,\n", " \u001b[32m'type'\u001b[0m: \u001b[32m'paragraph'\u001b[0m,\n", " \u001b[32m'name'\u001b[0m: \u001b[32m'Text'\u001b[0m,\n", " \u001b[32m'font'\u001b[0m: \u001b[3;35mNone\u001b[0m,\n", " \u001b[32m'prov'\u001b[0m: \u001b[1m[\u001b[0m\n", " \u001b[1m{\u001b[0m\n", " \u001b[32m'bbox'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1;36m52.78499984741211\u001b[0m, \u001b[1;36m262.38037109375\u001b[0m, \u001b[1;36m295.558349609375\u001b[0m, \u001b[1;36m457.72955322265625\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'page'\u001b[0m: \u001b[1;36m7\u001b[0m,\n", " \u001b[32m'span'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1157\u001b[0m\u001b[1m]\u001b[0m,\n", " \u001b[32m'__ref_s3_data'\u001b[0m: \u001b[3;35mNone\u001b[0m\n", " \u001b[1m}\u001b[0m\n", " \u001b[1m]\u001b[0m\n", " \u001b[1m}\u001b[0m\n", "\u001b[1m]\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "START_POS, END_POS = 90, 98\n", "rich.print(doc_dict[\"main-text\"][START_POS:END_POS])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspect the Markdown export" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section we look at the Markdown export." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a full inspection, run the cell below, providing you with a complete Markdown file:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Your file is located at: /var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmp9j_sqqqd/doc.md\n" ] } ], "source": [ "doc_slice_md = doc.export_to_markdown()\n", "md_path = Path(f\"{(md_tmp_dir := TemporaryDirectory()).name}\") / \"doc.md\"\n", "with open(md_path, \"w\") as f:\n", " f.write(doc_slice_md)\n", "print(f\"Your file is located at: {md_path}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For space considerations, we here only display a part — using the same start and end position as further above:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "## Baselines for Object Detection\n", "\n", "In Table 2, we present baseline experiments (given in mAP) on Mask R-CNN [12], Faster R-CNN [11], and YOLOv5 [13]. Both training and evaluation were performed on RGB images with dimensions of 1025 × 1025 pixels. For training, we only used one annotation in case of redundantly annotated pages. As one can observe, the variation in mAP between the models is rather low, but overall between 6 and 10% lower than the mAP computed from the pairwise human annotations on triple-annotated pages. This gives a good indication that the DocLayNet dataset poses a worthwhile challenge for the research community to close the gap between human recognition and ML approaches. It is interesting to see that Mask R-CNN and Faster R-CNN produce very comparable mAP scores, indicating that pixel-based image segmentation derived from bounding-boxes does not help to obtain better predictions. 
On the other hand, the more recent Yolov5x model does very well and even out-performs humans on selected labels such as Text , Table and Picture . This is not entirely surprising, as Text , Table and Picture are abundant and the most visually distinctive in a document.\n", "\n", "Table 3: Performance of a Mask R-CNN R50 network in mAP@0.5-0.95 scores trained on DocLayNet with different class label sets. The reduced label sets were obtained by either down-mapping or dropping labels.\n", "\n", "| Class-count | 11 | 6 | 5 | 4 |\n", "|----------------|------|---------|---------|---------|\n", "| Caption | 68 | Text | Text | Text |\n", "| Footnote | 71 | Text | Text | Text |\n", "| Formula | 60 | Text | Text | Text |\n", "| List-item | 81 | Text | 82 | Text |\n", "| Page-footer | 62 | 62 | - | - |\n", "| Page-header | 72 | 68 | - | - |\n", "| Picture | 72 | 72 | 72 | 72 |\n", "| Section-header | 68 | 67 | 69 | 68 |\n", "| Table | 82 | 83 | 82 | 82 |\n", "| Text | 85 | 84 | 84 | 84 |\n", "| Title | 77 | Sec.-h. | Sec.-h. | Sec.-h. |\n", "| Overall | 72 | 73 | 78 | 77 |\n", "\n", "## Learning Curve\n", "\n", "One of the fundamental questions related to any dataset is if it is \"large enough\". To answer this question for DocLayNet, we performed a data ablation study in which we evaluated a Mask R-CNN model trained on increasing fractions of the DocLayNet dataset. As can be seen in Figure 5, the mAP score rises sharply in the beginning and eventually levels out. To estimate the error-bar on the metrics, we ran the training five times on the entire data-set. This resulted in a 1% error-bar, depicted by the shaded area in Figure 5. In the inset of Figure 5, we show the exact same data-points, but with a logarithmic scale on the x-axis. As is expected, the mAP score increases linearly as a function of the data-size in the inset. The curve ultimately flattens out between the 80% and 100% mark, with the 80% mark falling within the error-bars of the 100% mark. This provides a good indication that the model would not improve significantly by yet increasing the data size. Rather, it would probably benefit more from improved data consistency (as discussed in Section 3), data augmentation methods [23], or the addition of more document categories and styles." ], "text/plain": [ "