## DocLayNet: A Large Human-Annotated Dataset for

Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, Peter Staar

https://doi.org/10.1145/3534678.3539043

## ABSTRACT

Accurate document layout analysis is a key requirement for high-

## CCS CONCEPTS

Figure 1:

## KEYWORDS

PDF document conversion, layout segmentation, object-detection,

## ACM Reference Format:

Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter

## 1

Despite the substantial improvements achieved with machine-learning

A key problem in the process of document conversion is to under-

In this paper, we present the DocLayNet dataset. It provides page-

- (1)
- (2)
- (3)
- (4) This enables experimentation with annotation uncertainty
- (5)

All aspects outlined above are detailed in Section 3. In Section 4,

In Section 5, we will present baseline accuracy numbers for a

## 2

While early approaches in document-layout analysis used rule-

Lately, new types of ML models for document-layout analysis

## 3

DocLayNet contains 80863 PDF pages. Among these, 7059 carry two

In addition to open intellectual property constraints for the

Figure 2: Distribution of DocLayNet pages across document

to a minimum, since they introduce difficulties in annotation (see

The pages in DocLayNet can be grouped into six distinct cate-

We did not control the document selection with regard to lan-

To ensure that future benchmarks in the document-layout analy-

Table 1 shows the overall frequency and distribution of the labels

In order to accommodate the different types of models currently

Despite being cost-intense and far less scalable than automation,

## 4

The annotation campaign was carried out in four phases. In phase

Figure 3: Corpus Conversion Service annotation user inter-

Table 1: DocLayNet dataset overview.
Along with the frequency of each class label, we present the relative occurrence (as %

we distributed the annotation workload and performed continuous

Phase 1: Data selection and preparation.

include publication repositories such as arXiv

Preparation work included uploading and parsing the sourced

Phase 2: Label selection and guideline.

the textual content of an element, which goes beyond visual layout

At first sight, the task of visual document-layout interpretation

Obviously, this inconsistency in annotations is not desirable for

The complete annotation guideline is over 100 pages long and a

Phase 3: Training.

Figure 4: Examples of plausible annotation alternatives for

were carried out over a timeframe of 12 weeks, after which 8 of the

Phase 4: Production annotation.

Table 2: Prediction performance (mAP@0.5-0.95) of object

to avoid this at any cost in order to have clear, unbiased baseline

## 5

The primary goal of DocLayNet is to obtain high-quality ML models

Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask

paper and leave the detailed evaluation of more recent methods

In this section, we will present several aspects related to the

## Baselines for Object Detection

In Table 2, we present baseline experiments (given in mAP) on Mask

Table 3: Performance of a Mask R-CNN R50 network in

## Learning Curve

One of the fundamental questions related to any dataset is if it is

## Impact of Class Labels

The choice and number of labels can have a significant effect on

Table 4: Performance of a Mask R-CNN R50 network with

lists in PubLayNet (grouped list-items) versus DocLayNet (separate

## Impact of Document Split in Train and Test Set

Many documents in DocLayNet have a unique styling.
In order

## Dataset Comparison

Throughout this paper, we claim that DocLayNet's wider variety of

Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask

For comparison of DocBank with DocLayNet, we trained only

## Example Predictions

To conclude this section, we illustrate the quality of layout predic-

## 6

In this paper, we presented the DocLayNet dataset. It provides the

From the dataset, we have derived on the one hand reference

To date, there is still a significant gap between human and ML

Figure 6: Example layout predictions on selected pages from the DocLayNet test-set. (A, D) exhibit favourable results on
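The metric mAP@0.5-0.95 named in the captions of Table 2, Figure 5, and Table 5 is not defined in this excerpt. As a rough sketch, assuming the COCO convention of averaging over ten intersection-over-union (IoU) thresholds from 0.50 to 0.95 in steps of 0.05, the following shows how the IoU of two boxes is computed and which thresholds a single prediction would satisfy. The function names and the `(x0, y0, x1, y1)` box format are illustrative assumptions, not taken from the paper, and the full metric additionally involves per-class precision-recall averaging that is not shown here.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x0, y0, x1, y1) boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95 (ten values).
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_matched(pred_box, gt_box):
    """Return the IoU thresholds at which pred_box counts as a match for gt_box."""
    value = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if value >= t]
```

A prediction that overlaps its ground-truth box only loosely counts as correct at the lower thresholds and as a miss at the stricter ones, which is why this averaged metric rewards tight bounding boxes.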