DocLayNet: A Large Human-Annotated Dataset for
Birgit Pfitzmann
Christoph Auer
Michele Dolfi
Ahmed S. Nassar
Peter Staar
ABSTRACT
Accurate document layout analysis is a key requirement for high-
CCS CONCEPTS
•
https://doi.org/10.1145/3534678.3539043
Figure 1:
KEYWORDS
PDF document conversion, layout segmentation, object-detection,
ACM Reference Format:
Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter
1
Despite the substantial improvements achieved with machine-learning
A key problem in the process of document conversion is to under-
In this paper, we present the DocLayNet dataset. It provides page-
- (1)
- (2)
- (3)
- (4)
This enables experimentation with annotation uncertainty
- (5)
All aspects outlined above are detailed in Section 3. In Section 4,
In Section 5, we will present baseline accuracy numbers for a
2
While early approaches in document-layout analysis used rule-
Lately, new types of ML models for document-layout analysis
3
DocLayNet contains 80863 PDF pages. Among these, 7059 carry two
In addition to open intellectual property constraints for the
Figure 2: Distribution of DocLayNet pages across document
to a minimum, since they introduce difficulties in annotation (see
The pages in DocLayNet can be grouped into six distinct cate-
We did not control the document selection with regard to lan-
To ensure that future benchmarks in the document-layout analy-
Table 1 shows the overall frequency and distribution of the labels
In order to accommodate the different types of models currently
Despite being cost-intensive and far less scalable than automation,
4
The annotation campaign was carried out in four phases. In phase
Figure 3: Corpus Conversion Service annotation user inter-
Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as %
we distributed the annotation workload and performed continuous
Phase 1: Data selection and preparation.
include publication repositories such as arXiv
Preparation work included uploading and parsing the sourced
Phase 2: Label selection and guideline.
the textual content of an element, which goes beyond visual layout
At first sight, the task of visual document-layout interpretation
Obviously, this inconsistency in annotations is not desirable for
- (1)
- (2)
- (3)
- (4)
- (5)
- (6)
The complete annotation guideline is over 100 pages long and a
Phase 3: Training.
Figure 4: Examples of plausible annotation alternatives for
were carried out over a timeframe of 12 weeks, after which 8 of the
Phase 4: Production annotation.
Table 2: Prediction performance (mAP@0.5-0.95) of object
to avoid this at any cost in order to have clear, unbiased baseline
5
The primary goal of DocLayNet is to obtain high-quality ML models
Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask
paper and leave the detailed evaluation of more recent methods
In this section, we will present several aspects related to the
Baselines for Object Detection
In Table 2, we present baseline experiments (given in mAP) on Mask
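The mAP@0.5-0.95 figures reported in the tables follow the COCO convention of averaging precision over IoU thresholds from 0.50 to 0.95 in steps of 0.05. The following is a minimal sketch of that idea for a single class on a single page. It is not the evaluation code used for the experiments: the function names are hypothetical, and it uses the plain (non-interpolated) area under the precision-recall curve rather than the 101-point interpolation of the COCO tooling.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds: List[Tuple[float, Box]], gts: List[Box], thr: float) -> float:
    """Non-interpolated AP at one IoU threshold (a simplification of COCO AP)."""
    if not gts:
        return 0.0  # convention chosen for this sketch
    matched = set()
    hits = []
    # Visit predictions from highest to lowest confidence score.
    for _score, box in sorted(preds, key=lambda p: -p[0]):
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            v = iou(box, gt)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j >= 0 and best_iou >= thr:
            matched.add(best_j)  # each ground-truth box is matched at most once
            hits.append(1)
        else:
            hits.append(0)
    tp, ap = 0, 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            tp += 1
            ap += tp / rank  # precision at the rank of each true positive
    return ap / len(gts)


def ap_50_95(preds: List[Tuple[float, Box]], gts: List[Box]) -> float:
    """Mean of AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [t / 100 for t in range(50, 100, 5)]
    return sum(average_precision(preds, gts, t) for t in thresholds) / len(thresholds)
```

A prediction that overlaps its ground-truth box with IoU 0.72 counts as correct at the five thresholds up to 0.70 and as a miss at the five thresholds from 0.75 upward, so a single loosely fitting box already halves the score. A dataset-level mAP would additionally average this quantity over classes and accumulate matches across all pages.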
Table 3: Performance of a Mask R-CNN R50 network in
Learning Curve
One of the fundamental questions related to any dataset is whether it is
Impact of Class Labels
The choice and number of labels can have a significant effect on
Table 4: Performance of a Mask R-CNN R50 network with
lists in PubLayNet (grouped list-items) versus DocLayNet (separate
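An experiment on the number of class labels requires collapsing the annotated label set into a coarser one before training. The sketch below shows one way this could be done. The mapping is purely illustrative and is not the grouping used in the experiments; the annotation layout (a dict with a "label" key) and the label name "List-item" are likewise assumptions.

```python
from typing import Dict, List, Optional

# Hypothetical mapping for illustration only: a target of None drops the class,
# and labels that are not listed are kept unchanged.
EXAMPLE_LABEL_MAP: Dict[str, Optional[str]] = {
    "Section-header": "Text",
    "List-item": "Text",
}


def remap_labels(annotations: List[dict], label_map: Dict[str, Optional[str]]) -> List[dict]:
    """Return a copy of the annotations with labels collapsed per label_map."""
    remapped = []
    for ann in annotations:
        new_label = label_map.get(ann["label"], ann["label"])
        if new_label is None:
            continue  # class removed from the reduced label set
        remapped.append({**ann, "label": new_label})
    return remapped
```

Applying the same mapping to the training and the test annotations keeps the comparison between label sets consistent.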
Impact of Document Split in Train and Test Set
Many documents in DocLayNet have a unique styling. In order
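Because pages from the same document tend to share a styling, a page-wise random split can place near-identical layouts in both the train and the test set. The sketch below illustrates the alternative of a document-wise split, in which all pages of a document land on the same side. The function name, the `page_to_doc` mapping, and the default `test_fraction` are assumptions made for illustration and do not reflect the actual split proportions of the dataset.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def split_by_document(
    page_to_doc: Dict[str, str],
    test_fraction: float = 0.1,  # assumed value, for illustration only
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Split page ids into train and test so that no document spans both sets."""
    doc_to_pages = defaultdict(list)
    for page, doc in page_to_doc.items():
        doc_to_pages[doc].append(page)

    # Shuffle documents, not pages, with a fixed seed for reproducibility.
    docs = sorted(doc_to_pages)
    random.Random(seed).shuffle(docs)
    n_test = max(1, round(len(docs) * test_fraction))
    test_docs = set(docs[:n_test])

    train = sorted(p for p, d in page_to_doc.items() if d not in test_docs)
    test = sorted(p for p, d in page_to_doc.items() if d in test_docs)
    return train, test
```

Comparing a model trained on such a split with one trained on a page-wise split gives an indication of how much the test score benefits from having seen pages of the same document during training.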
Dataset Comparison
Throughout this paper, we claim that DocLayNet's wider variety of
Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask
Section-header
For comparison of DocBank with DocLayNet, we trained only
Example Predictions
To conclude this section, we illustrate the quality of layout predic-
6
In this paper, we presented the DocLayNet dataset. It provides the
From the dataset, we have derived on the one hand reference
To date, there is still a significant gap between human and ML
Figure 6: Example layout predictions on selected pages from the DocLayNet test-set. (A, D) exhibit favourable results on