## DocLayNet: A Large Human-Annotated Dataset for

Birgit Pfitzmann

Christoph Auer

Michele Dolfi

Ahmed S. Nassar
## ABSTRACT

Accurate document layout analysis is a key requirement for high-
## CCS CONCEPTS

•

Permission to make digital or hard copies of part or all of this work for personal or https://doi.org/10.1145/3534678.3539043

Peter Staar

Figure 1:

<!-- image -->
## KEYWORDS

PDF document conversion, layout segmentation, object-detection,
## ACM Reference Format:

Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter
## 1

Despite the substantial improvements achieved with machine-learning

A key problem in the process of document conversion is to under-

In this paper, we present the DocLayNet dataset. It provides page-

- (1)
- (2)
- (3)
- (4)

This enables experimentation with annotation uncertainty

- (5)

All aspects outlined above are detailed in Section 3. In Section 4,

In Section 5, we will present baseline accuracy numbers for a
## 2

While early approaches in document-layout analysis used rule-

Lately, new types of ML models for document-layout analysis
## 3

DocLayNet contains 80863 PDF pages. Among these, 7059 carry two

In addition to open intellectual property constraints for the

Figure 2: Distribution of DocLayNet pages across document

<!-- image -->

to a minimum, since they introduce difficulties in annotation (see

The pages in DocLayNet can be grouped into six distinct cate-

We did not control the document selection with regard to lan-

To ensure that future benchmarks in the document-layout analy-

Table 1 shows the overall frequency and distribution of the labels

In order to accommodate the different types of models currently

Despite being cost-intense and far less scalable than automation,
## 4

The annotation campaign was carried out in four phases. In phase

Figure 3: Corpus Conversion Service annotation user inter-

Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as %

<!-- image -->

we distributed the annotation workload and performed continuous

Phase 1: Data selection and preparation.

include publication repositories such as arXiv

Preparation work included uploading and parsing the sourced

Phase 2: Label selection and guideline.

the textual content of an element, which goes beyond visual layout

At first sight, the task of visual document-layout interpretation

Obviously, this inconsistency in annotations is not desirable for

- (1)
- (2)
- (3)
- (4)
- (5)
- (6)

The complete annotation guideline is over 100 pages long and a

Phase 3: Training.

Figure 4: Examples of plausible annotation alternatives for

<!-- image -->

were carried out over a timeframe of 12 weeks, after which 8 of the

Phase 4: Production annotation.

Table 2: Prediction performance (mAP@0.5-0.95) of object

to avoid this at any cost in order to have clear, unbiased baseline
## 5

The primary goal of DocLayNet is to obtain high-quality ML models

Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask

<!-- image -->

paper and leave the detailed evaluation of more recent methods

In this section, we will present several aspects related to the
## Baselines for Object Detection

In Table 2, we present baseline experiments (given in mAP) on Mask

Table 3: Performance of a Mask R-CNN R50 network in
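
The mAP@0.5-0.95 figures quoted in this section follow the usual COCO-style convention: a predicted box counts as a match when its intersection-over-union (IoU) with a ground-truth box reaches a threshold, and average precision is then averaged over the thresholds 0.50, 0.55, ..., 0.95. The sketch below only illustrates the IoU test and the threshold grid. It does not compute AP, and the names `iou`, `coco_iou_thresholds`, and `matched_thresholds` are illustrative helpers, not part of any evaluation toolkit used by the authors.

```python
from typing import List, Tuple

# A box is (x_min, y_min, x_max, y_max) in page coordinates (assumed layout).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def coco_iou_thresholds() -> List[float]:
    """The ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def matched_thresholds(pred: Box, truth: Box) -> List[float]:
    """Thresholds at which `pred` would count as a match for `truth`."""
    score = iou(pred, truth)
    return [t for t in coco_iou_thresholds() if score >= t]
```

For example, a prediction that overlaps its ground-truth box with an IoU of 0.72 is a match at the five thresholds from 0.50 to 0.70 and a miss at the remaining five, which is why tight localisation matters more for this metric than for mAP@0.5 alone.
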
## Learning Curve

One of the fundamental questions related to any dataset is if it is
## Impact of Class Labels

The choice and number of labels can have a significant effect on

Table 4: Performance of a Mask R-CNN R50 network with

lists in PubLayNet (grouped list-items) versus DocLayNet (separate
## Impact of Document Split in Train and Test Set

Many documents in DocLayNet have a unique styling. In order
## Dataset Comparison

Throughout this paper, we claim that DocLayNet's wider variety of

Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask

Section-header

For comparison of DocBank with DocLayNet, we trained only
## Example Predictions

To conclude this section, we illustrate the quality of layout predic-
## 6

In this paper, we presented the DocLayNet dataset. It provides the

From the dataset, we have derived on the one hand reference

To date, there is still a significant gap between human and ML
## REFERENCES

- [1]
- [2]
- [3]
- [4]
- [5]
- [6]
- [7]
- [8]
- [9]
- [10]
- [11]
- [12]
- [13]

Text

<!-- image -->

Figure 6: Example layout predictions on selected pages from the DocLayNet test-set. (A, D) exhibit favourable results on

Diaconu, Mai Thanh Minh, Marc, albinxavi, fatih, oleg, and wanghao yang. ul-

- [14]
- [15]
- [16]
- [17]
- [18]
- [19]
- [20]
- [21]
- [22]
- [23]