docling/tests/data/groundtruth/docling_v2/2206.01062.md
Christoph Auer 1b9fcf0edf Fix streams
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-03-11 16:31:42 +01:00

## DocLayNet: A Large Human-Annotated Dataset for
Birgit Pfitzmann
Christoph Auer
Michele Dolfi
Ahmed S. Nassar
## ABSTRACT
Accurate document layout analysis is a key requirement for high-
## CCS CONCEPTS
Permission to make digital or hard copies of part or all of this work for personal or https://doi.org/10.1145/3534678.3539043
Peter Staar
Figure 1:
<!-- image -->
## KEYWORDS
PDF document conversion, layout segmentation, object-detection,
## ACM Reference Format:
Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter
## 1
Despite the substantial improvements achieved with machine-learning
A key problem in the process of document conversion is to under-
In this paper, we present the DocLayNet dataset. It provides page-
- (1)
- (2)
- (3)
- (4)
This enables experimentation with annotation uncertainty
- (5)
All aspects outlined above are detailed in Section 3. In Section 4,
In Section 5, we will present baseline accuracy numbers for a
## 2
While early approaches in document-layout analysis used rule-
Lately, new types of ML models for document-layout analysis
## 3
DocLayNet contains 80863 PDF pages. Among these, 7059 carry two
In addition to open intellectual property constraints for the
Figure 2: Distribution of DocLayNet pages across document
<!-- image -->
to a minimum, since they introduce difficulties in annotation (see
The pages in DocLayNet can be grouped into six distinct cate-
We did not control the document selection with regard to lan-
To ensure that future benchmarks in the document-layout analy-
Table 1 shows the overall frequency and distribution of the labels
In order to accommodate the different types of models currently
Despite being cost-intense and far less scalable than automation,
## 4
The annotation campaign was carried out in four phases. In phase
Figure 3: Corpus Conversion Service annotation user inter-
Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as %
<!-- image -->
we distributed the annotation workload and performed continuous
Phase 1: Data selection and preparation.
include publication repositories such as arXiv
Preparation work included uploading and parsing the sourced
Phase 2: Label selection and guideline.
the textual content of an element, which goes beyond visual layout
At first sight, the task of visual document-layout interpretation
Obviously, this inconsistency in annotations is not desirable for
- (1)
- (2)
- (3)
- (4)
- (5)
- (6)
The complete annotation guideline is over 100 pages long and a
Phase 3: Training.
Figure 4: Examples of plausible annotation alternatives for
<!-- image -->
were carried out over a timeframe of 12 weeks, after which 8 of the
Phase 4: Production annotation.
Table 2: Prediction performance (mAP@0.5-0.95) of object
to avoid this at any cost in order to have clear, unbiased baseline
## 5
The primary goal of DocLayNet is to obtain high-quality ML models
Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask
<!-- image -->
paper and leave the detailed evaluation of more recent methods
In this section, we will present several aspects related to the
## Baselines for Object Detection
In Table 2, we present baseline experiments (given in mAP) on Mask
Table 3: Performance of a Mask R-CNN R50 network in
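The table and figure captions in this file report mAP@0.5-0.95. As a minimal sketch of what that notation conventionally means (the COCO-style convention, not necessarily the exact evaluation code behind these tables), average precision is computed at ten IoU thresholds from 0.50 to 0.95 in steps of 0.05 and then averaged. The box layout `(x0, y0, x1, y1)` and the helper names `iou` and `average_over_thresholds` are assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Hypothetical box layout for this sketch: (x0, y0, x1, y1) with x0 < x1 and y0 < y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height are clamped at zero so disjoint boxes give no overlap.
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds implied by "0.5-0.95": 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def average_over_thresholds(ap_per_threshold: Sequence[float]) -> float:
    """Average per-threshold AP values, one value per entry in IOU_THRESHOLDS."""
    if len(ap_per_threshold) != len(IOU_THRESHOLDS):
        raise ValueError("expected one AP value per IoU threshold")
    return sum(ap_per_threshold) / len(ap_per_threshold)
```

A prediction counts as a match at a given threshold only if its IoU with a ground-truth box of the same class reaches that threshold, so the averaged score rewards tight localization more than mAP@0.5 alone.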
## Learning Curve
One of the fundamental questions related to any dataset is if it is
## Impact of Class Labels
The choice and number of labels can have a significant effect on
Table 4: Performance of a Mask R-CNN R50 network with
lists in PubLayNet (grouped list-items) versus DocLayNet (separate
## Impact of Document Split in Train and Test Set
Many documents in DocLayNet have a unique styling. In order
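The paragraph above is truncated in this file, so the exact procedure is not recoverable from it. As a hedged illustration of the general idea in the heading, the sketch below splits pages by document rather than by page, so that no document (and therefore no document-specific styling) appears in both sets. The function name `split_by_document`, the page-to-document mapping, and the default `test_fraction` are all assumptions, not taken from the source.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def split_by_document(
    page_to_doc: Dict[str, str],
    test_fraction: float = 0.1,  # assumed default, not from the source
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Split page ids into train/test so that each document lands in one set only."""
    pages_by_doc = defaultdict(list)
    for page_id, doc_id in page_to_doc.items():
        pages_by_doc[doc_id].append(page_id)

    # Sort first so the shuffle is reproducible for a given seed.
    doc_ids = sorted(pages_by_doc)
    random.Random(seed).shuffle(doc_ids)

    # Hold out whole documents; always keep at least one document for testing.
    n_test = max(1, round(len(doc_ids) * test_fraction))
    test_docs = set(doc_ids[:n_test])

    train = [p for d in doc_ids if d not in test_docs for p in pages_by_doc[d]]
    test = [p for d in doc_ids if d in test_docs for p in pages_by_doc[d]]
    return train, test
```

A page-wise random split would instead shuffle the page ids directly, which lets pages of the same document fall on both sides and can make test scores look better than they would on unseen documents.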
## Dataset Comparison
Throughout this paper, we claim that DocLayNet's wider variety of
Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask
For comparison of DocBank with DocLayNet, we trained only
## Example Predictions
To conclude this section, we illustrate the quality of layout predic-
## 6
In this paper, we presented the DocLayNet dataset. It provides the
From the dataset, we have derived on the one hand reference
To date, there is still a significant gap between human and ML
<!-- image -->
Figure 6: Example layout predictions on selected pages from the DocLayNet test-set. (A, D) exhibit favourable results on