
## DocLayNet: A Large Human-Annotated Dataset for

Birgit Pfitzmann

Christoph Auer

Michele Dolfi

Ahmed S. Nassar

## ABSTRACT

Accurate document layout analysis is a key requirement for high-

## CCS CONCEPTS

•

Permission to make digital or hard copies of part or all of this work for personal or https://doi.org/10.1145/3534678.3539043

Peter Staar

Figure 1:

<!-- image -->

## KEYWORDS

PDF document conversion, layout segmentation, object-detection,

## ACM Reference Format:

Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter

## 1

Despite the substantial improvements achieved with machine-learning

A key problem in the process of document conversion is to under-

In this paper, we present the DocLayNet dataset. It provides page-

- (1)
- (2)
- (3)
- (4)

This enables experimentation with annotation uncertainty

- (5)

All aspects outlined above are detailed in Section 3. In Section 4,

In Section 5, we will present baseline accuracy numbers for a

## 2

While early approaches in document-layout analysis used rule-

Lately, new types of ML models for document-layout analysis

## 3

DocLayNet contains 80863 PDF pages. Among these, 7059 carry two

In addition to open intellectual property constraints for the

Figure 2: Distribution of DocLayNet pages across document

<!-- image -->

to a minimum, since they introduce difficulties in annotation (see

The pages in DocLayNet can be grouped into six distinct cate-

We did not control the document selection with regard to lan-

To ensure that future benchmarks in the document-layout analy-

Table 1 shows the overall frequency and distribution of the labels

In order to accommodate the different types of models currently

Despite being cost-intense and far less scalable than automation,

## 4

The annotation campaign was carried out in four phases. In phase

Figure 3: Corpus Conversion Service annotation user inter-

Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as %

<!-- image -->

we distributed the annotation workload and performed continuous

Phase 1: Data selection and preparation.

include publication repositories such as arXiv

Preparation work included uploading and parsing the sourced

Phase 2: Label selection and guideline.

the textual content of an element, which goes beyond visual layout

At first sight, the task of visual document-layout interpretation

Obviously, this inconsistency in annotations is not desirable for

- (1)
- (2)
- (3)
- (4)
- (5)
- (6)

The complete annotation guideline is over 100 pages long and a

Phase 3: Training.

Figure 4: Examples of plausible annotation alternatives for

<!-- image -->

were carried out over a timeframe of 12 weeks, after which 8 of the

Phase 4: Production annotation.

Table 2: Prediction performance (mAP@0.5-0.95) of object

to avoid this at any cost in order to have clear, unbiased baseline

## 5

The primary goal of DocLayNet is to obtain high-quality ML models

Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask

<!-- image -->

paper and leave the detailed evaluation of more recent methods

In this section, we will present several aspects related to the

## Baselines for Object Detection

In Table 2, we present baseline experiments (given in mAP) on Mask
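To make the metric named in the table captions concrete, here is a minimal sketch of the quantity that mAP@0.5-0.95 is built on: the intersection over union (IoU) of a predicted and a ground-truth bounding box, checked against a range of thresholds. It assumes the COCO convention of ten thresholds from 0.50 to 0.95 in steps of 0.05 and boxes given as `(x0, y0, x1, y1)`. The helper names `iou` and `matched_thresholds` are illustrative and not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap rectangle, clamped at zero.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def matched_thresholds(pred_box, gt_box):
    """Thresholds at which the prediction counts as a match."""
    score = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if score >= t]
```

A full mAP computation additionally ranks predictions by confidence, builds a precision-recall curve per class and threshold, and averages the resulting average-precision values. That part is omitted here.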

Table 3: Performance of a Mask R-CNN R50 network in

## Learning Curve

One of the fundamental questions related to any dataset is if it is

## Impact of Class Labels

The choice and number of labels can have a significant effect on

Table 4: Performance of a Mask R-CNN R50 network with

lists in PubLayNet (grouped list-items) versus DocLayNet (separate

## Impact of Document Split in Train and Test Set

Many documents in DocLayNet have a unique styling. In order
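As an illustration of what a document-wise split can look like, the sketch below assigns all pages of a document to the same side of the split, so that a document's unique styling never appears in both train and test. It is an assumption-laden example, not the procedure used for DocLayNet: the page records, the `doc_id` key, the `test_fraction` default and the function name `split_by_document` are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_document(pages, test_fraction=0.2, seed=0):
    """Split page records so that no document spans both sets.

    `pages` is a list of dicts with a hypothetical "doc_id" key.
    """
    by_doc = defaultdict(list)
    for page in pages:
        by_doc[page["doc_id"]].append(page)

    # Shuffle document ids (not pages) with a fixed seed for repeatability.
    doc_ids = sorted(by_doc)
    random.Random(seed).shuffle(doc_ids)

    n_test = max(1, int(len(doc_ids) * test_fraction))
    test_docs = set(doc_ids[:n_test])

    train = [p for d in doc_ids if d not in test_docs for p in by_doc[d]]
    test = [p for d in doc_ids if d in test_docs for p in by_doc[d]]
    return train, test
```

A page-wise split would instead shuffle the individual pages, which allows pages from the same document to land on both sides.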

## Dataset Comparison

Throughout this paper, we claim that DocLayNet's wider variety of

Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask


For comparison of DocBank with DocLayNet, we trained only

## Example Predictions

To conclude this section, we illustrate the quality of layout predic-

## 6

In this paper, we presented the DocLayNet dataset. It provides the

From the dataset, we have derived on the one hand reference

To date, there is still a significant gap between human and ML

## REFERENCES

- [1]
- [2]
- [3]
- [4]
- [5]
- [6]
- [7]
- [8]
- [9]
- [10]
- [11]
- [12]
- [13]


<!-- image -->

Figure 6: Example layout predictions on selected pages from the DocLayNet test-set. (A, D) exhibit favourable results on

Diaconu, Mai Thanh Minh, Marc, albinxavi, fatih, oleg, and wanghao yang. ul-

- [14]
- [15]
- [16]
- [17]
- [18]
- [19]
- [20]
- [21]
- [22]
- [23]