## DocLayNet: A Large Human-Annotated Dataset for

Birgit Pfitzmann

Christoph Auer

Michele Dolfi

Ahmed S. Nassar

## ABSTRACT

Accurate document layout analysis is a key requirement for high-

## CCS CONCEPTS


https://doi.org/10.1145/3534678.3539043

Peter Staar

Figure 1:
<!-- image -->

## KEYWORDS

PDF document conversion, layout segmentation, object-detection,

## ACM Reference Format:

Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter

## 1

Despite the substantial improvements achieved with machine-learning

A key problem in the process of document conversion is to under-

In this paper, we present the DocLayNet dataset. It provides page-

- (1)

- (2)

- (3)

- (4)

This enables experimentation with annotation uncertainty

- (5)

All aspects outlined above are detailed in Section 3. In Section 4,

In Section 5, we will present baseline accuracy numbers for a

## 2

While early approaches in document-layout analysis used rule-

Lately, new types of ML models for document-layout analysis

## 3

DocLayNet contains 80863 PDF pages. Among these, 7059 carry two

In addition to open intellectual property constraints for the

Figure 2: Distribution of DocLayNet pages across document
<!-- image -->

to a minimum, since they introduce difficulties in annotation (see

The pages in DocLayNet can be grouped into six distinct cate-

We did not control the document selection with regard to lan-

To ensure that future benchmarks in the document-layout analy-

Table 1 shows the overall frequency and distribution of the labels

In order to accommodate the different types of models currently

Despite being cost-intense and far less scalable than automation,

## 4

The annotation campaign was carried out in four phases. In phase

Figure 3: Corpus Conversion Service annotation user inter-


Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as %
<!-- image -->

we distributed the annotation workload and performed continuous

Phase 1: Data selection and preparation.

include publication repositories such as arXiv

Preparation work included uploading and parsing the sourced

Phase 2: Label selection and guideline.

the textual content of an element, which goes beyond visual layout

At first sight, the task of visual document-layout interpretation

Obviously, this inconsistency in annotations is not desirable for

- (1)

- (2)

- (3)

- (4)

- (5)

- (6)

The complete annotation guideline is over 100 pages long and a

Phase 3: Training.

Figure 4: Examples of plausible annotation alternatives for
<!-- image -->

were carried out over a timeframe of 12 weeks, after which 8 of the

Phase 4: Production annotation.

Table 2: Prediction performance (mAP@0.5-0.95) of object

to avoid this at any cost in order to have clear, unbiased baseline

## 5

The primary goal of DocLayNet is to obtain high-quality ML models

Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask
<!-- image -->

paper and leave the detailed evaluation of more recent methods

In this section, we will present several aspects related to the

## Baselines for Object Detection

In Table 2, we present baseline experiments (given in mAP) on Mask
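As a reminder of what the mAP@0.5-0.95 metric measures, the following minimal sketch computes the intersection-over-union (IoU) of two bounding boxes and lists the ten IoU thresholds (0.5 to 0.95 in steps of 0.05) over which average precision is averaged in the COCO convention. The function names and the `(x0, y0, x1, y1)` box layout are assumptions made for illustration; this is not the evaluation code used for the reported numbers.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# COCO-style thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_matched(pred_box, gt_box):
    """Thresholds at which a prediction would count as a match for gt_box."""
    overlap = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if overlap >= t]
```

A prediction that overlaps its ground-truth box only loosely counts as correct at the low thresholds but not at the high ones, which is why this averaged metric rewards tight localisation.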

Table 3: Performance of a Mask R-CNN R50 network in

## Learning Curve

One of the fundamental questions related to any dataset is if it is
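A learning curve of this kind is typically obtained by training on increasingly large portions of the training set. The sketch below shows one way to draw nested subsets, so that each larger subset contains all smaller ones. The helper name `nested_subsets`, the fractions, and the fixed seed are assumptions made for illustration, not the exact procedure behind the reported curve.

```python
import random


def nested_subsets(items, fractions, seed=0):
    """Return {fraction: subset}; smaller subsets are prefixes of larger ones."""
    shuffled = list(items)
    # A fixed seed keeps the subsets reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    return {
        f: shuffled[: round(len(shuffled) * f)]
        for f in sorted(fractions)
    }
```

Because every subset is a prefix of the same shuffled list, differences between points on the curve reflect the added data rather than a different random draw.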

## Impact of Class Labels

The choice and number of labels can have a significant effect on

Table 4: Performance of a Mask R-CNN R50 network with

lists in PubLayNet (grouped list-items) versus DocLayNet (separate
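Experiments with a reduced label set require mapping the original class labels onto coarser ones, or dropping them. The following minimal sketch shows such a remapping step. The annotation layout (a dict with a `"label"` key) and the entries of `EXAMPLE_MAPPING` are hypothetical and chosen for illustration only; they do not reproduce the label groupings evaluated in Table 4.

```python
# Hypothetical mapping for illustration; not the grouping used in the paper.
EXAMPLE_MAPPING = {
    "Text": "Text",
    "List-item": "Text",
    "Section-header": "Text",
}


def remap_labels(annotations, mapping, drop_unmapped=True):
    """Rewrite each annotation's label; optionally drop labels not in mapping."""
    remapped = []
    for ann in annotations:
        label = ann["label"]
        if label in mapping:
            # Copy the annotation so the input list is left untouched.
            remapped.append({**ann, "label": mapping[label]})
        elif not drop_unmapped:
            remapped.append(dict(ann))
    return remapped
```

Applying the mapping to both training and test annotations keeps the comparison between label sets consistent.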

## Impact of Document Split in Train and Test Set

Many documents in DocLayNet have a unique styling. In order
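One way to keep a document's styling from appearing on both sides of a split is to assign whole documents, rather than individual pages, to the train or test set. The sketch below illustrates this under assumed inputs: pages are given as hypothetical `(doc_id, page_id)` tuples, and the test fraction and seed are arbitrary choices, not the values used to build DocLayNet.

```python
import random
from collections import defaultdict


def split_by_document(pages, test_fraction=0.1, seed=0):
    """Split (doc_id, page_id) tuples so no document spans both sets."""
    by_doc = defaultdict(list)
    for doc_id, page_id in pages:
        by_doc[doc_id].append(page_id)

    # Shuffle document ids, not pages, so documents stay intact.
    doc_ids = sorted(by_doc)
    random.Random(seed).shuffle(doc_ids)
    n_test = max(1, round(len(doc_ids) * test_fraction))
    test_docs = set(doc_ids[:n_test])

    train = [(d, p) for d, p in pages if d not in test_docs]
    test = [(d, p) for d, p in pages if d in test_docs]
    return train, test
```

A page-wise random split of the same data would, by contrast, usually place pages from one document in both sets, which can inflate test scores for documents with a distinctive layout.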

## Dataset Comparison

Throughout this paper, we claim that DocLayNet's wider variety of

Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask


For comparison of DocBank with DocLayNet, we trained only

## Example Predictions

To conclude this section, we illustrate the quality of layout predic-

## 6

In this paper, we presented the DocLayNet dataset. It provides the

From the dataset, we have derived on the one hand reference

To date, there is still a significant gap between human and ML

<!-- image -->

Figure 6: Example layout predictions on selected pages from the DocLayNet test-set. (A, D) exhibit favourable results on