DocLayNet: A Large Human-Annotated Dataset for

Birgit Pfitzmann

Christoph Auer

Michele Dolfi

Ahmed S. Nassar

ABSTRACT

Accurate document layout analysis is a key requirement for high-

CCS CONCEPTS

Permission to make digital or hard copies of part or all of this work for personal or

https://doi.org/10.1145/3534678.3539043

Peter Staar

Figure 1:

KEYWORDS

PDF document conversion, layout segmentation, object-detection,

ACM Reference Format:

Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter

1

Despite the substantial improvements achieved with machine-learning

A key problem in the process of document conversion is to under-

In this paper, we present the DocLayNet dataset. It provides page-

This enables experimentation with annotation uncertainty

All aspects outlined above are detailed in Section 3. In Section 4,

In Section 5, we will present baseline accuracy numbers for a

2

While early approaches in document-layout analysis used rule-

Lately, new types of ML models for document-layout analysis

3

DocLayNet contains 80863 PDF pages. Among these, 7059 carry two

In addition to open intellectual property constraints for the

Figure 2: Distribution of DocLayNet pages across document

to a minimum, since they introduce difficulties in annotation (see

The pages in DocLayNet can be grouped into six distinct cate-

We did not control the document selection with regard to lan-

To ensure that future benchmarks in the document-layout analy-

Table 1 shows the overall frequency and distribution of the labels

In order to accommodate the different types of models currently

Despite being cost-intense and far less scalable than automation,

4

The annotation campaign was carried out in four phases. In phase

Figure 3: Corpus Conversion Service annotation user inter-

Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as %

we distributed the annotation workload and performed continuous

Phase 1: Data selection and preparation.

include publication repositories such as arXiv

Preparation work included uploading and parsing the sourced

Phase 2: Label selection and guideline.

the textual content of an element, which goes beyond visual layout

At first sight, the task of visual document-layout interpretation

Obviously, this inconsistency in annotations is not desirable for

The complete annotation guideline is over 100 pages long and a

Phase 3: Training.

Figure 4: Examples of plausible annotation alternatives for

were carried out over a timeframe of 12 weeks, after which 8 of the

Phase 4: Production annotation.

Table 2: Prediction performance (mAP@0.5-0.95) of object

to avoid this at any cost in order to have clear, unbiased baseline

5

The primary goal of DocLayNet is to obtain high-quality ML models

Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask

paper and leave the detailed evaluation of more recent methods

In this section, we will present several aspects related to the

Baselines for Object Detection

In Table 2, we present baseline experiments (given in mAP) on Mask
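As a reading aid for the mAP@0.5-0.95 figures quoted in the tables, the following minimal sketch shows the box-overlap measure that the metric is built on. It assumes the COCO-style convention, in which average precision is averaged over the ten intersection-over-union (IoU) thresholds 0.50, 0.55, ..., 0.95. The function names and the `(x0, y0, x1, y1)` box layout are illustrative choices and are not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95 (ten values).
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_matched(pred_box, gt_box):
    """Thresholds at which a prediction counts as a hit for a ground-truth box."""
    score = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if score >= t]
```

A prediction that overlaps its ground-truth box with an IoU of 0.72 counts as a match at the thresholds 0.50 through 0.70 and as a miss at the stricter ones, which is why the averaged metric rewards tight boxes.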

Table 3: Performance of a Mask R-CNN R50 network in

Learning Curve

One of the fundamental questions related to any dataset is if it is

Impact of Class Labels

The choice and number of labels can have a significant effect on

Table 4: Performance of a Mask R-CNN R50 network with

lists in PubLayNet (grouped list-items) versus DocLayNet (separate

Impact of Document Split in Train and Test Set

Many documents in DocLayNet have a unique styling. In order
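One common way to keep a document's styling from appearing in both the train and the test set is to split at document level, so that all pages of a document land on the same side. The sketch below illustrates that idea only. The `doc_id` field, the `test_fraction` default, and the function name are hypothetical and do not describe the dataset's actual schema or the authors' procedure.

```python
import random
from collections import defaultdict


def split_by_document(pages, test_fraction=0.1, seed=0):
    """Assign whole documents to train or test so no document spans both."""
    by_doc = defaultdict(list)
    for page in pages:
        by_doc[page["doc_id"]].append(page)  # "doc_id" is a hypothetical key

    # Sort before shuffling so the result depends only on the seed.
    doc_ids = sorted(by_doc)
    random.Random(seed).shuffle(doc_ids)

    n_test = max(1, round(len(doc_ids) * test_fraction))
    test_docs = set(doc_ids[:n_test])

    train = [p for d in doc_ids if d not in test_docs for p in by_doc[d]]
    test = [p for d in doc_ids if d in test_docs for p in by_doc[d]]
    return train, test
```

A page-wise random split would instead shuffle `pages` directly, which can place near-identical layouts from one document on both sides and inflate test scores.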

Dataset Comparison

Throughout this paper, we claim that DocLayNet's wider variety of

Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask

Section-header

For comparison of DocBank with DocLayNet, we trained only

Example Predictions

To conclude this section, we illustrate the quality of layout predic-

6

In this paper, we presented the DocLayNet dataset. It provides the

From the dataset, we have derived on the one hand reference

To date, there is still a significant gap between human and ML

Figure 6: Example layout predictions on selected pages from the DocLayNet test-set. (A, D) exhibit favourable results on
