mirror of https://github.com/DS4SD/docling.git synced 2025-07-26 03:55:00 +00:00

Alexander Vaagan 713d7a3342 A new HTML backend that handles styled HTML (ignores it) as well as images.

Images are parsed as placeholders with a caption, if it exists.

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: vaaale <2428222+vaaale@users.noreply.github.com>
Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>

2025-07-21 13:29:14 +02:00

720 B


Introduction to parsing HTML files with Docling

Docling

Docling simplifies document processing, parsing diverse formats — including HTML — and providing seamless integrations with the gen AI ecosystem.

Supported file formats

Docling supports multiple file formats.

PDF: Advanced PDF understanding

DOCX: Microsoft Office DOCX

HTML: HTML files (with optional support for images)

Three backends for handling HTML files

Docling has three backends for parsing HTML files:

HTMLDocumentBackend: ignores images
HTMLDocumentBackendImagesInline: extracts images inline
HTMLDocumentBackendImagesReferenced: extracts images as references
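
The difference between the three behaviors can be illustrated with a small standard-library sketch. It does not call Docling and is not how the backends are implemented. `ImageMode`, `SketchHTMLParser`, `parse`, and the `load_bytes` callable are hypothetical names introduced only for this illustration. It also assumes that "inline" means embedding the image bytes as a data URI, that "referenced" means keeping the original `src`, and that the `alt` attribute can stand in for a caption.

```python
import base64
from enum import Enum
from html.parser import HTMLParser


class ImageMode(Enum):
    """Hypothetical switch mirroring the three behaviors listed above."""
    IGNORE = "ignore"          # drop <img> elements entirely
    INLINE = "inline"          # assumption: embed image bytes as a data URI
    REFERENCED = "referenced"  # assumption: keep only the original src


class SketchHTMLParser(HTMLParser):
    """Collects text and picture placeholders. NOT Docling's implementation."""

    def __init__(self, mode, load_bytes=None):
        super().__init__()
        self.mode = mode
        self.load_bytes = load_bytes  # hypothetical callable: src -> bytes
        self.items = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append({"kind": "text", "text": text})

    def handle_starttag(self, tag, attrs):
        if tag != "img" or self.mode is ImageMode.IGNORE:
            return
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        # Assumption: alt text stands in for the caption, None if absent.
        item = {"kind": "picture", "caption": attributes.get("alt") or None}
        if self.mode is ImageMode.INLINE:
            payload = base64.b64encode(self.load_bytes(src)).decode("ascii")
            # The MIME type is hard-coded to keep the sketch short.
            item["uri"] = "data:image/png;base64," + payload
        else:
            item["uri"] = src
        self.items.append(item)


def parse(html, mode, load_bytes=None):
    parser = SketchHTMLParser(mode, load_bytes)
    parser.feed(html)
    parser.close()
    return parser.items


SAMPLE = '<p>Intro</p><img src="logo.png" alt="Project logo">'
FAKE_FILES = {"logo.png": b"\x89PNG"}  # stand-in for reading from disk

ignored = parse(SAMPLE, ImageMode.IGNORE)
inline = parse(SAMPLE, ImageMode.INLINE, FAKE_FILES.__getitem__)
referenced = parse(SAMPLE, ImageMode.REFERENCED)
```

In this sketch the ignoring mode yields only the text item, while the other two modes add a picture placeholder that differs only in what its `uri` field holds. The real backends may resolve captions and image data differently.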
