Introduction to parsing HTML files with Docling
Docling simplifies document processing by parsing diverse formats, including HTML, and providing seamless integrations with the gen AI ecosystem.
Supported file formats
Docling supports multiple file formats, including:
- PDF (with advanced PDF understanding)
- Microsoft Office DOCX
- HTML files (with optional support for images)
Three backends for handling HTML files
Docling has three backends for parsing HTML files:
- HTMLDocumentBackend: ignores images
- HTMLDocumentBackendImagesInline: extracts images inline
- HTMLDocumentBackendImagesReferenced: extracts images as references
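As a rough illustration of how one of the three backends might be selected, here is a minimal sketch. The backend class names come from the list above; the image-mode labels ("ignore", "inline", "referenced") and the helper functions are hypothetical, and the sketch assumes that all three classes can be imported from docling.backend.html_backend and passed to HTMLFormatOption, which may differ between Docling versions.

```python
from importlib import import_module

# Backend class names as listed above, keyed by a hypothetical image-mode label.
BACKEND_NAMES = {
    "ignore": "HTMLDocumentBackend",
    "inline": "HTMLDocumentBackendImagesInline",
    "referenced": "HTMLDocumentBackendImagesReferenced",
}


def backend_name_for(image_mode: str) -> str:
    """Return the backend class name for an image-handling mode."""
    try:
        return BACKEND_NAMES[image_mode]
    except KeyError:
        raise ValueError(f"unknown image mode: {image_mode!r}") from None


def build_html_converter(image_mode: str = "ignore"):
    """Build a DocumentConverter restricted to HTML input with the chosen backend.

    Docling is imported lazily, so backend_name_for works without it installed.
    """
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, HTMLFormatOption

    # Assumption: all three backend classes live in docling.backend.html_backend.
    module = import_module("docling.backend.html_backend")
    backend_cls = getattr(module, backend_name_for(image_mode))

    return DocumentConverter(
        allowed_formats=[InputFormat.HTML],
        format_options={InputFormat.HTML: HTMLFormatOption(backend=backend_cls)},
    )
```

With Docling installed, something like build_html_converter("inline").convert("page.html").document.export_to_markdown() would then parse a local file, where page.html is a placeholder path. Keeping the mode-to-class mapping in one dictionary makes it easy to adjust if the class names or module location differ in your installed version.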