Introduction to parsing HTML files with Docling
Docling
Docling simplifies document processing, parsing diverse formats - including HTML - and providing seamless integrations with the gen AI ecosystem.
Supported file formats
Docling supports multiple file formats:
- Advanced PDF understanding
- Microsoft Office DOCX
- HTML files (with optional support for images)
Three backends for handling HTML files
Docling has three backends for parsing HTML files:
- HTMLDocumentBackend: ignores images
- HTMLDocumentBackendImagesInline: extracts images inline
- HTMLDocumentBackendImagesReferenced: extracts images as references
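The difference between the three image-handling behaviours listed above can be illustrated with a small standalone sketch. This is not Docling's implementation and does not use Docling's API. It relies only on Python's standard `html.parser`, and the names `ImageModeParser`, `parse_html`, and the mode strings `"ignore"`, `"inline"`, and `"referenced"` are hypothetical. It also assumes that "inline" means keeping the image source in place in the content stream, and that "referenced" means leaving an index in the stream that points into a separate list of sources.

```python
from html.parser import HTMLParser


class ImageModeParser(HTMLParser):
    """Hypothetical parser: collects text and handles <img> per the chosen mode."""

    MODES = ("ignore", "inline", "referenced")

    def __init__(self, mode="ignore"):
        super().__init__()
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        self.mode = mode
        self.items = []        # ordered content stream of (kind, value) tuples
        self.references = []   # image sources, filled only in "referenced" mode

    def handle_starttag(self, tag, attrs):
        # Only <img> tags matter here; "ignore" mode drops them entirely.
        if tag != "img" or self.mode == "ignore":
            return
        src = dict(attrs).get("src")
        if not src:
            return
        if self.mode == "inline":
            # Assumption: keep the image source directly in the content stream.
            self.items.append(("image", src))
        else:
            # Assumption: store the source separately and keep only its index.
            self.references.append(src)
            self.items.append(("image_ref", len(self.references) - 1))

    def handle_data(self, data):
        # Keep non-empty text runs, stripped of surrounding whitespace.
        text = data.strip()
        if text:
            self.items.append(("text", text))


def parse_html(html, mode="ignore"):
    """Parse an HTML string; return (items, references) for the given mode."""
    parser = ImageModeParser(mode)
    parser.feed(html)
    parser.close()
    return parser.items, parser.references
```

For example, calling `parse_html('<p>Duck</p><img src="duck.png"><p>Pond</p>', mode)` with each of the three modes yields the same two text items, and the image appears only in the "inline" and "referenced" results.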