mirror of
https://github.com/DS4SD/docling.git
synced 2025-07-29 21:44:32 +00:00
chore(HTML): replace non-standard Unicode characters for better downstream tasks
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Introduction to parsing HTML files with Docling
Docling simplifies document processing: it parses diverse formats, including HTML, and integrates with the generative AI ecosystem.
Supported file formats
Docling supports multiple file formats, including:
- PDF (advanced PDF understanding)
- Microsoft Office DOCX
- HTML files (with optional support for images)
Three backends for handling HTML files
Docling has three backends for parsing HTML files, which differ only in how they treat images:
- HTMLDocumentBackend: ignores images
- HTMLDocumentBackendImagesInline: extracts images inline
- HTMLDocumentBackendImagesReferenced: extracts images as references
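
To make the difference between the three image strategies concrete, here is a minimal standard-library sketch. It is not Docling's implementation and does not call the Docling API. The helper `handle_images` and its `mode` values are hypothetical names, and the reading of "inline" as "embed the image bytes" and "referenced" as "keep only the source URI" is an assumption made for illustration.

```python
import base64
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def handle_images(html, mode):
    """Hypothetical helper mimicking the three strategies.

    mode == "ignore":     drop all images
    mode == "referenced": keep each image as its source URI (a string)
    mode == "inline":     keep the decoded bytes of base64 data URIs
                          (assumption: this sketch does no file or
                          network I/O, so other sources are skipped)
    """
    collector = ImgCollector()
    collector.feed(html)

    if mode == "ignore":
        return []
    if mode == "referenced":
        return list(collector.sources)
    if mode == "inline":
        images = []
        for src in collector.sources:
            if src.startswith("data:") and ";base64," in src:
                payload = src.split(",", 1)[1]
                images.append(base64.b64decode(payload))
        return images
    raise ValueError(f"unknown mode: {mode!r}")


sample = (
    '<p>Report</p>'
    '<img src="data:image/png;base64,aGVsbG8=">'
    '<img src="figures/chart.png">'
)

ignored = handle_images(sample, "ignore")
referenced = handle_images(sample, "referenced")
inlined = handle_images(sample, "inline")
```

In this sketch the "referenced" mode keeps the output small and defers loading, while the "inline" mode makes the result self-contained at the cost of size. Whether the Docling backends make the same trade-off is an assumption, not something stated here.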