feat(html): Support formatting tags in HTML texts (#2111)

* add parsing for formatting tags in HTML backend

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

fix latest tests + wiki_duck result files.

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

* convert _collect_parent_format_tags to staticmethod

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

---------

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
krrome
2025-08-22 10:37:34 +02:00
committed by GitHub
parent e76298c40d
commit 94fcc46aa9
15 changed files with 9420 additions and 4456 deletions

@@ -27,6 +27,6 @@ HTML
 Docling has three backends for parsing HTML files:
-1. HTMLDocumentBackend Ignores images
-2. HTMLDocumentBackendImagesInline Extracts images inline
-3. HTMLDocumentBackendImagesReferenced Extracts images as references
+1. **HTMLDocumentBackend** Ignores images
+2. **HTMLDocumentBackendImagesInline** Extracts images inline
+3. **HTMLDocumentBackendImagesReferenced** Extracts images as references
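
For readers unfamiliar with what "parsing formatting tags" involves, here is a minimal sketch of the general idea. It is not the code from this commit and does not use Docling's API. It relies only on Python's standard `html.parser`, and the names `FORMAT_TAGS`, `FormatCollector`, and `collect_runs` are hypothetical, chosen for illustration. The assumption is that each text run should carry the set of formatting flags implied by the formatting tags currently open around it.

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML tag names to formatting flags (an assumption,
# not taken from the commit).
FORMAT_TAGS = {
    "b": "bold", "strong": "bold",
    "i": "italic", "em": "italic",
    "u": "underline",
    "s": "strikethrough", "del": "strikethrough",
}


class FormatCollector(HTMLParser):
    """Collects text runs together with the formatting flags active for each run."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open formatting tags
        self.runs = []       # list of (text, frozenset of flags)

    def handle_starttag(self, tag, attrs):
        if tag in FORMAT_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in FORMAT_TAGS and tag in self.open_tags:
            # Remove the most recently opened tag with this name.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        # Skip whitespace-only runs; keep the original text otherwise.
        if data.strip():
            flags = frozenset(FORMAT_TAGS[t] for t in self.open_tags)
            self.runs.append((data, flags))


def collect_runs(html):
    """Return a list of (text, flags) pairs for the given HTML fragment."""
    parser = FormatCollector()
    parser.feed(html)
    parser.close()
    return parser.runs
```

Calling `collect_runs("<p>plain <b>bold <i>both</i></b></p>")` yields three runs, where the innermost text inherits the flags of both enclosing tags. A stack is used here so that nested tags combine naturally. The real backend may track parent formatting differently.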