refactor(HTML): handle text from styled html (#1960)

* A new HTML backend that handles styled HTML (it ignores the styling) as well as images.
  Images are parsed as placeholders with a caption, if it exists.

  Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
  Co-authored-by: vaaale <2428222+vaaale@users.noreply.github.com>
  Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
  Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
  Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>

* tests(HTML): re-enable test_ordered_lists

  Re-enable the test_ordered_lists regression test for the HTML backend, since docling-core now supports ordered lists with a custom start value.

  Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>
Co-authored-by: Alexander Vaagan <2428222+vaaale@users.noreply.github.com>
2025-12-08 20:58:11 +00:00 · 2025-07-22 13:16:31 +02:00
parent 5d98bcea1b
commit a069b1175b
15 changed files with 3241 additions and 2183 deletions
--- a/tests/data/groundtruth/docling_v2/example_09.html.md
+++ b/tests/data/groundtruth/docling_v2/example_09.html.md
@@ -0,0 +1,32 @@
+# Introduction to parsing HTML files with Docling
+
+Docling
+
+<!-- image -->
+
+Docling simplifies document processing, parsing diverse formats — including HTML — and providing seamless integrations with the gen AI ecosystem.
+
+## Supported file formats
+
+Docling supports multiple file formats..
+
+- Advanced PDF understanding
+PDF
+
+<!-- image -->
+- Microsoft Office DOCX
+DOCX
+
+<!-- image -->
+- HTML files (with optional support for images)
+HTML
+
+<!-- image -->
+
+### Three backends for handling HTML files
+
+Docling has three backends for parsing HTML files:
+
+1. HTMLDocumentBackend Ignores images
+2. HTMLDocumentBackendImagesInline Extracts images inline
+3. HTMLDocumentBackendImagesReferenced Extracts images as references
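
The behaviour described in the commit message (styling is ignored, images become placeholders with an optional caption) can be illustrated with a minimal standard-library sketch. This is not the Docling backend: the class `PlaceholderHTMLParser` and the function `html_to_lines` are hypothetical names, and taking the caption from the `alt` attribute is an assumption made only for this illustration. The commit does not say where the real backend gets its caption.

```python
from html.parser import HTMLParser

IMAGE_PLACEHOLDER = "<!-- image -->"


class PlaceholderHTMLParser(HTMLParser):
    """Hypothetical illustration only; not Docling's HTMLDocumentBackend."""

    # Elements whose text content is styling or code, not document text.
    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__()
        self.lines = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            # Assumption: use the alt text as the caption, if present.
            caption = (dict(attrs).get("alt") or "").strip()
            if caption:
                self.lines.append(caption)
            self.lines.append(IMAGE_PLACEHOLDER)
        # Inline style attributes are ignored because attrs are
        # never inspected for any other tag.

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse whitespace and keep only visible, non-empty text.
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.lines.append(text)


def html_to_lines(html):
    """Return the visible text lines and image placeholders in order."""
    parser = PlaceholderHTMLParser()
    parser.feed(html)
    parser.close()
    return parser.lines
```

Joining the returned lines with blank lines gives output in the same spirit as the groundtruth file above: a caption line followed by the image placeholder.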