refactor(HTML): handle text from styled html (#1960)

* A new HTML backend that handles styled HTML (ignores the styling) as well as images.

Images are parsed as placeholders with a caption, if it exists.

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: vaaale <2428222+vaaale@users.noreply.github.com>
Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>

* tests(HTML): re-enable test_ordered_lists

Re-enable the test_ordered_lists regression test for the HTML backend, since
docling-core now supports ordered lists with a custom start value.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>
Co-authored-by: Alexander Vaagan <2428222+vaaale@users.noreply.github.com>
Cesar Berrospi Ramis
2025-07-22 13:16:31 +02:00
committed by GitHub
parent 5d98bcea1b
commit a069b1175b
15 changed files with 3241 additions and 2183 deletions

@@ -0,0 +1,32 @@
# Introduction to parsing HTML files with Docling
Docling
<!-- image -->
Docling simplifies document processing, parsing diverse formats — including HTML — and providing seamless integrations with the gen AI ecosystem.
## Supported file formats
Docling supports multiple file formats..
- Advanced PDF understanding
PDF
<!-- image -->
- Microsoft Office DOCX
DOCX
<!-- image -->
- HTML files (with optional support for images)
HTML
<!-- image -->
### Three backends for handling HTML files
Docling has three backends for parsing HTML files:
1. HTMLDocumentBackend Ignores images
2. HTMLDocumentBackendImagesInline Extracts images inline
3. HTMLDocumentBackendImagesReferenced Extracts images as references