mirror of
https://github.com/DS4SD/docling.git
synced 2025-12-08 20:58:11 +00:00
refactor(HTML): handle text from styled html (#1960)
* A new HTML backend that handles styled HTML (ignores it) as well as images.
  Images are parsed as placeholders with a caption, if it exists.

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: vaaale <2428222+vaaale@users.noreply.github.com>
Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>

* tests(HTML): re-enable test_ordered_lists

Re-enable the test_ordered_lists regression test for the HTML backend, since docling-core now supports ordered lists with a custom start value.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>
Co-authored-by: Alexander Vaagan <2428222+vaaale@users.noreply.github.com>
committed by GitHub
parent 5d98bcea1b
commit a069b1175b
32
tests/data/groundtruth/docling_v2/example_09.html.md
vendored
Normal file
@@ -0,0 +1,32 @@
# Introduction to parsing HTML files with Docling

Docling

<!-- image -->

Docling simplifies document processing, parsing diverse formats — including HTML — and providing seamless integrations with the gen AI ecosystem.

## Supported file formats

Docling supports multiple file formats..

- Advanced PDF understanding
PDF

<!-- image -->
- Microsoft Office DOCX
DOCX

<!-- image -->
- HTML files (with optional support for images)
HTML

<!-- image -->

### Three backends for handling HTML files

Docling has three backends for parsing HTML files:

1. HTMLDocumentBackend Ignores images
2. HTMLDocumentBackendImagesInline Extracts images inline
3. HTMLDocumentBackendImagesReferenced Extracts images as references
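
The ground truth above shows the behaviour the commit message describes: styling is dropped, and each image becomes a `<!-- image -->` placeholder, with a short label next to it where one exists. The following is a minimal sketch of that idea using only Python's standard-library `html.parser`. It is not Docling's implementation. The names `PlaceholderHTMLParser` and `html_to_lines` are hypothetical, and treating the `alt` attribute as the caption (and emitting it before the placeholder) is an assumption made for illustration.

```python
from html.parser import HTMLParser


class PlaceholderHTMLParser(HTMLParser):
    """Hypothetical helper: keeps text, drops styling, marks images."""

    # Text inside these elements is discarded entirely.
    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__()
        self.lines = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            # Assumption: the alt text stands in for the caption.
            caption = dict(attrs).get("alt")
            if caption:
                self.lines.append(caption)
            self.lines.append("<!-- image -->")
        # Inline style attributes on other tags are simply never read.

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.lines.append(text)


def html_to_lines(html):
    """Return the text lines and image placeholders found in `html`."""
    parser = PlaceholderHTMLParser()
    parser.feed(html)
    parser.close()
    return parser.lines
```

Calling `html_to_lines` on a fragment that mixes a `<style>` block, a styled paragraph, and `<img>` tags returns only the paragraph text, any captions, and one placeholder per image.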