feat(backend): add generic options support and HTML image handling modes (#2011)

* feat: add backend options support to document backends

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* feat: enhance document backends with generic backend options and improve HTML image handling

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* Refactor tests for declarativebackend

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(HTML): improve image caption handling and ensure backend options are set correctly

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix: enhance HTML backend image handling and add support for local file paths

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: Add ground truth data for test data

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(HTML): skip loading SVG files in image data handling

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(html): simplify backend options and address gaps

Use backend options for DeclarativeDocumentBackend classes, and only when necessary.
Refactor caption parsing in 'img' elements and remove dummy text.
Replace deprecated annotations from the typing library with native types.
Replace typing annotations according to pydantic guidelines.
Add some documentation with pydantic annotations.
Fix diff issue with test files.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(html): add tests and fix bugs

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(html): refactor backend options

Move backend option classes to their own module within the datamodel package.
Rename 'source_location' to 'source_uri' in HTMLBackendOptions.
Rename 'image_fetch' to 'fetch_images' in HTMLBackendOptions.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(markdown): create a class for the markdown backend options

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Legoshi
2025-10-21 12:52:17 +02:00
committed by GitHub
parent b66624bfff
commit a30e6a7614
31 changed files with 8088 additions and 7588 deletions
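
As a rough illustration of the image handling described in the commit message, the sketch below shows how an options object might decide whether an image is loaded and from where. Only the field names `fetch_images` and `source_uri` come from the commit message; the class `HtmlOptionsSketch` and the function `resolve_image_src` are hypothetical stand-ins written for this example and are not Docling's actual API.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urljoin, urlparse


@dataclass
class HtmlOptionsSketch:
    """Hypothetical stand-in for HTML backend options (not Docling's class)."""

    fetch_images: bool = False  # field name taken from the commit message
    source_uri: Optional[str] = None  # field name taken from the commit message


def resolve_image_src(src: str, options: HtmlOptionsSketch) -> Optional[str]:
    """Return the location an image would be loaded from, or None to skip it."""
    if not options.fetch_images:
        # Assumption: images are not loaded unless fetching is enabled.
        return None
    if urlparse(src).path.lower().endswith(".svg"):
        # One of the squashed commits skips SVG files when loading image data.
        return None
    if options.source_uri:
        # Relative references are resolved against the document's own location.
        return urljoin(options.source_uri, src)
    return src
```

With `source_uri` set to `https://example.com/docs/page.html`, a relative `src` of `img/a.png` resolves to `https://example.com/docs/img/a.png`; without a `source_uri`, the `src` is returned unchanged.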

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,20 @@
# Introduction
This is the first paragraph of the introduction.
## Background
Some background information here.
Example image
<!-- image -->
- First item in unordered list
- Second item in unordered list
1. First item in ordered list
2. Second item in ordered list
42. First item in ordered list with start
43. Second item in ordered list with start

View File

@@ -1,36 +0,0 @@
item-0 at level 0: unspecified: group _root_
item-1 at level 1: title: Introduction to parsing HTML files with Docling
item-2 at level 2: picture
item-2 at level 3: caption: Docling
item-3 at level 2: text: Docling simplifies document proc ... ntegrations with the gen AI ecosystem.
item-4 at level 2: section_header: Supported file formats
item-5 at level 3: text: Docling supports multiple file formats..
item-6 at level 3: list: group list
item-7 at level 4: list_item: Advanced PDF understanding
item-8 at level 4: picture
item-8 at level 5: caption: PDF
item-9 at level 4: list_item: Microsoft Office DOCX
item-10 at level 4: picture
item-10 at level 5: caption: DOCX
item-11 at level 4: list_item: HTML files (with optional support for images)
item-12 at level 4: picture
item-12 at level 5: caption: HTML
item-13 at level 3: section_header: Three backends for handling HTML files
item-14 at level 4: text: Docling has three backends for parsing HTML files:
item-15 at level 4: list: group ordered list
item-16 at level 5: list_item:
item-17 at level 6: inline: group group
item-18 at level 7: text: HTMLDocumentBackend
item-19 at level 7: text: Ignores images
item-20 at level 5: list_item:
item-21 at level 6: inline: group group
item-22 at level 7: text: HTMLDocumentBackendImagesInline
item-23 at level 7: text: Extracts images inline
item-24 at level 5: list_item:
item-25 at level 6: inline: group group
item-26 at level 7: text: HTMLDocumentBackendImagesReferenced
item-27 at level 7: text: Extracts images as references
item-28 at level 1: caption: Docling
item-29 at level 1: caption: PDF
item-30 at level 1: caption: DOCX
item-31 at level 1: caption: HTML

View File

@@ -1,560 +0,0 @@
{
"schema_name": "DoclingDocument",
"version": "1.7.0",
"name": "example_09",
"origin": {
"mimetype": "text/html",
"binary_hash": 6785336133244366107,
"filename": "example_09.html"
},
"furniture": {
"self_ref": "#/furniture",
"children": [],
"content_layer": "furniture",
"name": "_root_",
"label": "unspecified"
},
"body": {
"self_ref": "#/body",
"children": [
{
"$ref": "#/texts/0"
},
{
"$ref": "#/texts/1"
},
{
"$ref": "#/texts/6"
},
{
"$ref": "#/texts/8"
},
{
"$ref": "#/texts/10"
}
],
"content_layer": "body",
"name": "_root_",
"label": "unspecified"
},
"groups": [
{
"self_ref": "#/groups/0",
"parent": {
"$ref": "#/texts/3"
},
"children": [
{
"$ref": "#/texts/5"
},
{
"$ref": "#/pictures/1"
},
{
"$ref": "#/texts/7"
},
{
"$ref": "#/pictures/2"
},
{
"$ref": "#/texts/9"
},
{
"$ref": "#/pictures/3"
}
],
"content_layer": "body",
"name": "list",
"label": "list"
},
{
"self_ref": "#/groups/1",
"parent": {
"$ref": "#/texts/11"
},
"children": [
{
"$ref": "#/texts/13"
},
{
"$ref": "#/texts/16"
},
{
"$ref": "#/texts/19"
}
],
"content_layer": "body",
"name": "ordered list",
"label": "list"
},
{
"self_ref": "#/groups/2",
"parent": {
"$ref": "#/texts/13"
},
"children": [
{
"$ref": "#/texts/14"
},
{
"$ref": "#/texts/15"
}
],
"content_layer": "body",
"name": "group",
"label": "inline"
},
{
"self_ref": "#/groups/3",
"parent": {
"$ref": "#/texts/16"
},
"children": [
{
"$ref": "#/texts/17"
},
{
"$ref": "#/texts/18"
}
],
"content_layer": "body",
"name": "group",
"label": "inline"
},
{
"self_ref": "#/groups/4",
"parent": {
"$ref": "#/texts/19"
},
"children": [
{
"$ref": "#/texts/20"
},
{
"$ref": "#/texts/21"
}
],
"content_layer": "body",
"name": "group",
"label": "inline"
}
],
"texts": [
{
"self_ref": "#/texts/0",
"parent": {
"$ref": "#/body"
},
"children": [
{
"$ref": "#/pictures/0"
},
{
"$ref": "#/texts/2"
},
{
"$ref": "#/texts/3"
}
],
"content_layer": "body",
"label": "title",
"prov": [],
"orig": "Introduction to parsing HTML files with Docling",
"text": "Introduction to parsing HTML files with Docling"
},
{
"self_ref": "#/texts/1",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "caption",
"prov": [],
"orig": "Docling",
"text": "Docling"
},
{
"self_ref": "#/texts/2",
"parent": {
"$ref": "#/texts/0"
},
"children": [],
"content_layer": "body",
"label": "text",
"prov": [],
"orig": "Docling simplifies document processing, parsing diverse formats - including HTML - and providing seamless integrations with the gen AI ecosystem.",
"text": "Docling simplifies document processing, parsing diverse formats - including HTML - and providing seamless integrations with the gen AI ecosystem."
},
{
"self_ref": "#/texts/3",
"parent": {
"$ref": "#/texts/0"
},
"children": [
{
"$ref": "#/texts/4"
},
{
"$ref": "#/groups/0"
},
{
"$ref": "#/texts/11"
}
],
"content_layer": "body",
"label": "section_header",
"prov": [],
"orig": "Supported file formats",
"text": "Supported file formats",
"level": 1
},
{
"self_ref": "#/texts/4",
"parent": {
"$ref": "#/texts/3"
},
"children": [],
"content_layer": "body",
"label": "text",
"prov": [],
"orig": "Docling supports multiple file formats..",
"text": "Docling supports multiple file formats.."
},
{
"self_ref": "#/texts/5",
"parent": {
"$ref": "#/groups/0"
},
"children": [],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "Advanced PDF understanding",
"text": "Advanced PDF understanding",
"enumerated": false,
"marker": ""
},
{
"self_ref": "#/texts/6",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "caption",
"prov": [],
"orig": "PDF",
"text": "PDF"
},
{
"self_ref": "#/texts/7",
"parent": {
"$ref": "#/groups/0"
},
"children": [],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "Microsoft Office DOCX",
"text": "Microsoft Office DOCX",
"enumerated": false,
"marker": ""
},
{
"self_ref": "#/texts/8",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "caption",
"prov": [],
"orig": "DOCX",
"text": "DOCX"
},
{
"self_ref": "#/texts/9",
"parent": {
"$ref": "#/groups/0"
},
"children": [],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "HTML files (with optional support for images)",
"text": "HTML files (with optional support for images)",
"enumerated": false,
"marker": ""
},
{
"self_ref": "#/texts/10",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "body",
"label": "caption",
"prov": [],
"orig": "HTML",
"text": "HTML"
},
{
"self_ref": "#/texts/11",
"parent": {
"$ref": "#/texts/3"
},
"children": [
{
"$ref": "#/texts/12"
},
{
"$ref": "#/groups/1"
}
],
"content_layer": "body",
"label": "section_header",
"prov": [],
"orig": "Three backends for handling HTML files",
"text": "Three backends for handling HTML files",
"level": 2
},
{
"self_ref": "#/texts/12",
"parent": {
"$ref": "#/texts/11"
},
"children": [],
"content_layer": "body",
"label": "text",
"prov": [],
"orig": "Docling has three backends for parsing HTML files:",
"text": "Docling has three backends for parsing HTML files:"
},
{
"self_ref": "#/texts/13",
"parent": {
"$ref": "#/groups/1"
},
"children": [
{
"$ref": "#/groups/2"
}
],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "",
"text": "",
"enumerated": true,
"marker": ""
},
{
"self_ref": "#/texts/14",
"parent": {
"$ref": "#/groups/2"
},
"children": [],
"content_layer": "body",
"label": "text",
"prov": [],
"orig": "HTMLDocumentBackend",
"text": "HTMLDocumentBackend",
"formatting": {
"bold": true,
"italic": false,
"underline": false,
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/15",
"parent": {
"$ref": "#/groups/2"
},
"children": [],
"content_layer": "body",
"label": "text",
"prov": [],
"orig": "Ignores images",
"text": "Ignores images"
},
{
"self_ref": "#/texts/16",
"parent": {
"$ref": "#/groups/1"
},
"children": [
{
"$ref": "#/groups/3"
}
],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "",
"text": "",
"enumerated": true,
"marker": ""
},
{
"self_ref": "#/texts/17",
"parent": {
"$ref": "#/groups/3"
},
"children": [],
"content_layer": "body",
"label": "text",
"prov": [],
"orig": "HTMLDocumentBackendImagesInline",
"text": "HTMLDocumentBackendImagesInline",
"formatting": {
"bold": true,
"italic": false,
"underline": false,
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/18",
"parent": {
"$ref": "#/groups/3"
},
"children": [],
"content_layer": "body",
"label": "text",
"prov": [],
"orig": "Extracts images inline",
"text": "Extracts images inline"
},
{
"self_ref": "#/texts/19",
"parent": {
"$ref": "#/groups/1"
},
"children": [
{
"$ref": "#/groups/4"
}
],
"content_layer": "body",
"label": "list_item",
"prov": [],
"orig": "",
"text": "",
"enumerated": true,
"marker": ""
},
{
"self_ref": "#/texts/20",
"parent": {
"$ref": "#/groups/4"
},
"children": [],
"content_layer": "body",
"label": "text",
"prov": [],
"orig": "HTMLDocumentBackendImagesReferenced",
"text": "HTMLDocumentBackendImagesReferenced",
"formatting": {
"bold": true,
"italic": false,
"underline": false,
"strikethrough": false,
"script": "baseline"
}
},
{
"self_ref": "#/texts/21",
"parent": {
"$ref": "#/groups/4"
},
"children": [],
"content_layer": "body",
"label": "text",
"prov": [],
"orig": "Extracts images as references",
"text": "Extracts images as references"
}
],
"pictures": [
{
"self_ref": "#/pictures/0",
"parent": {
"$ref": "#/texts/0"
},
"children": [],
"content_layer": "body",
"label": "picture",
"prov": [],
"captions": [
{
"$ref": "#/texts/1"
}
],
"references": [],
"footnotes": [],
"annotations": []
},
{
"self_ref": "#/pictures/1",
"parent": {
"$ref": "#/groups/0"
},
"children": [],
"content_layer": "body",
"label": "picture",
"prov": [],
"captions": [
{
"$ref": "#/texts/6"
}
],
"references": [],
"footnotes": [],
"annotations": []
},
{
"self_ref": "#/pictures/2",
"parent": {
"$ref": "#/groups/0"
},
"children": [],
"content_layer": "body",
"label": "picture",
"prov": [],
"captions": [
{
"$ref": "#/texts/8"
}
],
"references": [],
"footnotes": [],
"annotations": []
},
{
"self_ref": "#/pictures/3",
"parent": {
"$ref": "#/groups/0"
},
"children": [],
"content_layer": "body",
"label": "picture",
"prov": [],
"captions": [
{
"$ref": "#/texts/10"
}
],
"references": [],
"footnotes": [],
"annotations": []
}
],
"tables": [],
"key_value_items": [],
"form_items": [],
"pages": {}
}
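
The JSON above links items through `$ref` strings such as `#/texts/1`, for example from each picture's `captions` list to a caption text. The following is a minimal sketch of following those references in a plain `dict` loaded from such a file; the helper names `resolve_ref` and `picture_captions` are made up for this example, and it assumes every reference has the `#/key/index` shape seen in this file.

```python
from typing import Any


def resolve_ref(doc: dict[str, Any], ref: str) -> Any:
    """Follow a '#/a/b/0'-style reference into a nested dict/list structure."""
    if not ref.startswith("#/"):
        raise ValueError(f"unsupported reference: {ref!r}")
    node: Any = doc
    for part in ref[2:].split("/"):
        # Lists are indexed by integer position, dicts by key.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def picture_captions(doc: dict[str, Any]) -> list[list[str]]:
    """Collect the caption texts of each picture, in document order."""
    return [
        [resolve_ref(doc, cap["$ref"])["text"] for cap in pic["captions"]]
        for pic in doc["pictures"]
    ]
```

Applied to the file above after `json.load`, `picture_captions` would look up `#/texts/1`, `#/texts/6`, `#/texts/8` and `#/texts/10` for the four pictures.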

View File

@@ -1,32 +0,0 @@
# Introduction to parsing HTML files with Docling
Docling
<!-- image -->
Docling simplifies document processing, parsing diverse formats - including HTML - and providing seamless integrations with the gen AI ecosystem.
## Supported file formats
Docling supports multiple file formats..
- Advanced PDF understanding
PDF
<!-- image -->
- Microsoft Office DOCX
DOCX
<!-- image -->
- HTML files (with optional support for images)
HTML
<!-- image -->
### Three backends for handling HTML files
Docling has three backends for parsing HTML files:
1. **HTMLDocumentBackend** Ignores images
2. **HTMLDocumentBackendImagesInline** Extracts images inline
3. **HTMLDocumentBackendImagesReferenced** Extracts images as references

View File

@@ -17,6 +17,12 @@
"body": {
"self_ref": "#/body",
"children": [
{
"$ref": "#/texts/0"
},
{
"$ref": "#/pictures/0"
},
{
"$ref": "#/groups/0"
}
@@ -33,7 +39,7 @@
},
"children": [
{
"$ref": "#/texts/0"
"$ref": "#/texts/1"
}
],
"content_layer": "body",
@@ -44,6 +50,18 @@
"texts": [
{
"self_ref": "#/texts/0",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "furniture",
"label": "caption",
"prov": [],
"orig": "Image alt text",
"text": "Image alt text"
},
{
"self_ref": "#/texts/1",
"parent": {
"$ref": "#/groups/0"
},
@@ -57,7 +75,26 @@
"level": 1
}
],
"pictures": [],
"pictures": [
{
"self_ref": "#/pictures/0",
"parent": {
"$ref": "#/body"
},
"children": [],
"content_layer": "furniture",
"label": "picture",
"prov": [],
"captions": [
{
"$ref": "#/texts/0"
}
],
"references": [],
"footnotes": [],
"annotations": []
}
],
"tables": [],
"key_value_items": [],
"form_items": [],

View File

@@ -1,7 +1,7 @@
item-0 at level 0: unspecified: group _root_
item-1 at level 1: caption: Image Hyperlink.
item-1 at level 1: caption: Clickable Example
item-2 at level 1: picture
item-2 at level 2: caption: Image Hyperlink.
item-2 at level 2: caption: Clickable Example
item-3 at level 1: caption: This is an example caption for the image.
item-4 at level 1: picture
item-4 at level 2: caption: This is an example caption for the image.

View File

@@ -66,8 +66,8 @@
"content_layer": "body",
"label": "caption",
"prov": [],
"orig": "Image Hyperlink.",
"text": "Image Hyperlink.",
"orig": "Clickable Example",
"text": "Clickable Example",
"hyperlink": "https://www.example.com/"
},
{

View File

@@ -1,4 +1,4 @@
Image Hyperlink.
Clickable Example
<!-- image -->

File diff suppressed because it is too large

View File

@@ -4,7 +4,7 @@
<p>This is the first paragraph of the introduction.</p>
<h2>Background</h2>
<p>Some background information here.</p>
<img src="image1.png" alt="Example image"/>
<img src="example_image_01.png" alt="Example image"/>
<ul>
<li>First item in unordered list</li>
<li>Second item in unordered list</li>
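
In the test data, the `alt` text of an `img` element ("Example image") appears to surface as a caption line ahead of the `<!-- image -->` placeholder in the Markdown ground truth. The sketch below reproduces that mapping with the standard library's `html.parser`; the class `ImgAltCollector` and the function `image_placeholder_lines` are illustrative names only and do not reflect how the HTML backend is implemented.

```python
from html.parser import HTMLParser


class ImgAltCollector(HTMLParser):
    """Collect (src, alt) pairs for every img element in an HTML string."""

    def __init__(self) -> None:
        super().__init__()
        self.images: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img .../> are also routed through here.
        if tag == "img":
            attr = dict(attrs)
            self.images.append((attr.get("src") or "", attr.get("alt") or ""))


def image_placeholder_lines(html: str) -> list[str]:
    """Emit an optional caption line, then a placeholder, for each image."""
    parser = ImgAltCollector()
    parser.feed(html)
    parser.close()
    lines: list[str] = []
    for _src, alt in parser.images:
        if alt:
            lines.append(alt)  # assumption: the alt text is used as the caption
        lines.append("<!-- image -->")
    return lines
```

An image without an `alt` attribute yields only the placeholder line in this sketch.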

View File

@@ -1,21 +0,0 @@
<html>
<body>
<h1>Introduction to parsing HTML files with <img src="https://docling-project.github.io/docling/assets/logo.png" alt="Docling" height="64"> Docling</h1>
<p>Docling simplifies document processing, parsing diverse formats — including HTML — and providing seamless integrations with the gen AI ecosystem.</p>
<h2>Supported file formats</h2>
<p>Docling supports multiple file formats..</p>
<ul>
<li><img src="https://github.com/docling-project/docling/tree/main/docs/assets/pdf.png" height="32" alt="PDF">Advanced PDF understanding</li>
<li><img src="https://github.com/docling-project/docling/tree/main/docs/assets/docx.png" height="32" alt="DOCX">Microsoft Office DOCX</li>
<li><img src="https://github.com/docling-project/docling/tree/main/docs/assets/html.png" height="32" alt="HTML">HTML files (with optional support for images)</li>
</ul>
<h3>Three backends for handling HTML files</h3>
<p>Docling has three backends for parsing HTML files:</p>
<ol>
<li><b>HTMLDocumentBackend</b> Ignores images</li>
<li><b>HTMLDocumentBackendImagesInline</b> Extracts images inline</li>
<li><b>HTMLDocumentBackendImagesReferenced</b> Extracts images as references</li>
</ol>
</body>
</html>

BIN
tests/data/html/example_image_01.png vendored Normal file

Binary file not shown.

Size: 548 KiB