mirror of https://github.com/DS4SD/docling.git
synced 2025-12-08 20:58:11 +00:00
refactor(HTML): handle text from styled html (#1960)
* A new HTML backend that handles styled HTML (ignoring the styling) as well as images. Images are parsed as placeholders with a caption, if one exists.

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: vaaale <2428222+vaaale@users.noreply.github.com>
Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>

* tests(HTML): re-enable test_ordered_lists

Re-enable the test_ordered_lists regression test for the HTML backend, since docling-core now supports ordered lists with a custom start value.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>
Co-authored-by: Alexander Vaagan <2428222+vaaale@users.noreply.github.com>
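To illustrate what "handling text from styled HTML" can mean in practice, here is a minimal sketch using only Python's standard `html.parser`. It is not docling's backend: the names `TextBlocks`, `BLOCK_TAGS`, and `extract_blocks` are hypothetical, and treating `<br>` as a block boundary is an assumption chosen to mirror the fixture change in this commit, where one text item containing a newline becomes two items.

```python
from html.parser import HTMLParser

# Hypothetical illustration only; this is NOT docling's HTML backend.
# Tags that start or end a text block (an assumed, incomplete set).
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3"}


class TextBlocks(HTMLParser):
    """Collect text per block element, ignoring inline styling tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def _flush(self):
        # Join buffered fragments and collapse runs of whitespace.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        # Assumption: <br> splits the current block in two.
        if tag in BLOCK_TAGS or tag == "br":
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Inline tags such as <span style=...> or <b> never flush,
        # so their text is simply appended to the current block.
        self._buf.append(data)


def extract_blocks(html):
    parser = TextBlocks()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up any trailing text outside a block tag
    return parser.blocks
```

For example, `extract_blocks('<div>This is a third div<br>with a new line.</div>')` yields two separate strings, and styled spans inside a block contribute only their text.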
committed by GitHub
parent 5d98bcea1b
commit a069b1175b
@@ -29,11 +29,17 @@
 {
 "$ref": "#/texts/3"
 },
+{
+"$ref": "#/texts/4"
+},
+{
+"$ref": "#/texts/5"
+},
 {
 "$ref": "#/groups/0"
 },
 {
-"$ref": "#/texts/6"
+"$ref": "#/texts/8"
 }
 ],
 "content_layer": "body",
@@ -48,10 +54,10 @@
 },
 "children": [
 {
-"$ref": "#/texts/4"
+"$ref": "#/texts/6"
 },
 {
-"$ref": "#/texts/5"
+"$ref": "#/texts/7"
 }
 ],
 "content_layer": "body",
@@ -66,6 +72,18 @@
 "$ref": "#/body"
 },
 "children": [],
+"content_layer": "furniture",
+"label": "title",
+"prov": [],
+"orig": "Sample HTML File",
+"text": "Sample HTML File"
+},
+{
+"self_ref": "#/texts/1",
+"parent": {
+"$ref": "#/body"
+},
+"children": [],
 "content_layer": "body",
 "label": "text",
 "prov": [],
@@ -73,7 +91,7 @@
 "text": "This is a div with text."
 },
 {
-"self_ref": "#/texts/1",
+"self_ref": "#/texts/2",
 "parent": {
 "$ref": "#/body"
 },
@@ -85,7 +103,7 @@
 "text": "This is another div with text."
 },
 {
-"self_ref": "#/texts/2",
+"self_ref": "#/texts/3",
 "parent": {
 "$ref": "#/body"
 },
@@ -97,7 +115,7 @@
 "text": "This is a regular paragraph."
 },
 {
-"self_ref": "#/texts/3",
+"self_ref": "#/texts/4",
 "parent": {
 "$ref": "#/body"
 },
@@ -105,11 +123,23 @@
 "content_layer": "body",
 "label": "text",
 "prov": [],
-"orig": "This is a third div\nwith a new line.",
-"text": "This is a third div\nwith a new line."
+"orig": "This is a third div",
+"text": "This is a third div"
 },
 {
-"self_ref": "#/texts/4",
+"self_ref": "#/texts/5",
 "parent": {
+"$ref": "#/body"
+},
+"children": [],
+"content_layer": "body",
+"label": "text",
+"prov": [],
+"orig": "with a new line.",
+"text": "with a new line."
+},
+{
+"self_ref": "#/texts/6",
+"parent": {
 "$ref": "#/groups/0"
 },
@@ -121,7 +151,7 @@
 "text": "Heading for the details element"
 },
 {
-"self_ref": "#/texts/5",
+"self_ref": "#/texts/7",
 "parent": {
 "$ref": "#/groups/0"
 },
@@ -133,7 +163,7 @@
 "text": "Description of the details element."
 },
 {
-"self_ref": "#/texts/6",
+"self_ref": "#/texts/8",
 "parent": {
 "$ref": "#/body"
 },
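
Most of the fixture changes above are renumbered references of the form `#/texts/N`. As a reading aid, here is a minimal sketch of how such a local reference can be resolved against the document JSON. The helper `resolve_ref` is hypothetical and is not part of docling or docling-core; the sample `doc` reuses only values visible in this diff and omits every other field.

```python
def resolve_ref(doc, ref):
    """Resolve a local reference such as '#/texts/4' or '#/body' against doc.

    Hypothetical helper for illustration; not a docling API.
    """
    if not ref.startswith("#/"):
        raise ValueError(f"unsupported reference: {ref!r}")
    node = doc
    for part in ref[2:].split("/"):
        # List segments are indexes, dict segments are keys.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# Trimmed-down sample built from the values shown in the diff.
doc = {
    "body": {"self_ref": "#/body"},
    "texts": [
        {"self_ref": "#/texts/0", "label": "title", "text": "Sample HTML File"},
        {"self_ref": "#/texts/1", "label": "text", "text": "This is a div with text."},
        {"self_ref": "#/texts/2", "label": "text", "text": "This is another div with text."},
        {"self_ref": "#/texts/3", "label": "text", "text": "This is a regular paragraph."},
        {"self_ref": "#/texts/4", "label": "text", "text": "This is a third div"},
        {"self_ref": "#/texts/5", "label": "text", "text": "with a new line."},
    ],
}

third_div = resolve_ref(doc, "#/texts/4")
```

Because items are addressed by position, inserting the title at index 0 and splitting one text item in two shifts every later `self_ref` and `$ref`, which is why so many hunks change only a number.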