refactor: add the contentlayer to html-backend (#1040)

* added the contentlayer to html-backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the handle_image function

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted code of html backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* test(html): add more info if a test case fails

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* refactor(html): put parsed item in body if doc has no header

If an HTML document does not have any header tag, all parsed items are placed
in the DoclingDocument's body content layer.
HTML paragraphs ('p' tags) are parsed as text items with the paragraph label.
Update the test ground truth according to the changes above.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
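
The rule described in this change can be illustrated with a minimal, standalone sketch. It is not the docling backend code: the `assign_layers` helper, the `_Collector` class, and the `section_header` label are hypothetical names. It also assumes that, when headers do exist, items before the first header go to a "furniture" layer, which this commit message does not state. The item label follows the final state of this PR (`text`).

```python
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class _Collector(HTMLParser):
    """Collect (tag, text) pairs for header and 'p' tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._tag = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS or tag == "p":
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = "".join(self._buf).strip()
            if text:
                self.items.append((tag, text))
            self._tag = None


def assign_layers(html):
    """Return (layer, label, text) triples for the parsed items."""
    parser = _Collector()
    parser.feed(html)
    parser.close()
    has_header = any(tag in HEADER_TAGS for tag, _ in parser.items)
    # From the commit message: no header tag at all means everything is body.
    # Assumption (not in the commit message): otherwise, items before the
    # first header start out in a "furniture" layer.
    layer = "furniture" if has_header else "body"
    result = []
    for tag, text in parser.items:
        if tag in HEADER_TAGS:
            layer = "body"  # the first header switches to the body layer
            label = "section_header"  # hypothetical label for illustration
        else:
            label = "text"
        result.append((layer, label, text))
    return result
```

With this sketch, a document made only of 'p' tags yields items that are all in the body layer with the `text` label, which mirrors the updated ground truth.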

* chore: set TextItem label to 'text' instead of 'paragraph'

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Peter W. J. Staar
2025-03-02 10:37:53 -05:00
committed by GitHub
parent db3ceefd4a
commit e25d557c06
13 changed files with 623 additions and 709 deletions

@@ -1,7 +1,7 @@
 item-0 at level 0: unspecified: group _root_
-item-1 at level 1: paragraph: This is a div with text.
-item-2 at level 1: paragraph: This is another div with text.
-item-3 at level 1: paragraph: This is a regular paragraph.
-item-4 at level 1: paragraph: This is a third div
+item-1 at level 1: text: This is a div with text.
+item-2 at level 1: text: This is another div with text.
+item-3 at level 1: text: This is a regular paragraph.
+item-4 at level 1: text: This is a third div
 with a new line.
-item-5 at level 1: paragraph: This is a fourth div with a bold paragraph.
+item-5 at level 1: text: This is a fourth div with a bold paragraph.