mirror of
https://github.com/DS4SD/docling.git
synced 2025-07-26 03:55:00 +00:00
* added the contentlayer to html-backend
  Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the handle_image function
  Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted code of html backend
  Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* test(html): add more info if a test case fails
  Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* refactor(html): put parsed item in body if doc has no header
  If an HTML document does not have any header tag, all parsed items are placed in DoclingDocument's body content layer.
  HTML paragraphs ('p' tags) are parsed as text items with paragraph label.
  Update test ground truth according to the changes above.
  Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore: set TextItem label to 'text' instead of 'paragraph'
  Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
20 lines
1.1 KiB
Plaintext
item-0 at level 0: unspecified: group _root_
  item-1 at level 1: title: Example Document
    item-2 at level 2: section_header: Introduction
      item-3 at level 3: text: This is the first paragraph of the introduction.
    item-4 at level 2: section_header: Background
      item-5 at level 3: text: Some background information here.
      item-6 at level 3: list: group list
        item-7 at level 4: list_item: First item in unordered list
          item-8 at level 5: list: group list
            item-9 at level 6: list_item: Nested item 1
            item-10 at level 6: list_item: Nested item 2
        item-11 at level 4: list_item: Second item in unordered list
      item-12 at level 3: ordered_list: group ordered list
        item-13 at level 4: list_item: First item in ordered list
          item-14 at level 5: ordered_list: group ordered list
            item-15 at level 6: list_item: Nested ordered item 1
            item-16 at level 6: list_item: Nested ordered item 2
        item-17 at level 4: list_item: Second item in ordered list
    item-18 at level 2: section_header: Data Table
      item-19 at level 3: table with [4x3]
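
The dump above encodes nesting through the "at level N" field on each line. As a minimal sketch, assuming only the line shape visible in this file ("item-<index> at level <level>: <label>: <text>"), the lines can be read back into a tree as follows. The names Node and parse_dump are hypothetical helpers written for this illustration and are not part of docling.

```python
import re
from dataclasses import dataclass, field

# Matches lines such as "  item-3 at level 3: text: Some paragraph."
# The pattern is inferred from the dump in this file, not from a docling spec.
LINE_RE = re.compile(r"^\s*item-(\d+) at level (\d+): (.*?)\s*$")


@dataclass
class Node:
    """One line of the dump (hypothetical helper type)."""
    index: int
    level: int
    label: str
    text: str
    children: list = field(default_factory=list)


def parse_dump(dump: str) -> Node:
    """Rebuild the nesting of a dump from the level number on each line."""
    root = None
    stack = []  # chain of open ancestors, deepest last
    for raw in dump.splitlines():
        match = LINE_RE.match(raw)
        if match is None:
            continue  # skip blank lines and anything that is not an item line
        index, level, rest = int(match.group(1)), int(match.group(2)), match.group(3)
        # Lines like "table with [4x3]" carry no ": " separator, so text stays empty.
        label, sep, text = rest.partition(": ")
        node = Node(index, level, label, text if sep else "")
        # Close every open node that is not a strict ancestor of this one.
        while stack and stack[-1].level >= level:
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        elif root is None:
            root = node
        else:
            raise ValueError(f"item-{index} has no parent in the dump")
        stack.append(node)
    if root is None:
        raise ValueError("no item lines found")
    return root
```

Calling parse_dump on the text of this file returns the group _root_ node. Its single child is the title item, whose children are the three section_header items. The indentation is ignored on purpose: the level number alone determines the parent, so the sketch also works on copies of the dump where leading whitespace was lost.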