mirror of https://github.com/DS4SD/docling.git
synced 2025-12-09 13:18:24 +00:00
feat(html): Support formatting tags in HTML texts (#2111)
* add parsing for formatting tags in HTML backend

  Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

  fix latest tests + wiki_duck result files.

  Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

* convert _collect_parent_format_tags to staticmethod

  Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

---------

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
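
For readers who want a feel for what "parsing formatting tags" involves, here is a minimal sketch. It is not the code from this commit. It assumes a tag-to-style table inferred from the ground-truth file added here (`b`/`strong` as bold, `i`/`em` as italic, `s`/`del` as strikethrough, `u`/`ins` as underline with no Markdown marker), and it uses the standard-library `html.parser` for self-containment rather than whatever parser the HTML backend relies on. The names `FORMAT_TAGS`, `FormatCollector`, `collect_runs`, and `to_markdown` are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical mapping, inferred from the ground-truth output in this commit;
# not taken from docling's source.
FORMAT_TAGS = {
    "b": "bold", "strong": "bold",
    "i": "italic", "em": "italic",
    "s": "strikethrough", "del": "strikethrough",
    "u": "underline", "ins": "underline",
}


class FormatCollector(HTMLParser):
    """Record each text run together with the formatting tags enclosing it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open formatting tags
        self.runs = []       # list of (text, frozenset of style flags)

    def handle_starttag(self, tag, attrs):
        if tag in FORMAT_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Remove the most recently opened matching tag, tolerating sloppy nesting.
        if tag in FORMAT_TAGS and tag in self.open_tags:
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        text = data.strip()
        if text:
            flags = frozenset(FORMAT_TAGS[t] for t in self.open_tags)
            self.runs.append((text, flags))


def collect_runs(html):
    """Return the (text, flags) runs found in an HTML fragment."""
    parser = FormatCollector()
    parser.feed(html)
    parser.close()
    return parser.runs


def to_markdown(text, flags):
    """Wrap text in Markdown markers; underline has no marker and stays plain."""
    if "strikethrough" in flags:
        text = f"~~{text}~~"
    if "italic" in flags:
        text = f"*{text}*"
    if "bold" in flags:
        text = f"**{text}**"
    return text
```

Under these assumptions, nested tags such as `<strong><em>…</em></strong>` yield a run carrying both the bold and italic flags, which `to_markdown` renders with triple asterisks, in line with the `***strong + emphasis***` entry in the ground-truth file below.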
45
tests/data/groundtruth/docling_v2/formatting.html.md
vendored
Normal file
@@ -0,0 +1,45 @@
# HTML Text Formatting Examples

This is a **bold (b)** example and right next to it we have a **strong emphasis (strong)** .

Notice that

**strong + bold mixed** looks similar but carries additional semantic meaning.

Here is an *italic (i)* word and an *emphasis (em)* example.

Sometimes we combine them like

*italic + emphasis together* .

Now let's look at text that appears crossed out: ~~strikethrough with s~~ and ~~deleted with del~~ .

You can also mix them:

~~double strikethrough (s + del)~~ .

To highlight insertions or underlines: underlined with u , inserted with ins .

A combination could be:

underline + insertion together .

Subscript and superscript examples:

Water is written as H

2 O using sub.

The mathematical expression x

2 + y 3 uses sup.

They can also be combined: CO

2 * .

Mixing several: This sentence has ***strong + emphasis*** ,

some

**bold + underline** , and a formula like a 2 + b 3 .