mirror of
https://github.com/DS4SD/docling.git
synced 2025-12-08 20:58:11 +00:00
feat(html): Support formatting tags in HTML texts (#2111)
* add parsing for formatting tags in HTML backend

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

fix latest tests + wiki_duck result files.

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

* convert _collect_parent_format_tags to staticmethod

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

---------

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
47
tests/data/html/formatting.html
vendored
Normal file
@@ -0,0 +1,47 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>HTML Formatting Tags Demo</title>
</head>
<body>
<h1>HTML Text Formatting Examples</h1>

<p>
This is a <b>bold (b)</b> example and right next to it we have a
<strong>strong emphasis (strong)</strong>.
Notice that <strong><b>strong + bold mixed</b></strong> looks similar but carries additional semantic meaning.
</p>

<p>
Here is an <i>italic (i)</i> word and an <em>emphasis (em)</em> example.
Sometimes we combine them like <i><em>italic + emphasis together</em></i>.
</p>

<p>
Now let's look at text that appears crossed out:
<s>strikethrough with s</s> and
<del>deleted with del</del>.
You can also mix them: <s><del>double strikethrough (s + del)</del></s>.
</p>

<p>
To highlight insertions or underlines:
<u>underlined with u</u>,
<ins>inserted with ins</ins>.
A combination could be: <u><ins>underline + insertion together</ins></u>.
</p>

<p>
Subscript and superscript examples:
Water is written as H<sub>2</sub>O using sub.
The mathematical expression x<sup>2</sup> + y<sup>3</sup> uses sup.
They can also be combined: CO<sub>2</sub><sup>*</sup>.
</p>

<p>
Mixing several: This sentence has <strong><em>strong + emphasis</em></strong>,
some <b><u>bold + underline</u></b>, and a formula like a<sup>2</sup> + b<sub>3</sub>.
</p>
</body>
</html>
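The commit message refers to a `_collect_parent_format_tags` helper, but its implementation is not shown on this page. As a rough illustration of what a fixture like this exercises, the sketch below walks an HTML snippet and records which formatting tags enclose each text run. It uses only Python's standard `html.parser` and is not docling's backend code; `FORMAT_TAGS`, `FormatCollector`, and `collect_runs` are hypothetical names, and the tag-to-flag mapping is an assumption.

```python
from html.parser import HTMLParser

# Assumed mapping from tag name to a formatting flag (hypothetical,
# not taken from docling): synonymous tags share one flag.
FORMAT_TAGS = {
    "b": "bold", "strong": "bold",
    "i": "italic", "em": "italic",
    "s": "strikethrough", "del": "strikethrough",
    "u": "underline", "ins": "underline",
    "sub": "subscript", "sup": "superscript",
}


class FormatCollector(HTMLParser):
    """Collect (text, active formatting flags) pairs from an HTML snippet."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open formatting tags, outermost first
        self.runs = []   # list of (text, frozenset of flags)

    def handle_starttag(self, tag, attrs):
        if tag in FORMAT_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close the innermost open tag with this name; ignore stray closers.
        if tag in self.stack:
            idx = len(self.stack) - 1 - self.stack[::-1].index(tag)
            del self.stack[idx]

    def handle_data(self, data):
        text = data.strip()
        if text:
            flags = frozenset(FORMAT_TAGS[t] for t in self.stack)
            self.runs.append((text, flags))


def collect_runs(html):
    """Return the text runs of `html` with the formatting active on each."""
    parser = FormatCollector()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.runs
```

Under this mapping, nested synonyms such as `<strong><b>…</b></strong>` collapse to a single `bold` flag, while `<b><u>…</u></b>` yields both `bold` and `underline`. Whether the real backend merges synonymous tags this way is not stated in the commit.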