docling/tests/data/html/html_rich_table_cells.html
Cesar Berrospi Ramis 0ba8d5d9e3 fix(html): slow table parsing (#2582)
* fix(html): simplify parsing of simple table cells

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(html): add test for rich table cells

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(html): ensure table cells with formatted text are parsed as RichTableCell

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(html): simplify process_rich_table_cells since only rich cells are processed

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(html): formatted cell runs should be parsed as text items respecting the order

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: pin latest docling-core and update uv.lock

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: upgrade dependencies on uv.lock

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-06 05:25:36 +01:00
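
The commit message above distinguishes simple table cells from cells with formatted text, which are parsed as `RichTableCell`. As a rough illustration of that distinction only, the sketch below classifies cells with the standard-library `html.parser`. The `RICH_TAGS` set, the `CellClassifier` class, and the `classify_cells` helper are hypothetical names chosen for this example. They are not part of docling, and the actual rule docling applies may differ.

```python
from html.parser import HTMLParser

# Tags whose presence inside a cell is treated here as making the cell "rich".
# This set is an assumption for illustration, not docling's actual rule.
RICH_TAGS = {"strong", "b", "em", "i", "u", "s", "sub", "sup", "small",
             "ul", "ol", "table", "a", "img", "br"}


class CellClassifier(HTMLParser):
    """Collect (text, is_rich) for every outermost <td>/<th> cell."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.cells = []     # finished cells as (text, is_rich) tuples
        self._depth = 0     # nesting depth of currently open td/th elements
        self._text = []     # text fragments of the current outermost cell
        self._rich = False  # whether a rich tag was seen in the current cell

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._depth += 1
            if self._depth == 1:
                # A new outermost cell starts: reset the accumulators.
                self._text, self._rich = [], False
        elif self._depth and tag in RICH_TAGS:
            self._rich = True

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace (including non-breaking spaces).
                text = " ".join("".join(self._text).split())
                self.cells.append((text, self._rich))

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


def classify_cells(html):
    """Return a list of (text, is_rich) for the outermost cells in `html`."""
    parser = CellClassifier()
    parser.feed(html)
    parser.close()
    return parser.cells
```

Run against the fixture below, a cell such as `<td>Mallard</td>` would come back as simple, while the cell with `<strong>`, `<em>`, `<u>`, and `<s>` runs, the list cells, and the cell holding the nested table would be flagged as rich. Cells nested inside an inner table are folded into their outermost parent cell rather than reported separately.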

168 lines
4.4 KiB
HTML
Vendored
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Rich Table Cells in HTML</title>
<style>
table { border-collapse: collapse; width: 90%; margin: 1em auto; }
th, td { border: 1px solid #aaa; padding: 0.5rem; text-align: left; vertical-align: top; }
th { background:#f2f2f2; }
</style>
</head>
<body>
<h1>Rich Table Cells in HTML</h1>
<!-- Simple data table -->
<table>
<caption>Basic duck facts</caption>
<thead>
<tr><th>Name</th><th>Habitat</th><th>Comment</th></tr>
</thead>
<tbody>
<!-- empty cell -->
<tr><td>Wood Duck</td><td>&nbsp;</td><td>Often seen near ponds.</td></tr>
<!-- plain text -->
<tr><td>Mallard</td><td>Ponds, lakes, rivers</td><td>Quack</td></tr>
<!-- formatted text -->
<tr>
<td>Goose (not a duck!)</td>
<td style="color:#777;">Water & wetlands</td>
<td><strong>Large</strong>, <em>loud</em>, <u>noisy</u>, <s>small</s></td>
</tr>
<!-- list -->
<tr>
<td>Teal</td>
<td>
<ul style="margin:0;padding-left:1.2rem;">
<li>Pond</li>
<li>Marsh</li>
<li>Riverbank</li>
</ul>
</td>
<td>
<ol style="margin:0;padding-left:1.2rem;">
<li>Fly south in winter</li>
<li>Build nest on ground</li>
</ol>
</td>
</tr>
</tbody>
</table>
<!-- Table with mixed cell content -->
<table>
<caption>Duck family tree (simplified)</caption>
<thead>
<tr><th>Genus</th><th>Species</th></tr>
</thead>
<tbody>
<tr>
<td>Aythya<br><small>(Diving ducks)</small></td>
<td>Hawser, Common Pochard</td>
</tr>
<tr>
<td>Lophonetta<br><small>(Pintail group)</small></td>
<td>Fulvous Whistling Duck</td>
</tr>
<tr>
<td>Oxyura<br><small>(Benthic ducks)</small></td>
<td>Wigee, Banded Waterscrew</td>
</tr>
</tbody>
</table>
<!-- Table with a mix of cell types and a nested table -->
<table>
<caption>Duck-related actions</caption>
<thead>
<tr style="background:#cce5ff;">
<th colspan="2">Action</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Swim</strong></td>
<td>Gracefully glide on H<sub>2</sub>O surfaces.</td>
</tr>
<tr>
<td><em>Fly</em></td>
<td>&nbsp;</td> <!-- empty cell -->
</tr>
<tr>
<td><u>Quack</u></td>
<td>
<table>
<thead>
<tr><th>Type</th><th>Sound</th></tr>
</thead>
<tbody>
<tr>
<td>Short</td>
<td>“quak”</td>
</tr>
<tr>
<td>Long</td>
<td>“quaaaaaack”</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<!-- Table with links -->
<table>
<caption>Famous Ducks with Images</caption>
<thead>
<tr><th>Name</th><th>Description</th><th>Image</th></tr>
</thead>
<tbody>
<!-- Plain link to a PNG/JPG file -->
<tr>
<td>Donald Duck</td>
<td>Cartoon character.</td>
<td><a href="https://en.wikipedia.org/wiki/Donald_Duck#/media/File:Donald_Duck_angry_transparent_background.png" target="_blank">View PNG</a></td>
</tr>
<!-- Thumbnail image that opens in a new tab -->
<tr>
<td>White-headed duck</td>
<td>A small diving duck some 45 cm (18 in) long.</td>
<td>
<a href="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Witkopeend_-_white-headed_duck_-_Oxyura_leucocephala_3.tif/lossy-page1-1920px-Witkopeend_-_white-headed_duck_-_Oxyura_leucocephala_3.tif.jpg" target="_blank">
<img class="thumb"
src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Witkopeend_-_white-headed_duck_-_Oxyura_leucocephala_3.tif/lossy-page1-250px-Witkopeend_-_white-headed_duck_-_Oxyura_leucocephala_3.tif.jpg"
alt="White-headed duck thumbnail">
</a>
</td>
</tr>
<!-- Text link to an image file -->
<tr>
<td>Mandarin Duck</td>
<td>Known for its striking plumage.</td>
<td>
<a href="https://upload.wikimedia.org/wikipedia/commons/thumb/7/75/Mandarin_duck_%28Aix_galericulata%29.jpg/250px-Mandarin_duck_%28Aix_galericulata%29.jpg" target="_blank">
View Full-Size Image
</a>
</td>
</tr>
<!-- Empty image cell (to illustrate the empty case) -->
<tr>
<td>Unknown Duck</td>
<td>No photo available.</td>
<td>&nbsp;</td>
</tr>
</tbody>
</table>
</body>
</html>