mirror of
https://github.com/DS4SD/docling.git
synced 2025-07-31 14:34:40 +00:00
* fix(html-backend): improve accordion extraction and hidden content handling

  - Add specialized handlers for Bootstrap accordion components to properly extract questions from panel-title elements
  - Implement is_hidden_element() method to detect and skip content with hidden classes, styles, and attributes
  - Update walk(), analyze_tag(), and extract_text_recursively() to filter out hidden elements
  - Add comprehensive test suite with direct method tests and example HTML files

  This fixes two issues:
  1. Missing questions in accordion components
  2. Unwanted extraction of hidden metadata content

  Tests: tests/test_html_enhanced.py

  Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>

* + html-backend itelsd

  Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>

* run pre-commit run --all-files

---------

Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>
Co-authored-by: Ulan.Yisaev <ulan.yisaev@nortal.com>
50 lines
1.6 KiB
HTML
<!DOCTYPE html>
<html>
<head>
    <title>Hidden Elements Test</title>
</head>
<body>
    <div class="container">
        <h3>Visible Elements Test</h3>

        <!-- Visible content that should be extracted -->
        <p>This is a regular paragraph that should be extracted.</p>

        <!-- Content with class="hidden" that should be skipped -->
        <div class="hidden">
            <p>This text has class="hidden" and should NOT be extracted.</p>
        </div>

        <!-- Content with style="display:none" that should be skipped -->
        <div style="display:none">
            <p>This text has style="display:none" and should NOT be extracted.</p>
        </div>

        <!-- Content with hidden attribute that should be skipped -->
        <div hidden>
            <p>This text has the hidden attribute and should NOT be extracted.</p>
        </div>

        <!-- Content with class="d-none" (Bootstrap) that should be skipped -->
        <div class="d-none">
            <p>This text has class="d-none" and should NOT be extracted.</p>
        </div>

        <!-- Content with class="invisible" (Bootstrap) that should be skipped -->
        <div class="invisible">
            <p>This text has class="invisible" and should NOT be extracted.</p>
        </div>

        <!-- Content with class="collapse" (Bootstrap) that should be skipped -->
        <div class="collapse">
            <p>This text has class="collapse" and should NOT be extracted.</p>
        </div>

        <!-- Visible content that should be extracted -->
        <div class="visible-content">
            <p>This is another regular paragraph that should be extracted.</p>
            <div class="keywords hidden">Keywords that should NOT be extracted.</div>
        </div>
    </div>
</body>
</html>
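
The fixture above exercises three ways an element can be marked hidden: a class name, an inline `display:none` style, and the `hidden` attribute. The following is a minimal sketch of that filtering idea using only Python's standard `html.parser`. It is not the docling implementation; `HIDDEN_CLASSES`, `is_hidden`, `VisibleTextExtractor`, and `extract_visible_text` are hypothetical names chosen for illustration, and the set of hidden class names is an assumption taken from the comments in the fixture.

```python
from html.parser import HTMLParser

# Assumption: class names treated as hidden, copied from the fixture's comments.
HIDDEN_CLASSES = {"hidden", "d-none", "invisible", "collapse"}

# Void elements never get an end tag, so they must not be pushed on the stack.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


def is_hidden(attrs):
    """Return True if a tag's attributes mark it as hidden (illustrative rules)."""
    attr_map = dict(attrs)
    if "hidden" in attr_map:  # boolean attribute: <div hidden>
        return True
    classes = (attr_map.get("class") or "").split()
    if HIDDEN_CLASSES.intersection(classes):
        return True
    style = (attr_map.get("style") or "").replace(" ", "").lower()
    return "display:none" in style


class VisibleTextExtractor(HTMLParser):
    """Collect text chunks that are not inside a hidden element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one bool per open element: hidden, or inside a hidden ancestor
        self.chunks = []  # visible text in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parent_hidden = bool(self.stack) and self.stack[-1]
        self.stack.append(parent_hidden or is_hidden(attrs))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and not (self.stack and self.stack[-1]):
            self.chunks.append(text)


def extract_visible_text(html):
    """Parse an HTML string and return its visible text chunks as a list."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Calling `extract_visible_text()` on the fixture's markup would be expected to keep the two "should be extracted" paragraphs (along with the title and heading text) and drop every block marked "should NOT be extracted". Hidden state is inherited through the stack, so a visible child inside a hidden parent is still skipped.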