docling/tests/data/html/hidden_elements_test.html
Ulan Yisaev 4c88d4fe14
Fix html backend accordion hidden (#1)
* fix(html-backend): improve accordion extraction and hidden content handling

   - Add specialized handlers for Bootstrap accordion components to properly extract
     questions from panel-title elements
   - Implement is_hidden_element() method to detect and skip content with hidden
     classes, styles, and attributes
   - Update walk(), analyze_tag(), and extract_text_recursively() to filter out
     hidden elements
   - Add comprehensive test suite with direct method tests and example HTML files

   This fixes two issues:
   1. Missing questions in accordion components
   2. Unwanted extraction of hidden metadata content

   Tests: tests/test_html_enhanced.py

Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>

* + html-backend itelsd

Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>

* run pre-commit run --all-files

---------

Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>
Co-authored-by: Ulan.Yisaev <ulan.yisaev@nortal.com>
2025-03-09 18:13:24 +02:00

50 lines
1.6 KiB
HTML

<!DOCTYPE html>
<html>
<head>
<title>Hidden Elements Test</title>
</head>
<body>
<div class="container">
<h3>Visible Elements Test</h3>
<!-- Visible content that should be extracted -->
<p>This is a regular paragraph that should be extracted.</p>
<!-- Content with class="hidden" that should be skipped -->
<div class="hidden">
<p>This text has class="hidden" and should NOT be extracted.</p>
</div>
<!-- Content with style="display:none" that should be skipped -->
<div style="display:none">
<p>This text has style="display:none" and should NOT be extracted.</p>
</div>
<!-- Content with hidden attribute that should be skipped -->
<div hidden>
<p>This text has the hidden attribute and should NOT be extracted.</p>
</div>
<!-- Content with class="d-none" (Bootstrap) that should be skipped -->
<div class="d-none">
<p>This text has class="d-none" and should NOT be extracted.</p>
</div>
<!-- Content with class="invisible" (Bootstrap) that should be skipped -->
<div class="invisible">
<p>This text has class="invisible" and should NOT be extracted.</p>
</div>
<!-- Content with class="collapse" (Bootstrap) that should be skipped -->
<div class="collapse">
<p>This text has class="collapse" and should NOT be extracted.</p>
</div>
<!-- Visible content that should be extracted -->
<div class="visible-content">
<p>This is another regular paragraph that should be extracted.</p>
<div class="keywords hidden">Keywords that should NOT be extracted.</div>
</div>
</div>
</body>
</html>
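
The commit message mentions an is_hidden_element() check that skips content hidden by class, inline style, or attribute. The sketch below illustrates that idea against the cases in this fixture. It is not the docling implementation: it uses only Python's standard-library html.parser, and the names HIDDEN_CLASSES, is_hidden, VisibleTextExtractor, and visible_text are hypothetical, chosen for illustration. Treating every "collapse" element as hidden follows this fixture's expectations and ignores Bootstrap's "collapse show" state.

```python
from html.parser import HTMLParser

# Assumption: class names treated as hidden, taken from the fixture above.
HIDDEN_CLASSES = {"hidden", "d-none", "invisible", "collapse"}
# Void elements never get a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def is_hidden(attrs: dict) -> bool:
    """Return True if the element's own attributes mark it as hidden."""
    if "hidden" in attrs:  # boolean `hidden` attribute
        return True
    classes = set((attrs.get("class") or "").split())
    if classes & HIDDEN_CLASSES:
        return True
    # Normalize the inline style so "display: none" and "display:none" match.
    style = (attrs.get("style") or "").replace(" ", "").lower()
    return "display:none" in style or "visibility:hidden" in style


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes that are not inside any hidden element."""

    def __init__(self):
        super().__init__()
        self.stack = []        # (tag, hidden) for each open element
        self.hidden_depth = 0  # number of open hidden ancestors
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        hidden = is_hidden(dict(attrs))
        self.stack.append((tag, hidden))
        if hidden:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        # Ignore stray closing tags that match no open element.
        if not any(open_tag == tag for open_tag, _ in self.stack):
            return
        # Pop back to the matching open tag, closing unclosed children too.
        while self.stack:
            open_tag, hidden = self.stack.pop()
            if hidden:
                self.hidden_depth -= 1
            if open_tag == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.hidden_depth == 0:
            self.chunks.append(text)


def visible_text(html: str) -> list:
    """Return the visible text chunks of an HTML string, in document order."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Calling visible_text() on the body of this fixture should keep the heading and the two regular paragraphs and drop every paragraph marked "should NOT be extracted", including the nested "keywords hidden" div. Because the parser tracks a depth counter rather than checking each node in isolation, anything nested inside a hidden container is skipped as well.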