Mirror of https://github.com/DS4SD/docling.git, synced 2025-08-01 15:02:21 +00:00
* fix(html-backend): improve accordion extraction and hidden content handling

  - Add specialized handlers for Bootstrap accordion components to properly extract questions from panel-title elements
  - Implement is_hidden_element() method to detect and skip content with hidden classes, styles, and attributes
  - Update walk(), analyze_tag(), and extract_text_recursively() to filter out hidden elements
  - Add comprehensive test suite with direct method tests and example HTML files

  This fixes two issues:
  1. Missing questions in accordion components
  2. Unwanted extraction of hidden metadata content

  Tests: tests/test_html_enhanced.py

  Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>

* + html-backend itelsd

  Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>

* run pre-commit run --all-files

---------

Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>
Co-authored-by: Ulan.Yisaev <ulan.yisaev@nortal.com>

Name | ||
---|---|---|
data | ||
data_scanned | ||
__init__.py | ||
test_backend_asciidoc.py | ||
test_backend_csv.py | ||
test_backend_docling_json.py | ||
test_backend_docling_parse_v2.py | ||
test_backend_docling_parse.py | ||
test_backend_html.py | ||
test_backend_jats.py | ||
test_backend_markdown.py | ||
test_backend_msexcel.py | ||
test_backend_msword.py | ||
test_backend_patent_uspto.py | ||
test_backend_pdfium.py | ||
test_backend_pptx.py | ||
test_cli.py | ||
test_code_formula.py | ||
test_data_gen_flag.py | ||
test_document_picture_classifier.py | ||
test_e2e_conversion.py | ||
test_e2e_ocr_conversion.py | ||
test_html_enhanced.py | ||
test_input_doc.py | ||
test_interfaces.py | ||
test_invalid_input.py | ||
test_legacy_format_transform.py | ||
test_options.py | ||
verify_utils.py |