docling/tests
Ulan Yisaev 4c88d4fe14
Fix html backend accordion hidden (#1)
* fix(html-backend): improve accordion extraction and hidden content handling

   - Add specialized handlers for Bootstrap accordion components to properly extract
     questions from panel-title elements
   - Implement is_hidden_element() method to detect and skip content with hidden
     classes, styles, and attributes
   - Update walk(), analyze_tag(), and extract_text_recursively() to filter out
     hidden elements
   - Add comprehensive test suite with direct method tests and example HTML files

   This fixes two issues:
   1. Missing questions in accordion components
   2. Unwanted extraction of hidden metadata content

   Tests: tests/test_html_enhanced.py

Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>

* + html-backend itelsd

Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>

* run pre-commit run --all-files

---------

Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>
Co-authored-by: Ulan.Yisaev <ulan.yisaev@nortal.com>
2025-03-09 18:13:24 +02:00
..
data Fix html backend accordion hidden (#1) 2025-03-09 18:13:24 +02:00
data_scanned fix: Proper handling of orphan IDs in layout postprocessing (#1118) 2025-03-05 14:30:59 +01:00
__init__.py fix: Add unit tests (#51) 2024-08-30 14:08:20 +02:00
test_backend_asciidoc.py fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
test_backend_csv.py test: avoid testing exact JSON in CSV backend (#1038) 2025-02-24 08:10:40 +01:00
test_backend_docling_json.py feat: add Docling JSON ingestion (#783) 2025-01-24 18:05:23 +01:00
test_backend_docling_parse_v2.py fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
test_backend_docling_parse.py fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
test_backend_html.py refactor: add the contentlayer to html-backend (#1040) 2025-03-02 10:37:53 -05:00
test_backend_jats.py test: avoid testing exact JSON in CSV backend (#1038) 2025-02-24 08:10:40 +01:00
test_backend_markdown.py fix(markdown): handle nested lists (#910) 2025-02-07 12:55:12 +01:00
test_backend_msexcel.py test: avoid testing exact JSON in CSV backend (#1038) 2025-02-24 08:10:40 +01:00
test_backend_msword.py test: avoid testing exact JSON in CSV backend (#1038) 2025-02-24 08:10:40 +01:00
test_backend_patent_uspto.py test: avoid testing exact JSON (#1027) 2025-02-20 16:20:07 +01:00
test_backend_pdfium.py fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
test_backend_pptx.py test: avoid testing exact JSON in CSV backend (#1038) 2025-02-24 08:10:40 +01:00
test_cli.py fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
test_code_formula.py fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
test_data_gen_flag.py fix(markdown): handle nested lists (#910) 2025-02-07 12:55:12 +01:00
test_document_picture_classifier.py fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
test_e2e_conversion.py fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
test_e2e_ocr_conversion.py feat: Python 3.13 support (#841) 2025-01-30 17:26:42 +01:00
test_html_enhanced.py Fix html backend accordion hidden (#1) 2025-03-09 18:13:24 +02:00
test_input_doc.py feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
test_interfaces.py fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
test_invalid_input.py fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
test_legacy_format_transform.py fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
test_options.py fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
verify_utils.py test: avoid testing exact JSON in CSV backend (#1038) 2025-02-24 08:10:40 +01:00