docling/tests/data
Ulan Yisaev 4c88d4fe14
Fix html backend accordion hidden (#1)
* fix(html-backend): improve accordion extraction and hidden content handling

   - Add specialized handlers for Bootstrap accordion components to properly extract
     questions from panel-title elements
   - Implement is_hidden_element() method to detect and skip content with hidden
     classes, styles, and attributes
   - Update walk(), analyze_tag(), and extract_text_recursively() to filter out
     hidden elements
   - Add comprehensive test suite with direct method tests and example HTML files

   This fixes two issues:
   1. Missing questions in accordion components
   2. Unwanted extraction of hidden metadata content

   Tests: tests/test_html_enhanced.py
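
As a rough illustration of the hidden-content filtering described above, here is a minimal sketch using only the Python standard library. It is not the code from this commit: the class names in `HIDDEN_CLASSES`, the style patterns, and the helper names `VisibleTextExtractor` and `visible_text` are assumptions made for the example, and the real `is_hidden_element()` in the HTML backend may take a different argument and check different markers.

```python
import re
from html.parser import HTMLParser

# Assumed markers for illustration; the commit's actual lists are not shown here.
HIDDEN_CLASSES = {"hidden", "d-none", "invisible"}
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)
# Void elements never get an end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def is_hidden_element(attrs):
    """Return True if a tag's attributes mark it hidden by class, style, or attribute."""
    classes = set((attrs.get("class") or "").split())
    if classes & HIDDEN_CLASSES:
        return True
    if HIDDEN_STYLE.search(attrs.get("style") or ""):
        return True
    return "hidden" in attrs  # bare HTML `hidden` attribute


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping everything nested inside a hidden element."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # > 0 while inside a hidden subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth or is_hidden_element(dict(attrs)):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.hidden_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the visible text chunks of an HTML fragment, in document order."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = (
    '<div class="panel-title"><a>Question?</a></div>'
    '<div class="hidden">metadata</div>'
    '<p style="display: none">x<span>y</span></p>'
    '<p>Answer</p>'
)
print(visible_text(sample))
```

The depth counter assumes well-formed nesting. A tree-based parser, where a hidden ancestor can be checked directly on each node, would avoid that limitation.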

Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>

* + html-backend itelsd

Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>

* run pre-commit run --all-files

---------

Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>
Co-authored-by: Ulan.Yisaev <ulan.yisaev@nortal.com>
2025-03-09 18:13:24 +02:00
asciidoc fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
csv feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
docx fix(docx): merged table cells not properly converted (#857) 2025-02-03 10:20:03 +01:00
groundtruth fix: Proper handling of orphan IDs in layout postprocessing (#1118) 2025-03-05 14:30:59 +01:00
html Fix html backend accordion hidden (#1) 2025-03-09 18:13:24 +02:00
jats feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
md fix(markdown): handle nested lists (#910) 2025-02-07 12:55:12 +01:00
pdf fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
pptx feat: Extracting picture data for raster images found in PPTX (#349) 2024-11-18 15:22:28 +01:00
uspto feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
xlsx fix: added extraction of byte-images in excel (#804) 2025-01-24 18:48:02 +01:00
2305.03393v1-pg9-img.png feat!: Docling v2 (#117) 2024-10-16 21:02:03 +02:00