* fix(html-backend): improve accordion extraction and hidden content handling
- Add specialized handlers for Bootstrap accordion components to properly extract
questions from panel-title elements
- Implement is_hidden_element() method to detect and skip content with hidden
classes, styles, and attributes
- Update walk(), analyze_tag(), and extract_text_recursively() to filter out
hidden elements
- Add comprehensive test suite with direct method tests and example HTML files
This fixes two issues:
1. Missing questions in accordion components
2. Unwanted extraction of hidden metadata content
Tests: tests/test_html_enhanced.py
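Illustrative sketch of the hidden-element check described above (not the code
from this commit). It assumes attributes arrive as a plain dict, and the class
names and style patterns are hypothetical examples rather than the backend's
actual list:

```python
import re

# Hypothetical marker classes; the real backend may use a different set.
HIDDEN_CLASSES = {"hidden", "d-none", "hide", "invisible"}

# Inline styles that remove an element from the rendered page.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE)


def is_hidden_element(attrs: dict) -> bool:
    """Return True if the tag attributes mark the element as hidden."""
    # Boolean HTML attribute: <div hidden>
    if "hidden" in attrs:
        return True

    # Class-based hiding; accept either a list of classes or a raw string.
    classes = attrs.get("class") or []
    if isinstance(classes, str):
        classes = classes.split()
    if HIDDEN_CLASSES.intersection(c.lower() for c in classes):
        return True

    # Inline style hiding: style="display: none"
    style = attrs.get("style") or ""
    return bool(HIDDEN_STYLE.search(style))
```

A tree walker would call this on each tag and skip the whole subtree when it
returns True.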
Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>
* + html-backend itelsd
Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>
* run 'pre-commit run --all-files'
---------
Signed-off-by: Ulan.Yisaev <ulan.yisaev@nortal.com>
Co-authored-by: Ulan.Yisaev <ulan.yisaev@nortal.com>
* fix: parse HTML files without body tag
Parse HTML files without a 'body' tag, since it is optional in the HTML5 specification.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* test: ensure docling converts HTML without body tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* add real e2e tests for html and docx
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the output of itxt
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the text
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the tests (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the examples (1)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the output of the test
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the tests, moved the ground-truth
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* moved the ground-truth data
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the html tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* restructure title fix (#187)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>