Maxim Lysak
c803abed9a
feat: Rich tables support for HTML backend (#2324)
...
* Rich tables support for HTML backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Decouple JATS backend from HTML backend; the way tables are created changed significantly
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* updated and added tests
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Refactored parse_table_data in html_backend into a few smaller functions
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Change the scope of a few functions in html_backend.py, making them static where possible
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fix for HTML tables that have tbody and/or thead; these tables are now properly supported
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-09-29 18:12:16 +02:00
Cesar Berrospi Ramis
fa3327e1a6
fix(html): preserve code blocks in list items (#2131)
...
* chore(html): refactor parser to leverage context managers
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com >
* fix(html): parse inline code snippets, also from list items
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com >
* chore(html): remove hidden tags
Remove tags that are not meant to be displayed.
Add regression tests for code blocks, inline code, and hidden tags.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com >
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com >
2025-08-26 06:43:48 +02:00
krrome
94fcc46aa9
feat(html): Support formatting tags in HTML texts (#2111)
...
* add parsing for formatting tags in HTML backend
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch >
fix latest tests + wiki_duck result files.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch >
* convert _collect_parent_format_tags to staticmethod
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch >
---------
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch >
2025-08-22 10:37:34 +02:00
krrome
9687297262
feat(html): Support in-line anchor tags in HTML texts (#1659)
...
* re-implement links for html backend.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch >
* fix inline groups in list items. Write a specific test for find_parent_annotation of _extract_text_and_hyperlink_recursively.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch >
* implement hack for images.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch >
---------
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch >
2025-08-18 09:57:16 +02:00
Cesar Berrospi Ramis
a069b1175b
refactor(HTML): handle text from styled html (#1960)
...
* A new HTML backend that handles styled HTML (ignores it) as well as images.
Images are parsed as placeholders with a caption, if it exists.
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Co-authored-by: vaaale <2428222+vaaale@users.noreply.github.com >
Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com >
* tests(HTML): re-enable test_ordered_lists
Re-enable the test_ordered_lists regression test for the HTML backend since
docling-core now supports ordered lists with a custom start value.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com >
Co-authored-by: Alexander Vaagan <2428222+vaaale@users.noreply.github.com >
2025-07-22 13:16:31 +02:00
Panos Vagenas
0533da1923
feat: leverage new list modeling, capture default markers (#1856)
...
* chore: update docling-core & regenerate test data
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* update backends to leverage new list modeling
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* repin docling-core
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* ensure availability of latest docling-core API
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-06-27 16:37:15 +02:00
Cesar Berrospi Ramis
106951e71e
test: add missing ground truth files (#1667)
...
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-05-28 13:26:49 +02:00
Cesar Berrospi Ramis
776e7ecf9a
fix(HTML): handle row spans in header rows (#1536)
...
* chore(HTML): log the stacktrace of errors
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* fix(HTML): handle row headers like in pivot tables
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-05-09 15:14:32 +02:00
Cesar Berrospi Ramis
ed20124544
fix(html): handle address, details, and summary tags (#1436)
...
* fix(html): handle 'address' tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* fix(html): handle 'details' tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-23 09:30:59 +02:00
Cesar Berrospi Ramis
f94da44ec5
fix(html): handle nested empty lists (#1154)
...
Address the case of nested lists in empty list items.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-03-13 16:56:58 +01:00
Cesar Berrospi Ramis
1b0ead6907
fix(html): Parse text in div elements as TextItem (#1041)
...
feat(html): Parse text in div elements as TextItem
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-02-24 12:38:29 +01:00
Cesar Berrospi Ramis
a112d7a035
fix: parse html with omitted body tag (#818)
...
* fix: parse HTML files without body tag
Parse HTML files without a 'body' tag, since it is optional in the HTML5 specification.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* test: ensure docling converts HTML without body tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-01-27 16:59:00 +01:00
Peter W. J. Staar
f542460af3
fix: fix duplicate title and heading + add e2e tests for html and docx (#186)
...
* add real e2e tests for html and docx
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the output of itxt
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the text
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the tests (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the examples (1)
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the output of the test
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the tests, moved the ground-truth
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* moved the ground-truth data
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the html tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* restructure title fix (#187)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-10-30 13:14:56 +01:00