docling/tests/data
Cesar Berrospi Ramis a069b1175b
refactor(HTML): handle text from styled html (#1960)
* A new HTML backend that handles styled HTML (ignoring the styling) as well as images.

Images are parsed as placeholders, with a caption if one exists.

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: vaaale <2428222+vaaale@users.noreply.github.com>
Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>

* tests(HTML): re-enable test_ordered_lists

Re-enable test_ordered_lists regression test for the HTML backend since
docling-core now supports ordered lists with custom start value.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Co-authored-by: Alexander Vaagan <2428222+vaaale@users.noreply.github.com>
2025-07-22 13:16:31 +02:00
asciidoc fix(asciidoc): set default size when missing in image directive (#1769) 2025-06-16 10:38:46 +02:00
audio feat: Support audio input (#1763) 2025-06-23 14:47:26 +02:00
csv feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
docx fix(msword_backend): Identify text in the same line after an image #1425 (#1610) 2025-06-20 10:55:30 +02:00
groundtruth refactor(HTML): handle text from styled html (#1960) 2025-07-22 13:16:31 +02:00
html refactor(HTML): handle text from styled html (#1960) 2025-07-22 13:16:31 +02:00
jats fix: fix HTML table parser and JATS backend bugs (#1948) 2025-07-16 10:49:24 +02:00
md fix(markdown): make parsing of rich table cells valid (#1821) 2025-06-26 19:50:45 +02:00
pdf fix(pypdfium): resolve overlapping text when merging bounding boxes (#1549) 2025-05-19 15:26:00 +02:00
pptx fix: pptx line break and space handling (#1664) 2025-06-16 10:44:30 +02:00
uspto feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
webp feat: enable precision control in float serialization (#1914) 2025-07-09 16:39:17 +02:00
xlsx feat: support xlsm files (#1520) 2025-06-10 16:55:59 +02:00
2305.03393v1-pg9-img.png feat!: Docling v2 (#117) 2024-10-16 21:02:03 +02:00