docling

mirror of https://github.com/DS4SD/docling.git synced 2025-08-01 23:12:20 +00:00

Cesar Berrospi Ramis 38d622f22c refactor(html): put parsed item in body if doc has no header. If an HTML document does not have any header tag, all parsed items are placed in DoclingDocument's body content layer. HTML paragraphs ('p' tags) are parsed as text items with paragraph label. Update test ground truth according to the changes above. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>		2025-02-28 18:03:58 +01:00
asciidoc	fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903)	2025-02-07 08:43:31 +01:00
csv	feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)	2025-02-14 08:55:09 +01:00
docx	fix(docx): merged table cells not properly converted (#857)	2025-02-03 10:20:03 +01:00
groundtruth	refactor(html): put parsed item in body if doc has no header	2025-02-28 18:03:58 +01:00
html	fix(html): Parse text in div elements as TextItem (#1041)	2025-02-24 12:38:29 +01:00
jats	feat(xml-jats): parse XML JATS documents (#967)	2025-02-17 10:43:31 +01:00
md	fix(markdown): handle nested lists (#910)	2025-02-07 12:55:12 +01:00
pdf	fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903)	2025-02-07 08:43:31 +01:00
pptx	feat: Extracting picture data for raster images found in PPTX (#349)	2024-11-18 15:22:28 +01:00
uspto	feat: create a backend to parse USPTO patents into DoclingDocument (#606)	2024-12-17 16:35:23 +01:00
xlsx	fix: added extraction of byte-images in excel (#804)	2025-01-24 18:48:02 +01:00
2305.03393v1-pg9-img.png	feat!: Docling v2 (#117)	2024-10-16 21:02:03 +02:00