docling/docling/backend
Rafael Teixeira de Lima 7af290e482 fix(docx): Improve text parsing (#1268)
* chore: bump version to 2.28.4 [skip ci]

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Improve text parsing

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)

fix wrong type text extracted by tesseract_ocr_cli_model

Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Flexibilize heading detection

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Fix trailing space

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Remove trailing space

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:43 -04:00
docx feat: equations to latex in MSWord backend (with inline groups) (#1114) 2025-03-13 15:12:22 +01:00
json feat: add Docling JSON ingestion (#783) 2025-01-24 18:05:23 +01:00
xml fix: Pass tests, update docling-core to 2.22.0 (#1150) 2025-03-13 09:45:55 +01:00
__init__.py Initial commit 2024-07-15 09:42:42 +02:00
abstract_backend.py feat: add Docling JSON ingestion (#783) 2025-01-24 18:05:23 +01:00
asciidoc_backend.py fix: use first table row as col headers (#1156) 2025-03-13 15:34:18 +01:00
csv_backend.py fix: use first table row as col headers (#1156) 2025-03-13 15:34:18 +01:00
docling_parse_backend.py feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905) 2025-03-18 10:38:19 +01:00
docling_parse_v2_backend.py feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905) 2025-03-18 10:38:19 +01:00
docling_parse_v4_backend.py fix: Determine correct page size in DoclingParseV4Backend (#1196) 2025-03-19 11:05:42 +01:00
html_backend.py fix: improve HTML layer detection, various MD fixes (#1241) 2025-03-26 16:07:14 +01:00
md_backend.py fix: improve HTML layer detection, various MD fixes (#1241) 2025-03-26 16:07:14 +01:00
msexcel_backend.py fix: use first table row as col headers (#1156) 2025-03-13 15:34:18 +01:00
mspowerpoint_backend.py feat: Add PPTX notes slides (#474) 2025-03-19 14:52:09 +01:00
msword_backend.py fix(docx): Improve text parsing (#1268) 2025-06-20 16:16:43 -04:00
pdf_backend.py feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905) 2025-03-18 10:38:19 +01:00
pypdfium2_backend.py feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905) 2025-03-18 10:38:19 +01:00