docling/docling
Rafael Teixeira de Lima 14e9c0ce9a
fix(docx): Adding new latex symbols, simplifying how equations are added to text (#1295)
* Adding new latex symbols, simplifying how equations are added to text

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Identify headers through inherited style

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Log warning message instead of print

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)

fix wrong type text extracted by tesseract_ocr_cli_model

Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix(docx): Improve text parsing (#1268)

* chore: bump version to 2.28.4 [skip ci]

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Improve text parsing

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)

fix wrong type text extracted by tesseract_ocr_cli_model

Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Make heading detection more flexible

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Fix trailing space

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Remove trailing space

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* docs: add visual grounding example (#1270)

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* feat(docx): add text formatting and hyperlink support (#630)

* feat: Enable markdown text formatting for docx

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix imports

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use Formatting

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle hyperlink

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle formatting properly for DocItemLabel.PARAGRAPH

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline group

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle bullet lists

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run black and mypy

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle header and footer

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline_fmt everywhere

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run precommit

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Address feedback

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix add_list_item

Signed-off-by: SimJeg <sjegou@nvidia.com>

* fix minor bugs, mark helper methods internal

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix(pptx): check if picture shape has an image attached (#1316)

Check if picture shape has an image attached in pptx backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* chore: update lock file (#1315)

chore: update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* docs: add plugins docs (#1319)

add plugin docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* feat: handle <code> tags as code blocks (#1320)

handle <code> tags as code blocks

Signed-off-by: FernandoSSI <fernandosi2005@gmail.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Simon Jégou <SimJeg@users.noreply.github.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Fernando Santos <121275806+FernandoSSI@users.noreply.github.com>
2025-04-08 17:11:37 +02:00
backend fix(docx): Adding new latex symbols, simplifying how equations are added to text (#1295) 2025-04-08 17:11:37 +02:00
chunking feat: expose new hybrid chunker, update docs (#384) 2024-12-09 08:28:29 +01:00
cli feat(SmolDocling): Support MLX acceleration in VLM pipeline (#1199) 2025-03-19 15:38:54 +01:00
datamodel feat(SmolDocling): Support MLX acceleration in VLM pipeline (#1199) 2025-03-19 15:38:54 +01:00
models fix: Tesseract OCR CLI can't process images composed with numbers only (#1201) 2025-03-31 10:53:49 +02:00
pipeline feat(SmolDocling): Support MLX acceleration in VLM pipeline (#1199) 2025-03-19 15:38:54 +01:00
utils feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905) 2025-03-18 10:38:19 +01:00
__init__.py Initial commit 2024-07-15 09:42:42 +02:00
document_converter.py fix(converter): Cache same pipeline class with different options (#1152) 2025-03-25 12:18:44 +01:00
exceptions.py feat: Introduce the enable_remote_services option to allow remote connections while processing (#941) 2025-02-12 15:18:01 +01:00
py.typed fix: Add py.typed marker file (#531) 2024-12-06 13:42:14 +01:00