Commit Graph

457 Commits

Author SHA1 Message Date
Rafael Teixeira de Lima
643e4918c3 Fix test file
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-08 16:27:18 +02:00
Rafael Teixeira de Lima
9557431b94 Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-08 16:25:16 +02:00
Rafael Teixeira de Lima
4949471e50 Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-08 16:20:54 +02:00
Rafael Teixeira de Lima
e7fc1a40ed Identify headers through inhenrited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-08 16:20:54 +02:00
Rafael Teixeira de Lima
ae2e0832cd Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-08 16:20:54 +02:00
Fernando Santos
a29d4f7429 feat: handle <code> tags as code blocks (#1320)
handle <code> tags as code blocks

Signed-off-by: FernandoSSI <fernandosi2005@gmail.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-08 16:20:54 +02:00
Michele Dolfi
36006d5829 docs: add plugins docs (#1319)
add plugin docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-08 16:20:54 +02:00
Michele Dolfi
951127605d chore: update lock file (#1315)
chore: update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-08 16:20:54 +02:00
Maxim Lysak
b85e0196f6 fix(pptx): check if picture shape has an image attached (#1316)
Check if picture shape has an image attached in pptx backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-08 16:20:54 +02:00
Simon Jégou
fc306a7817 feat(docx): add text formatting and hyperlink support (#630)
* feat: Enable markdown text formatting for docx

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix imports

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use Formatting

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle hyperlink

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle formatting properly for DocItemLabel.PARAGRAPH

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline group

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle bullet lists

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run black and mypy

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle header and footer

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline_fmt everywhere

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run precommit

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Address feedback

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix add_list_item

Signed-off-by: SimJeg <sjegou@nvidia.com>

* fix minor bugs, mark helper methods internal

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-08 16:20:54 +02:00
Panos Vagenas
07f0846d42 docs: add visual grounding example (#1270)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-08 16:20:54 +02:00
Rafael Teixeira de Lima
870e33235d fix(docx): Improve text parsing (#1268)
* chore: bump version to 2.28.4 [skip ci]

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Improve text parsing

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)

fix wrong type text extracted by tesseract_ocr_cli_model

Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Flexibilize heading detection

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Fix trailing space

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Remove trailing space

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-08 16:20:54 +02:00
Guilhem VERMOREL
fb36311e3a fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)
fix wrong type text extracted by tesseract_ocr_cli_model

Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-08 16:20:54 +02:00
Rafael Teixeira de Lima
2ad8da9be9 Merge branch 'rtdl/new_latex_symbols' of github.com:DS4SD/docling into rtdl/new_latex_symbols
2025-04-08 16:19:25 +02:00
Rafael Teixeira de Lima
ee30d3e7f8 Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-08 16:19:14 +02:00
Rafael Teixeira de Lima
851baf1090 Identify headers through inhenrited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-08 16:19:14 +02:00
Rafael Teixeira de Lima
207cd78a26 Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-08 16:19:08 +02:00
Rafael Teixeira de Lima
556b949b18 Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-08 15:41:16 +02:00
Fernando Santos
0499cd1c1e feat: handle <code> tags as code blocks (#1320)
handle <code> tags as code blocks

Signed-off-by: FernandoSSI <fernandosi2005@gmail.com>
2025-04-08 10:32:06 +02:00
Michele Dolfi
2e99e5a54f docs: add plugins docs (#1319)
add plugin docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-08 09:44:37 +02:00
Michele Dolfi
61de30966f chore: update lock file (#1315)
chore: update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-07 17:47:51 +02:00
Maxim Lysak
dc3bf9ceac fix(pptx): check if picture shape has an image attached (#1316)
Check if picture shape has an image attached in pptx backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-04-07 17:36:56 +02:00
Rafael Teixeira de Lima
4bea04dc75 Identify headers through inhenrited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-04 14:46:43 +02:00
Rafael Teixeira de Lima
32b03b65f4 Merge branch 'main' into rtdl/new_latex_symbols
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-03 18:00:32 +02:00
Rafael Teixeira de Lima
64a7888092 Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-04-03 17:57:30 +02:00
Simon Jégou
bfcab3d677 feat(docx): add text formatting and hyperlink support (#630)
* feat: Enable markdown text formatting for docx

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix imports

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use Formatting

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle hyperlink

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle formatting properly for DocItemLabel.PARAGRAPH

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline group

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle bullet lists

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run black and mypy

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle header and footer

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline_fmt everywhere

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run precommit

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Address feedback

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix add_list_item

Signed-off-by: SimJeg <sjegou@nvidia.com>

* fix minor bugs, mark helper methods internal

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2025-04-03 15:11:50 +02:00
Panos Vagenas
71148eb381 docs: add visual grounding example (#1270)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-04-02 14:03:19 +02:00
Rafael Teixeira de Lima
d2d68747f9 fix(docx): Improve text parsing (#1268)
* chore: bump version to 2.28.4 [skip ci]

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Improve text parsing

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)

fix wrong type text extracted by tesseract_ocr_cli_model

Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Flexibilize heading detection

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Fix trailing space

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Remove trailing space

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
2025-04-02 12:56:44 +02:00
Guilhem VERMOREL
b3d111a3cd fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)
fix wrong type text extracted by tesseract_ocr_cli_model

Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
2025-03-31 10:53:49 +02:00
github-actions[bot]
44f2b081ec chore: bump version to 2.28.4 [skip ci]
2025-03-29 11:56:42 +00:00
Maxim Lysak
7afad7e52d fix: Fixes tables when using OCR (#1261)
Fix for the tables when using OCR

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-03-29 10:06:00 +01:00
github-actions[bot]
124f921077 chore: bump version to 2.28.3 [skip ci]
2025-03-28 18:30:03 +00:00
Maxim Lysak
8bd71e8e33 fix: Word-level pdf cells for tables (#1238)
* word-level pdf cells for tables

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* removed comments

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated dependency to docling-core

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-03-28 16:34:48 +01:00
github-actions[bot]
82694b2136 chore: bump version to 2.28.2 [skip ci]
2025-03-26 16:52:06 +00:00
Panos Vagenas
9210812bfa fix: improve HTML layer detection, various MD fixes (#1241)
Markdown fixes:
- properly propagate section header levels
- improve handling of list subroots without text

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-03-26 16:07:14 +01:00
Panos Vagenas
85c4df887b fix(html): fix HTML parsed heading level (#1244)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-03-26 10:30:23 +01:00
github-actions[bot]
9eb1686f93 chore: bump version to 2.28.1 [skip ci]
2025-03-25 18:20:23 +00:00
Panos Vagenas
38b7108a22 chore: update locked deps (#1239)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-03-25 15:48:02 +01:00
mislavmartinic
825b226fab fix(converter): Cache same pipeline class with different options (#1152)
* Update document_converter.py

Fixing caching same class with different options by using composite key (class, options)

# TODO this will ignore if different options have been defined for the same pipeline class.

at row 292

Signed-off-by: mislavmartinic <mislav.martinic@pontistechnology.com>

* formatted script

* removed unnecessary hasattr check

* pre-commit chain run

---------

Signed-off-by: mislavmartinic <mislav.martinic@pontistechnology.com>
2025-03-25 12:18:44 +01:00
Hoang-Long Do
6df8827231 fix(debug): Missing translation of bbox to to_bounding_box (#1220)
* Fix: Add missing bbox attribute to PdfTextCell

* Fix: Add missing bbox attribute to PdfTextCell

Signed-off-by: hl2311 <dhlong2301@gmail.com>

* fix: Refactor missing bbox attribute to PdfTextCell

Signed-off-by: hl2311 <dhlong2301@gmail.com>

* Signed-off-by: hl2311 <dhlong2301@gmail.com>

fix: Refactor missing bbox attribute to PdfTextCell

---------

Signed-off-by: hl2311 <dhlong2301@gmail.com>
2025-03-25 12:18:10 +01:00
Rafael Teixeira de Lima
f739d0e4c5 fix(docx): identifying numbered headers (#1231)
* Modifications to identify numbered headers

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Add style check

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-03-25 11:41:02 +01:00
Clément Doumouro
0974ba4e1c docs(examples): batch conversion doc raises_on_error (#1147)
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
2025-03-25 11:14:39 +01:00
Peter Dave Hello
8ebb0bf1a0 chore: properly clean up apt temporary files in Dockerfile (#1223)
Signed-off-by: Peter Dave Hello <hsu@peterdavehello.org>
2025-03-25 11:10:09 +01:00
github-actions[bot]
7df157204b chore: bump version to 2.28.0 [skip ci]
2025-03-19 15:18:10 +00:00
Maxim Lysak
1c26769785 feat(SmolDocling): Support MLX acceleration in VLM pipeline (#1199)
* Initial implementation to support MLX for VLM pipeline and SmolDocling

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* mlx_model unit

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Add CLI choices for VLM pipeline and model

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Initial implementation to support MLX for VLM pipeline and SmolDocling

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* mlx_model unit

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Add CLI choices for VLM pipeline and model

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updated minimal vlm pipeline example

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* make vlm_pipeline python3.9 compatible

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixed extract_text_from_backend definition

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated README

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated example

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated documentation

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* corrections in the documentation

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Consmetic changes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-03-19 15:38:54 +01:00
Maciej Wieczorek
b454aa1551 feat: Add PPTX notes slides (#474)
* feat: Add PPTX notes slides

Presenter notes may have useful information and should also be extracted.

Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co>

* feat: Move presenter notes into furniture

Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co>

---------

Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co>
2025-03-19 14:52:09 +01:00
Christoph Auer
f5adfb9724 fix: Determine correct page size in DoclingParseV4Backend (#1196)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-03-19 11:05:42 +01:00
Cesar Berrospi Ramis
d5f7798763 test(html): fix regression test after docling-core update (#1197)
Update docling-core dependency to version 2.23.3.
Fix regression test of HTML backend after docling-core dependency update.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-03-19 11:03:46 +01:00
Rafael Teixeira de Lima
0b707d0882 fix(msword): Fixing function return in equations handling (#1194)
* Fixing function return

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Add message

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-03-19 10:34:25 +01:00
Michele Dolfi
1d680b0a32 docs: Linux Foundation AI & Data (#1183)
* point the auxiliary files to the community repo and add lfai in README

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update docs index

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-03-19 09:05:57 +01:00