Commit Graph

467 Commits

Author SHA1 Message Date
Benichou
eb7980af0b fix/adding a commit with a signature Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:55:28 -04:00
Benichou
d9f07040f3 DCO Remediation Commit for Benichou <fbenichou@deloitte.ca>
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:53:05 -04:00
Benichou
89dc98bd6f DCO Remediation Commit for Benichou <fbenichou@deloitte.ca>
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:31:25 -04:00
Benichou
ed56086a65 fix/poetry_check Signed-off-by: Benichou <fbenichou@deloitte.ca> 2025-06-20 16:24:48 -04:00
Benichou
4420c38936 fix/ran poetry run pre-commit run --all-files to format the file Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:49 -04:00
Benichou
22bf211acf fix/removed generate=True in test_backend_pptx.py in verify_export method to not conflict with main branch Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:48 -04:00
Benichou
2e3c4e10cb fix/adding the missing slide size argument in the handle pictures in the mspowerpoint_backend.py file and adding generate=True in the verify export method in the pytest for pptx to ensure the pytest passes appropriately Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:48 -04:00
Benichou
a35d9bb8b8 fix: run poetry pre-commit all files to black format changes Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:48 -04:00
Benichou
82a9d27c96 fix/implementing the capture of pptx_image with the same method from docx backend by extracting the drawing blip
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:47 -04:00
Benichou
dda339397b fix/adding a commit with a signature Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:46 -04:00
Benichou
b0553e8812 fix/implementing the capture of pptx_image with the same method from docx backend by extracting the drawing blip
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:46 -04:00
Michele Dolfi
95e49705e8 chore: update lock file (#1315)
chore: update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:45 -04:00
Maxim Lysak
46fa6e5eb0 fix(pptx): check if picture shape has an image attached (#1316)
Check if picture shape has an image attached in pptx backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:45 -04:00
Simon Jégou
78dab32819 feat(docx): add text formatting and hyperlink support (#630)
* feat: Enable markdown text formatting for docx

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix imports

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use Formatting

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle hyperlink

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle formatting properly for DocItemLabel.PARAGRAPH

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline group

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle bullet lists

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run black and mypy

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle header and footer

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline_fmt everywhere

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run precommit

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Address feedback

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix add_list_item

Signed-off-by: SimJeg <sjegou@nvidia.com>

* fix minor bugs, mark helper methods internal

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:44 -04:00
Panos Vagenas
e652f134ee docs: add visual grounding example (#1270)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:44 -04:00
Rafael Teixeira de Lima
7af290e482 fix(docx): Improve text parsing (#1268)
* chore: bump version to 2.28.4 [skip ci]

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Improve text parsing

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)

fix wrong type text extracted by tesseract_ocr_cli_model

Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Flexibilize heading detection

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Fix trailing space

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Remove trailing space

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:43 -04:00
Guilhem VERMOREL
4c741b53fa fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)
fix wrong type text extracted by tesseract_ocr_cli_model

Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:42 -04:00
Benichou
020f79a5ee fix/poetry_check Signed-off-by: Benichou <fbenichou@deloitte.ca> 2025-06-20 16:16:03 -04:00
Benichou
d8d21489e0 Merge branch 'bug_1242/drawing_blip_fix' of https://github.com/benichou/docling into bug_1242/drawing_blip_fix
Merging remote to local
2025-06-05 11:29:09 -04:00
Benichou
6a432d9115 fix/ran poetry run pre-commit run --all-files to format the file Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-05 11:26:39 -04:00
Benichou
7468137c55 fix/removed generate=True in test_backend_pptx.py in verify_export method to not conflict with main branch Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-05 11:26:38 -04:00
Benichou
a5e8c2d1be fix/adding the missing slide size argument in the handle pictures in the mspowerpoint_backend.py file and adding generate=True in the verify export method in the pytest for pptx to ensure the pytest passes appropriately Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-05 11:26:38 -04:00
Benichou
30cfaaf39f fix: run poetry pre-commit all files to black format changes Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-05 11:26:38 -04:00
Benichou
45eb3e79f7 fix/implementing the capture of pptx_image with the same method from docx backend by extracting the drawing blip
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-05 11:26:37 -04:00
Benichou
d8873aa0c9 fix/adding a commit with a signature Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-05 11:26:37 -04:00
Benichou
f6d4e67559 fix/implementing the capture of pptx_image with the same method from docx backend by extracting the drawing blip
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-05 11:26:36 -04:00
Benichou
56208f6dc0 fix/ran poetry run pre-commit run --all-files to format the file Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr 2025-05-14 15:35:50 -04:00
Benichou
2077e51033 fix/removed generate=True in test_backend_pptx.py in verify_export method to not conflict with main branch Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr 2025-05-13 20:46:08 -04:00
Benichou
4e8bf2c4d3 fix/adding the missing slide size argument in the handle pictures in the mspowerpoint_backend.py file and adding generate=True in the verify export method in the pytest for pptx to ensure the pytest passes appropriately Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr 2025-05-13 20:34:56 -04:00
Benichou
9fcace4e47 fix: run poetry pre-commit all files to black format changes Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr 2025-04-14 22:43:44 -04:00
Benichou
253cfab15e fix/implementing the capture of pptx_image with the same method from docx backend by extracting the drawing blip
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-04-08 11:40:21 -04:00
Benichou
02f77bbabd fix/adding a commit with a signature Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-04-08 11:40:21 -04:00
Benichou
b7c3f2e984 fix/implementing the capture of pptx_image with the same method from docx backend by extracting the drawing blip 2025-04-08 00:53:24 -04:00
Michele Dolfi
61de30966f chore: update lock file (#1315)
chore: update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-07 17:47:51 +02:00
Maxim Lysak
dc3bf9ceac fix(pptx): check if picture shape has an image attached (#1316)
Check if picture shape has an image attached in pptx backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-04-07 17:36:56 +02:00
Simon Jégou
bfcab3d677 feat(docx): add text formatting and hyperlink support (#630)
* feat: Enable markdown text formatting for docx

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix imports

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use Formatting

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle hyperlink

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle formatting properly for DocItemLabel.PARAGRAPH

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline group

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle bullet lists

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run black and mypy

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle header and footer

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline_fmt everywhere

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run precommit

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Address feedback

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix add_list_item

Signed-off-by: SimJeg <sjegou@nvidia.com>

* fix minor bugs, mark helper methods internal

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2025-04-03 15:11:50 +02:00
Panos Vagenas
71148eb381 docs: add visual grounding example (#1270)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-04-02 14:03:19 +02:00
Rafael Teixeira de Lima
d2d68747f9 fix(docx): Improve text parsing (#1268)
* chore: bump version to 2.28.4 [skip ci]

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Improve text parsing

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)

fix wrong type text extracted by tesseract_ocr_cli_model

Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Flexibilize heading detection

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Fix trailing space

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Remove trailing space

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
2025-04-02 12:56:44 +02:00
Guilhem VERMOREL
b3d111a3cd fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)
fix wrong type text extracted by tesseract_ocr_cli_model

Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
2025-03-31 10:53:49 +02:00
github-actions[bot]
44f2b081ec chore: bump version to 2.28.4 [skip ci] 2025-03-29 11:56:42 +00:00
Maxim Lysak
7afad7e52d fix: Fixes tables when using OCR (#1261)
Fix for the tables when using OCR

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-03-29 10:06:00 +01:00
github-actions[bot]
124f921077 chore: bump version to 2.28.3 [skip ci] 2025-03-28 18:30:03 +00:00
Maxim Lysak
8bd71e8e33 fix: Word-level pdf cells for tables (#1238)
* word-level pdf cells for tables

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* removed comments

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated dependency to docling-core

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-03-28 16:34:48 +01:00
github-actions[bot]
82694b2136 chore: bump version to 2.28.2 [skip ci] 2025-03-26 16:52:06 +00:00
Panos Vagenas
9210812bfa fix: improve HTML layer detection, various MD fixes (#1241)
Markdown fixes:
- properly propagate section header levels
- improve handling of list subroots without text

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-03-26 16:07:14 +01:00
Panos Vagenas
85c4df887b fix(html): fix HTML parsed heading level (#1244)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-03-26 10:30:23 +01:00
github-actions[bot]
9eb1686f93 chore: bump version to 2.28.1 [skip ci] 2025-03-25 18:20:23 +00:00
Panos Vagenas
38b7108a22 chore: update locked deps (#1239)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-03-25 15:48:02 +01:00
mislavmartinic
825b226fab fix(converter): Cache same pipeline class with different options (#1152)
* Update document_converter.py

Fixing caching same class with different options by using composite key (class, options)

# TODO this will ignore if different options have been defined for the same pipeline class.

at row 292

Signed-off-by: mislavmartinic <mislav.martinic@pontistechnology.com>

* formatted script

* removed unnecessary hasattr check

* pre-commit chain run

---------

Signed-off-by: mislavmartinic <mislav.martinic@pontistechnology.com>
2025-03-25 12:18:44 +01:00
Hoang-Long Do
6df8827231 fix(debug): Missing translation of bbox to to_bounding_box (#1220)
* Fix: Add missing bbox attribute to PdfTextCell

* Fix: Add missing bbox attribute to PdfTextCell

Signed-off-by: hl2311 <dhlong2301@gmail.com>

* fix: Refactor missing bbox attribute to PdfTextCell

Signed-off-by: hl2311 <dhlong2301@gmail.com>

* Signed-off-by: hl2311 <dhlong2301@gmail.com>

fix: Refactor missing bbox attribute to PdfTextCell

---------

Signed-off-by: hl2311 <dhlong2301@gmail.com>
2025-03-25 12:18:10 +01:00