Benichou
2feb4b0c28
fix/removed generate=True in test_backend_pptx.py in verify_export method to not conflict with main branch Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 17:03:47 -04:00
Benichou
4ee7fd7747
DCO Remediation Commit for Benichou <fbenichou@deloitte.ca>
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 17:00:43 -04:00
Benichou
6cf9fd1008
fix/implementing the capture of pptx_image with the same method from docx backend by extracting the drawing blip
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:58:00 -04:00
Benichou
eb7980af0b
fix/adding a commit with a signature Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:55:28 -04:00
Benichou
d9f07040f3
DCO Remediation Commit for Benichou <fbenichou@deloitte.ca>
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:53:05 -04:00
Benichou
89dc98bd6f
DCO Remediation Commit for Benichou <fbenichou@deloitte.ca>
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:31:25 -04:00
Benichou
ed56086a65
fix/poetry_check Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:24:48 -04:00
Benichou
4420c38936
fix/ran poetry run pre-commit run --all-files to format the file Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:49 -04:00
Benichou
22bf211acf
fix/removed generate=True in test_backend_pptx.py in verify_export method to not conflict with main branch Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:48 -04:00
Benichou
2e3c4e10cb
fix/adding the missing slide size argument in the handle pictures in the mspowerpoint_backend.py file and adding generate=True in the verify export method in the pytest for pptx to ensure the pytest passes appropriately Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:48 -04:00
Benichou
a35d9bb8b8
fix: run poetry pre-commit all files to black format changes Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:48 -04:00
Benichou
82a9d27c96
fix/implementing the capture of pptx_image with the same method from docx backend by extracting the drawing blip
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:47 -04:00
Benichou
dda339397b
fix/adding a commit with a signature Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:46 -04:00
Benichou
b0553e8812
fix/implementing the capture of pptx_image with the same method from docx backend by extracting the drawing blip
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:46 -04:00
Michele Dolfi
95e49705e8
chore: update lock file (#1315)
...
chore: update lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:45 -04:00
Maxim Lysak
46fa6e5eb0
fix(pptx): check if picture shape has an image attached (#1316)
...
Check if picture shape has an image attached in pptx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:45 -04:00
Simon Jégou
78dab32819
feat(docx): add text formatting and hyperlink support (#630)
...
* feat: Enable markdown text formatting for docx
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Fix imports
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Use Formatting
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle hyperlink
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle formatting properly for DocItemLabel.PARAGRAPH
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Use inline group
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle bullet lists
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Run black and mypy
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle header and footer
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Use inline_fmt everywhere
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Run precommit
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Address feedback
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Fix add_list_item
Signed-off-by: SimJeg <sjegou@nvidia.com>
* fix minor bugs, mark helper methods internal
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:44 -04:00
Panos Vagenas
e652f134ee
docs: add visual grounding example (#1270)
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:44 -04:00
Rafael Teixeira de Lima
7af290e482
fix(docx): Improve text parsing (#1268)
...
* chore: bump version to 2.28.4 [skip ci]
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Improve text parsing
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Flexibilize heading detection
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Fix trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Remove trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:43 -04:00
Guilhem VERMOREL
4c741b53fa
fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)
...
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:42 -04:00
Benichou
020f79a5ee
fix/poetry_check Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-20 16:16:03 -04:00
Benichou
d8d21489e0
Merge branch 'bug_1242/drawing_blip_fix' of https://github.com/benichou/docling into bug_1242/drawing_blip_fix
...
Merging remote to local
2025-06-05 11:29:09 -04:00
Benichou
6a432d9115
fix/ran poetry run pre-commit run --all-files to format the file Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-05 11:26:39 -04:00
Benichou
7468137c55
fix/removed generate=True in test_backend_pptx.py in verify_export method to not conflict with main branch Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-05 11:26:38 -04:00
Benichou
a5e8c2d1be
fix/adding the missing slide size argument in the handle pictures in the mspowerpoint_backend.py file and adding generate=True in the verify export method in the pytest for pptx to ensure the pytest passes appropriately Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-05 11:26:38 -04:00
Benichou
30cfaaf39f
fix: run poetry pre-commit all files to black format changes Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-05 11:26:38 -04:00
Benichou
45eb3e79f7
fix/implementing the capture of pptx_image with the same method from docx backend by extracting the drawing blip
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-05 11:26:37 -04:00
Benichou
d8873aa0c9
fix/adding a commit with a signature Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-05 11:26:37 -04:00
Benichou
f6d4e67559
fix/implementing the capture of pptx_image with the same method from docx backend by extracting the drawing blip
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-06-05 11:26:36 -04:00
Benichou
56208f6dc0
fix/ran poetry run pre-commit run --all-files to format the file Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
2025-05-14 15:35:50 -04:00
Benichou
2077e51033
fix/removed generate=True in test_backend_pptx.py in verify_export method to not conflict with main branch Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
2025-05-13 20:46:08 -04:00
Benichou
4e8bf2c4d3
fix/adding the missing slide size argument in the handle pictures in the mspowerpoint_backend.py file and adding generate=True in the verify export method in the pytest for pptx to ensure the pytest passes appropriately Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
2025-05-13 20:34:56 -04:00
Benichou
9fcace4e47
fix: run poetry pre-commit all files to black format changes Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
2025-04-14 22:43:44 -04:00
Benichou
253cfab15e
fix/implementing the capture of pptx_image with the same method from docx backend by extracting the drawing blip
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-04-08 11:40:21 -04:00
Benichou
02f77bbabd
fix/adding a commit with a signature Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr
...
Signed-off-by: Benichou <fbenichou@deloitte.ca>
2025-04-08 11:40:21 -04:00
Benichou
b7c3f2e984
fix/implementing the capture of pptx_image with the same method from docx backend by extracting the drawing blip
2025-04-08 00:53:24 -04:00
Michele Dolfi
61de30966f
chore: update lock file (#1315)
...
chore: update lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-07 17:47:51 +02:00
Maxim Lysak
dc3bf9ceac
fix(pptx): check if picture shape has an image attached (#1316)
...
Check if picture shape has an image attached in pptx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-04-07 17:36:56 +02:00
Simon Jégou
bfcab3d677
feat(docx): add text formatting and hyperlink support (#630)
...
* feat: Enable markdown text formatting for docx
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Fix imports
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Use Formatting
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle hyperlink
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle formatting properly for DocItemLabel.PARAGRAPH
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Use inline group
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle bullet lists
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Run black and mypy
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle header and footer
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Use inline_fmt everywhere
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Run precommit
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Address feedback
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Fix add_list_item
Signed-off-by: SimJeg <sjegou@nvidia.com>
* fix minor bugs, mark helper methods internal
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2025-04-03 15:11:50 +02:00
Panos Vagenas
71148eb381
docs: add visual grounding example (#1270)
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-04-02 14:03:19 +02:00
Rafael Teixeira de Lima
d2d68747f9
fix(docx): Improve text parsing (#1268)
...
* chore: bump version to 2.28.4 [skip ci]
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Improve text parsing
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Flexibilize heading detection
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Fix trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Remove trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
2025-04-02 12:56:44 +02:00
Guilhem VERMOREL
b3d111a3cd
fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)
...
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
2025-03-31 10:53:49 +02:00
github-actions[bot]
44f2b081ec
chore: bump version to 2.28.4 [skip ci]
2025-03-29 11:56:42 +00:00
Maxim Lysak
7afad7e52d
fix: Fixes tables when using OCR (#1261)
...
Fix for the tables when using OCR
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-03-29 10:06:00 +01:00
github-actions[bot]
124f921077
chore: bump version to 2.28.3 [skip ci]
2025-03-28 18:30:03 +00:00
Maxim Lysak
8bd71e8e33
fix: Word-level pdf cells for tables (#1238)
...
* word-level pdf cells for tables
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* removed comments
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Updated dependency to docling-core
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-03-28 16:34:48 +01:00
github-actions[bot]
82694b2136
chore: bump version to 2.28.2 [skip ci]
2025-03-26 16:52:06 +00:00
Panos Vagenas
9210812bfa
fix: improve HTML layer detection, various MD fixes (#1241)
...
Markdown fixes:
- properly propagate section header levels
- improve handling of list subroots without text
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-03-26 16:07:14 +01:00
Panos Vagenas
85c4df887b
fix(html): fix HTML parsed heading level (#1244)
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-03-26 10:30:23 +01:00
github-actions[bot]
9eb1686f93
chore: bump version to 2.28.1 [skip ci]
2025-03-25 18:20:23 +00:00