Panos Vagenas
550b1ca2f8
chore: propagate docling-core fix ( #1389 )
...
* chore: propagate docling-core fix
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* update lock to latest docling-core release
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-04-15 10:51:47 +02:00
Felix Dittrich
a7dd59c5cb
docs(ocr): Add docs entry for OnnxTR OCR plugin ( #1382 )
...
feat(ocr): Add docs entry for OnnxTR OCR plugin
Signed-off-by: felix <felixdittrich92@gmail.com >
2025-04-15 09:46:59 +02:00
Michele Dolfi
06227e9970
ci: sign pypi packages ( #1392 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-15 08:59:16 +02:00
Michele Dolfi
5458a88464
ci: add coverage and ruff ( #1383 )
...
* add coverage calculation and push
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* new codecov version and usage of token
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* enable ruff formatter instead of black and isort
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* apply ruff lint fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* apply ruff unsafe fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add removed imports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* runs 1 on linter issues
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* finalize linter fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* Update pyproject.toml
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-14 18:01:26 +02:00
Michele Dolfi
293c28ca7c
docs(security): more statements about secure development ( #1381 )
...
docs: more statement about secure development
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-14 13:53:26 +02:00
Michele Dolfi
01fbfd5652
docs: Add testing in the docs ( #1379 )
...
* add testing to CONTRIBUTING
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* document test generation
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* typo
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-14 12:31:48 +02:00
Michele Dolfi
d9c3999175
chore: update lock file ( #1378 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-14 10:38:10 +02:00
Juil Park
a026b4e84b
docs: Add Notes for Installing in Intel macOS ( #1377 )
...
docs: Add Notes for Intel macOS
Signed-off-by: Juil Park <park@juil.dev >
2025-04-14 10:21:13 +02:00
github-actions[bot]
c391adb5f0
chore: bump version to 2.30.0 [skip ci]
v2.30.0
2025-04-14 08:20:31 +00:00
Michele Dolfi
7e40ad3261
fix(deps): widen typer upper bound ( #1375 )
...
bump typer
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-14 09:23:39 +02:00
Peter W. J. Staar
c0ba88edf1
feat(cli): add option for html with split-page mode ( #1355 )
...
* updated the cli to output html in split-page mode
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* add pin for new docling-core with html split argument
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* relock with fixed html export in docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update test results
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update more tests
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update lock with docling-core fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update test results
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add again chunking extras
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-14 08:41:50 +02:00
Tim Kellogg
0de70e7991
fix: auto-recognize .xlsx, .docx and .pptx files ( #1340 )
...
* bug: auto-recognize .xlsx files
Signed-off-by: Tim Kellogg <timothy.kellogg@gmail.com >
* apply styling
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* apply to other ms office zip formats
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Tim Kellogg <timothy.kellogg@gmail.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-14 07:45:13 +02:00
Simon Leiß
b295da4bfe
chore: Update repository URL in CITATION.cff ( #1363 )
...
Update repository URL in CITATION.cff
Repository was moved to docling-project/docling, so adjust the URL.
Signed-off-by: Simon Leiß <5084100+sleiss@users.noreply.github.com >
2025-04-14 06:57:04 +02:00
Cesar Berrospi Ramis
415b877984
fix(docx): declare image_data variable when handling pictures ( #1359 )
...
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-11 13:04:00 +02:00
Rowan Skewes
250399948d
fix: Implement PictureDescriptionApiOptions.bitmap_area_threshold ( #1248 )
...
fix: Implement PictureDescriptionApiOptions.picture_area_threshold
Signed-off-by: Rowan Skewes <rowan.skewes@gmail.com >
2025-04-11 11:14:05 +02:00
Cesar Berrospi Ramis
eef2bdea77
feat(xlsx): create a page for each worksheet in XLSX backend ( #1332 )
...
* sytle(xlsx): enforce type hints in XLSX backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* feat(xlsx): create a page for each worksheet in XLSX backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* docs(xlsx): add docstrings to XLSX backend module.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* docling(xlsx): add bounding boxes and page size information in cell units
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-11 10:29:53 +02:00
Gabe Goodhart
c605edd8e9
feat: OllamaVlmModel for Granite Vision 3.2 ( #1337 )
...
* build: Add ollama sdk dependency
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* feat: Add option plumbing for OllamaVlmOptions in pipeline_options
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* feat: Full implementation of OllamaVlmModel
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* feat: Connect "granite_vision_ollama" pipeline option to CLI
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* Revert "build: Add ollama sdk dependency"
After consideration, we're going to use the generic OpenAI API instead
of the Ollama-specific API to avoid duplicate work.
This reverts commit bc6b366468cdd66b52540aac9c7d8b584ab48ad0.
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* refactor: Move OpenAI API call logic into utils.utils
This will allow reuse of this logic in a generic VLM model
NOTE: There is a subtle change here in the ordering of the text prompt and
the image in the call to the OpenAI API. When run against Ollama, this
ordering makes a big difference. If the prompt comes before the image, the
result is terse and not usable whereas the prompt coming after the image
works as expected and matches the non-OpenAI chat API.
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* refactor: Refactor from Ollama SDK to generic OpenAI API
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* fix: Linting, formatting, and bug fixes
The one bug fix was in the timeout arg to openai_image_request. Otherwise,
this is all style changes to get MyPy and black passing cleanly.
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* remove model from download enum
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* generalize input args for other API providers
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* rename and refactor
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* require flag for remote services
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* disable example from CI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add examples to docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-10 18:03:04 +02:00
Joan Fabrégat
6b696b504a
fix: Properly address page in pipeline _assemble_document when page_range is provided ( #1334 )
...
* Fixes #1333
Signed-off-by: Joan Fabrégat <j@fabreg.at >
* fix for the (dumb) MyPy type checker
Signed-off-by: Joan Fabrégat <j@fabreg.at >
---------
Signed-off-by: Joan Fabrégat <j@fabreg.at >
2025-04-10 16:11:28 +02:00
github-actions[bot]
72ab8e1821
chore: bump version to 2.29.0 [skip ci]
v2.29.0
2025-04-10 12:24:09 +00:00
Maxim Lysak
355d8dc7a6
chore: Logo parameter in docling CLI, prints cute ascii logo ( #1294 )
...
logo parameter in docling cli, prints cute ascii logo
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-04-09 05:29:48 +02:00
Rafael Teixeira de Lima
14e9c0ce9a
fix(docx): Adding new latex symbols, simplifying how equations are added to text ( #1295 )
...
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Identify headers through inhenrited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Identify headers through inhenrited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201 )
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix(docx): Improve text parsing (#1268 )
* chore: bump version to 2.28.4 [skip ci]
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Improve text parsing
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201 )
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Flexibilize heading detection
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Fix trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Remove trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* docs: add visual grounding example (#1270 )
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* feat(docx): add text formatting and hyperlink support (#630 )
* feat: Enable markdown text formatting for docx
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix imports
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use Formatting
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle hyperlink
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle formatting properly for DocItemLabel.PARAGRAPH
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline group
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle bullet lists
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run black and mypy
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle header and footer
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline_fmt everywhere
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run precommit
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Address feedback
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix add_list_item
Signed-off-by: SimJeg <sjegou@nvidia.com >
* fix minor bugs, mark helper methods internal
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: SimJeg <sjegou@nvidia.com >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix(pptx): check if picture shape has an image attached (#1316 )
Check if picture shape has an image attached in pptx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* chore: update lock file (#1315 )
chore: update lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* docs: add plugins docs (#1319 )
add plugin docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* feat: handle <code> tags as code blocks (#1320 )
handle <code> tags as code blocks
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Identify headers through inhenrited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Signed-off-by: SimJeg <sjegou@nvidia.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com >
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: Simon Jégou <SimJeg@users.noreply.github.com >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Fernando Santos <121275806+FernandoSSI@users.noreply.github.com >
2025-04-08 17:11:37 +02:00
Fernando Santos
0499cd1c1e
feat: handle <code> tags as code blocks ( #1320 )
...
handle <code> tags as code blocks
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com >
2025-04-08 10:32:06 +02:00
Michele Dolfi
2e99e5a54f
docs: add plugins docs ( #1319 )
...
add plugin docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-08 09:44:37 +02:00
Michele Dolfi
61de30966f
chore: update lock file ( #1315 )
...
chore: update lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-07 17:47:51 +02:00
Maxim Lysak
dc3bf9ceac
fix(pptx): check if picture shape has an image attached ( #1316 )
...
Check if picture shape has an image attached in pptx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-04-07 17:36:56 +02:00
Simon Jégou
bfcab3d677
feat(docx): add text formatting and hyperlink support ( #630 )
...
Run Docs CD / build-deploy-docs (push) Failing after 1m27s
Run Docs CI / build-docs (push) Failing after 52s
* feat: Enable markdown text formatting for docx
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix imports
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use Formatting
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle hyperlink
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle formatting properly for DocItemLabel.PARAGRAPH
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline group
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle bullet lists
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run black and mypy
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle header and footer
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline_fmt everywhere
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run precommit
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Address feedback
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix add_list_item
Signed-off-by: SimJeg <sjegou@nvidia.com >
* fix minor bugs, mark helper methods internal
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: SimJeg <sjegou@nvidia.com >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
2025-04-03 15:11:50 +02:00
Panos Vagenas
71148eb381
docs: add visual grounding example ( #1270 )
...
Run Docs CD / build-deploy-docs (push) Failing after 1m28s
Run Docs CI / build-docs (push) Failing after 54s
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-04-02 14:03:19 +02:00
Rafael Teixeira de Lima
d2d68747f9
fix(docx): Improve text parsing ( #1268 )
...
* chore: bump version to 2.28.4 [skip ci]
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Improve text parsing
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201 )
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Flexibilize heading detection
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Fix trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Remove trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
2025-04-02 12:56:44 +02:00
Guilhem VERMOREL
b3d111a3cd
fix: Tesseract OCR CLI can't process images composed with numbers only ( #1201 )
...
Run Docs CD / build-deploy-docs (push) Failing after 1m28s
Run Docs CI / build-docs (push) Failing after 53s
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
2025-03-31 10:53:49 +02:00
github-actions[bot]
44f2b081ec
chore: bump version to 2.28.4 [skip ci]
v2.28.4
2025-03-29 11:56:42 +00:00
Maxim Lysak
7afad7e52d
fix: Fixes tables when using OCR ( #1261 )
...
Run Docs CD / build-deploy-docs (push) Failing after 1m29s
Run Docs CI / build-docs (push) Failing after 51s
Fix for the tables when using OCR
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-03-29 10:06:00 +01:00
github-actions[bot]
124f921077
chore: bump version to 2.28.3 [skip ci]
v2.28.3
2025-03-28 18:30:03 +00:00
Maxim Lysak
8bd71e8e33
fix: Word-level pdf cells for tables ( #1238 )
...
Run Docs CD / build-deploy-docs (push) Failing after 1m34s
Run Docs CI / build-docs (push) Failing after 55s
* word-level pdf cells for tables
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* removed comments
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated dependency to docling-core
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-03-28 16:34:48 +01:00
github-actions[bot]
82694b2136
chore: bump version to 2.28.2 [skip ci]
v2.28.2
2025-03-26 16:52:06 +00:00
Panos Vagenas
9210812bfa
fix: improve HTML layer detection, various MD fixes ( #1241 )
...
Run Docs CD / build-deploy-docs (push) Failing after 1m31s
Run Docs CI / build-docs (push) Failing after 54s
Markdown fixes:
- properly propagate section header levels
- improve handling of list subroots without text
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-03-26 16:07:14 +01:00
Panos Vagenas
85c4df887b
fix(html): fix HTML parsed heading level ( #1244 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-03-26 10:30:23 +01:00
github-actions[bot]
9eb1686f93
chore: bump version to 2.28.1 [skip ci]
v2.28.1
2025-03-25 18:20:23 +00:00
Panos Vagenas
38b7108a22
chore: update locked deps ( #1239 )
...
Run Docs CD / build-deploy-docs (push) Failing after 1m27s
Run Docs CI / build-docs (push) Failing after 51s
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-03-25 15:48:02 +01:00
mislavmartinic
825b226fab
fix(converter): Cache same pipeline class with different options ( #1152 )
...
* Update document_converter.py
Fixing caching same class with different options by using composite key (class, options)
# TODO this will ignore if different options have been defined for the same pipeline class.
at row 292
Signed-off-by: mislavmartinic <mislav.martinic@pontistechnology.com >
* formatted script
* removed unnecessary hasattr check
* pre-commit chain run
---------
Signed-off-by: mislavmartinic <mislav.martinic@pontistechnology.com >
2025-03-25 12:18:44 +01:00
Hoang-Long Do
6df8827231
fix(debug): Missing translation of bbox to to_bounding_box ( #1220 )
...
* Fix: Add missing bbox attribute to PdfTextCell
* Fix: Add missing bbox attribute to PdfTextCell
Signed-off-by: hl2311 <dhlong2301@gmail.com >
* fix: Refactor missing bbox attribute to PdfTextCell
Signed-off-by: hl2311 <dhlong2301@gmail.com >
* Signed-off-by: hl2311 <dhlong2301@gmail.com >
fix: Refactor missing bbox attribute to PdfTextCell
---------
Signed-off-by: hl2311 <dhlong2301@gmail.com >
2025-03-25 12:18:10 +01:00
Rafael Teixeira de Lima
f739d0e4c5
fix(docx): identifying numbered headers ( #1231 )
...
* Modifications to identify numbered headers
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Add style check
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
2025-03-25 11:41:02 +01:00
Clément Doumouro
0974ba4e1c
docs(examples): batch conversion doc raises_on_error ( #1147 )
...
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
2025-03-25 11:14:39 +01:00
Peter Dave Hello
8ebb0bf1a0
chore: properly clean up apt temporary files in Dockerfile ( #1223 )
...
Signed-off-by: Peter Dave Hello <hsu@peterdavehello.org >
2025-03-25 11:10:09 +01:00
github-actions[bot]
7df157204b
chore: bump version to 2.28.0 [skip ci]
v2.28.0
2025-03-19 15:18:10 +00:00
Maxim Lysak
1c26769785
feat(SmolDocling): Support MLX acceleration in VLM pipeline ( #1199 )
...
* Initial implementation to support MLX for VLM pipeline and SmolDocling
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* mlx_model unit
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Add CLI choices for VLM pipeline and model
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Initial implementation to support MLX for VLM pipeline and SmolDocling
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* mlx_model unit
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Add CLI choices for VLM pipeline and model
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Updated minimal vlm pipeline example
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* make vlm_pipeline python3.9 compatible
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fixed extract_text_from_backend definition
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated README
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated example
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated documentation
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* corrections in the documentation
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Consmetic changes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-03-19 15:38:54 +01:00
Maciej Wieczorek
b454aa1551
feat: Add PPTX notes slides ( #474 )
...
* feat: Add PPTX notes slides
Presenter notes may have useful information and should also be extracted.
Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co >
* feat: Move presenter notes into furniture
Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co >
---------
Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co >
2025-03-19 14:52:09 +01:00
Christoph Auer
f5adfb9724
fix: Determine correct page size in DoclingParseV4Backend ( #1196 )
...
Run Docs CD / build-deploy-docs (push) Failing after 1m24s
Run Docs CI / build-docs (push) Failing after 51s
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-03-19 11:05:42 +01:00
Cesar Berrospi Ramis
d5f7798763
test(html): fix regression test after docling-core update ( #1197 )
...
Update docling-core dependency to version 2.23.3.
Fix regression test of HTML backend after docling-core dependency update.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-03-19 11:03:46 +01:00
Rafael Teixeira de Lima
0b707d0882
fix(msword): Fixing function return in equations handling ( #1194 )
...
* Fixing function return
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Add message
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
2025-03-19 10:34:25 +01:00
Michele Dolfi
1d680b0a32
docs: Linux Foundation AI & Data ( #1183 )
...
* point the auxiliary files to the community repo and add lfai in README
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update docs index
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-03-19 09:05:57 +01:00