760 Commits

Author SHA1 Message Date
github-actions[bot]
dcb57bf528 chore: bump version to 2.63.0 [skip ci] v2.63.0 2025-11-20 14:42:37 +00:00
Christoph Auer
2087c6bf9f fix: Respect document_timeout in new threaded StandardPdfPipeline (#2653)
* fix: Respect document_timeout in new threaded StandardPdfPipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* add test case to test_options

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Make sure unprocessed pages are not getting into assemble_document

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-11-20 14:57:14 +01:00
Cesar Berrospi Ramis
54e65d9511 chore: update Milvus on examples and references to deprecated method (#2664)
* docs(examples): update the set up of Milvus Lite

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: remove references to deprecated save_as_document_tokens

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-20 13:22:45 +01:00
kadirpekel
ce5a099dfd docs: Add Hector as compatible AI agent platform integration (#2662)
docs: add Hector as compatible AI agent platform integration

Signed-off-by: Kadir Pekel <kadirpekel@gmail.com>
2025-11-20 13:02:47 +01:00
Peter W. J. Staar
b559813b9b feat: add save and load for conversion result (#2648)
* feat: added save_as_json and load_from_json to ConversionResult

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added a test

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the save and load for ConversionResult

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the signature

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* refactored load/save into ConversionAssets

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the DoclingVersion class

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* renamed time_stamp to timestamp

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-11-20 12:45:26 +01:00
Cristi Burcă
6fb9a5f98a fix: In DocumentConverter.convert_string() make nullable name parameter optional (#2660)
* fix: In DocumentConverter.convert_string() make nullable name parameter actually optional
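
The distinction behind this fix can be illustrated with a small sketch. The functions below are hypothetical and do not reproduce the actual `DocumentConverter.convert_string()` signature; they only show that an `Optional[str]` annotation allows `None` as a value but does not let the caller omit the argument unless a default is supplied.

```python
from typing import Optional


def convert_required(content: str, name: Optional[str]) -> str:
    # Hypothetical: `name` may be None, but the caller must still pass it.
    return name or "unnamed"


def convert_optional(content: str, name: Optional[str] = None) -> str:
    # Hypothetical: the default value is what makes `name` omittable.
    return name or "unnamed"


def call_without_name(func) -> str:
    # Report whether a call that leaves out `name` is accepted.
    try:
        return func("# Title")
    except TypeError:
        return "TypeError"
```

Calling `call_without_name(convert_required)` reports a `TypeError`, while the variant with a default accepts the same call.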

* DCO Remediation Commit for Cristi Burcă <mail@scribu.net>

I, Cristi Burcă <mail@scribu.net>, hereby add my Signed-off-by to this commit: 2b256e3528

Signed-off-by: Cristi Burcă <mail@scribu.net>

---------

Signed-off-by: Cristi Burcă <mail@scribu.net>
2025-11-20 06:24:27 +01:00
Michele Dolfi
463a3fd474 fix: Enable GPU for RapidOCR when available (#2659)
* add setting for using gpu

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-11-19 17:12:00 +01:00
Harry Ho
b216ad848d docs: Added documentation to use SuryaOCR via plugin docling-surya (#2533)
* docs: Added documentation to use SuryaOCR via plugin `docling-surya`

Signed-off-by: Harry Ho <kho7@student.umgc.edu>

* Add PyPI link for docling-surya package

Added a link to the PyPI page for docling-surya.

Signed-off-by: Harry Ho <4719770+harrykhh@users.noreply.github.com>

* Add licensing note for SuryaOCR integration

Added important licensing note regarding SuryaOCR integration. 

Signed-off-by: Harry Ho <4719770+harrykhh@users.noreply.github.com>

* Ran linter to reformat

Signed-off-by: Harry Ho <4719770+harrykhh@users.noreply.github.com>

---------

Signed-off-by: Harry Ho <kho7@student.umgc.edu>
Signed-off-by: Harry Ho <4719770+harrykhh@users.noreply.github.com>
Co-authored-by: Harry Ho <kho7@student.umgc.edu>
2025-11-19 15:27:24 +01:00
Robyn Johnson
03e7c7d924 docs: Fix broken homepage links (#2651)
* docs: Fix broken homepage links

Signed-off-by: Robyn J <robynjohnson@us.ibm.com>

* docs: Remediate sign-off

DCO Remediation Commit for Robyn J <bobbinrobyn@users.noreply.github.com>

I, Robyn J <bobbinrobyn@users.noreply.github.com>, hereby add my Signed-off-by to this commit: e873e24c11

Signed-off-by: Robyn J <bobbinrobyn@users.noreply.github.com>

---------

Signed-off-by: Robyn J <robynjohnson@us.ibm.com>
Signed-off-by: Robyn J <bobbinrobyn@users.noreply.github.com>
2025-11-19 08:19:56 +01:00
Michele Dolfi
8af228f1e2 docs(examples): processing parquet file of images (#2641)
* add example processing parquet file of images

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* vlm using vllm api

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use openvino and add more docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add default input file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* change default to standard for running in CI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use simple rapidocr without openvino in the CI example

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-11-19 06:39:25 +01:00
Michele Dolfi
da4c2e9dbe fix: remove py3.14 requirement for default rapidocr (#2639)
* remove py3.14 requirement for default rapidocr

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove easyocr

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-11-18 17:23:43 +01:00
Ryan Soliveres
d549445e78 docs: Move Installation and Quickstart (Usage) under Getting started (#2644)
* docs: Move Installation and Quickstart (Usage) under Getting started

Moved Installation and Usage (Quickstart) under Getting started section
Rename installation folder to documentation folder
Rename installation/index.md to documentation/installation.md
Duplicate usage/index.md to documentation directory and rename it to documentation/quickstart.md
Add redirection from installation and usage

Signed-off-by: Ryan S <ryansoliveres@users.noreply.github.com>

* docs: Move Installation and Quickstart under Getting started

Signed-off-by: ryansoliveres <ryan.soliveres@yahoo.com>

* docs: Move Installation and Quickstart under Getting started

Signed-off-by: ryansoliveres <ryan.soliveres@yahoo.com>

* git commit -m "DCO Remediation Commit for rysoliveres <ryan.soliveres@yahoo.com>

I, rysoliveres <ryan.soliveres@yahoo.com>, hereby add my Signed-off-by to this commit: b7ae13e3d8

Signed-off-by: rysoliveres <ryan.soliveres@yahoo.com>"

Signed-off-by: ryansoliveres <ryan.soliveres@yahoo.com>

* git commit --allow-empty -m "DCO Remediation Commit for rysoliveres <ryan.soliveres@yahoo.com>

I, rysoliveres <ryan.soliveres@yahoo.com>, hereby add my Signed-off-by to this commit: b7ae13e3d8

Signed-off-by: rysoliveres <ryan.soliveres@yahoo.com>"

Signed-off-by: ryansoliveres <ryan.soliveres@yahoo.com>

* DCO Remediation Commit for rysoliveres <ryan.soliveres@yahoo.com>

I, rysoliveres <ryan.soliveres@yahoo.com>, hereby add my Signed-off-by to this commit: b7ae13e3d8

Signed-off-by: rysoliveres <ryan.soliveres@yahoo.com>

Signed-off-by: ryansoliveres <ryan.soliveres@yahoo.com>

* DCO Remediation Commit for rysoliveres <ryan.soliveres@yahoo.com>

I, rysoliveres <ryan.soliveres@yahoo.com>, hereby add my Signed-off-by to this commit: b7ae13e3d8

Signed-off-by: rysoliveres <ryan.soliveres@yahoo.com>

Signed-off-by: ryansoliveres <ryan.soliveres@yahoo.com>

---------

Signed-off-by: Ryan S <ryansoliveres@users.noreply.github.com>
Signed-off-by: ryansoliveres <ryan.soliveres@yahoo.com>
2025-11-18 17:09:41 +01:00
Panos Vagenas
ac9fc585bb docs: add redirection from getting started page (#2640)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-11-17 14:13:51 +01:00
Cesar Berrospi Ramis
f5528623a7 docs(examples): remove deprecation warnings with export_to_dataframe (#2638)
fix: remove deprecation warnings with export_to_dataframe

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-17 12:48:41 +01:00
github-actions[bot]
d6ddf9f4cb chore: bump version to 2.62.0 [skip ci] v2.62.0 2025-11-17 11:34:08 +00:00
Peter W. J. Staar
3495b73de8 feat: add the Image backend (#2627)
* feat: add the Image backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the pre-commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Fixed single- versus multi-frame image formats

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fix: Proper usage of ImageDocumentBackend in the pipeline, deprecate old code.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Adapt tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: correct mets_gbs backend test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Make ImagePageBackend.get_bitmap_rects() yield

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-11-17 11:37:22 +01:00
Robyn Johnson
ae30373ee7 docs: combine Home and Getting Started pages (#2600)
* Update mkdocs.yml

Remove the navigation.sections feature so that navigation menus will collapse & expand. They are collapsed by default.

* docs: add sign-off

DCO Remediation Commit for Robyn J <bobbinrobyn@users.noreply.github.com>

I, Robyn J <bobbinrobyn@users.noreply.github.com>, hereby add my Signed-off-by to this commit: b7d7441827

Signed-off-by: Robyn J <bobbinrobyn@users.noreply.github.com>

* docs: Combine Home and Getting Started page

Combine home and getting started pages, and rename the page "Documentation"

Signed-off-by: Robyn J <bobbinrobyn@users.noreply.github.com>

---------

Signed-off-by: Robyn J <bobbinrobyn@users.noreply.github.com>
2025-11-14 13:29:25 +01:00
Peter W. J. Staar
14b436d590 fix: correct the model-repo name (#2624)
* fix: correct the model-repo name

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated model-id

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-11-14 13:21:08 +01:00
Christoph Auer
4852d8b4f2 feat(experimental): Layout + VLM model with layout prompt (#2244)
* adding granite-docling preview

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the model specs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Add Layout+VLM pipeline with prompt injection, ApiVlmModel updates

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update layout injection, move to experimental

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Adjust defaults

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Map Layout+VLM pipeline to GraniteDocling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove base_prompt from layout injection prompt

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reinstate custom prompt

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* add demo_layout file that produces with vs without layout injection

Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* feat: wrap vlm_inference around process_images

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* feat: carry input prompt + number of input tokens

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* fix: adapt example to run on local test file

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* fix: example now expects single document

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* feat: add layout example to EXAMPLES_TO_SKIP

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* feat: address comments on git

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* feat: add inference wrapper for hf_transformers + carry input prompt

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* Feat: add track_input_prompt to ApiVlmOptions, and track input prompt as part of api vlm

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* fix: Ensure backward-compatible build_prompt by adding _internal_page ag

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Ensure backward-compatible build_prompt by adding _internal_page ag

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for demo

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Typing fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Restoring lost changes in vllm_model

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Restoring vlm_pipeline_api_model example

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: ElHachem02 <peterelhachem02@gmail.com>
2025-11-12 13:42:09 +01:00
Cesar Berrospi Ramis
054c4a634d fix(docx): parse page headers and footers (#2599)
* fix(docx): parse page headers and footers

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): rename _add_header to _add_heading

To avoid confusion, rename the _add_header function to _add_heading,
since the function adds section headings.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): extend the page header and footer parsing to any content type

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): fix _add_header_footer function

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-10 16:10:12 +01:00
github-actions[bot]
463051b852 chore: bump version to 2.61.2 [skip ci] v2.61.2 2025-11-10 11:44:59 +00:00
Panos Vagenas
5c27567c41 fix: default to EasyOCR in Python 3.14 (#2605)
fix: default to EasyOCR in Python 3.14

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-11-10 12:09:00 +01:00
Peter W. J. Staar
06ae8ae29a chore: replace ds4sd with docling-project (#2596)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-11-07 11:25:56 +01:00
github-actions[bot]
c21327cd74 chore: bump version to 2.61.1 [skip ci] v2.61.1 2025-11-06 05:19:20 +00:00
Cesar Berrospi Ramis
ef623ffcee fix(docx): slow table parsing (#2553)
* chore(docx): remove unnecessary import

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): simplify parsing of simple tables

Simplify the parsing of tables with just text (no rich cells).
Move nested function group_cell_elements out of _handle_tables for readability.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): reuse method for finding inline pictures

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): format strikethrough text

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(docx): use fixtures to avoid converting same file multiple times

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): remove unnecessary argument docx_obj in functions

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(docx): add test for rich table cells

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): small improvements in backend and its unit tests

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): parse superscript and subscript formatted text

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-06 05:25:53 +01:00
Cesar Berrospi Ramis
0ba8d5d9e3 fix(html): slow table parsing (#2582)
* fix(html): simplify parsing of simple table cells

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(html): add test for rich table cells

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(html): ensure table cells with formatted text are parsed as RichTableCell

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(html): simplify process_rich_table_cells since only rich cells are processed

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(html): formatted cell runs should be parsed as text items respecting the order

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: pin latest docling-core and update uv.lock

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: upgrade dependencies on uv.lock

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-06 05:25:36 +01:00
Robyn Johnson
8da3d287ed docs: make navigation menus collapse and expand (#2573)
* Update mkdocs.yml

Remove the navigation.sections feature so that navigation menus will collapse & expand. They are collapsed by default.

* docs: add sign-off

DCO Remediation Commit for Robyn J <bobbinrobyn@users.noreply.github.com>

I, Robyn J <bobbinrobyn@users.noreply.github.com>, hereby add my Signed-off-by to this commit: b7d7441827

Signed-off-by: Robyn J <bobbinrobyn@users.noreply.github.com>

---------

Signed-off-by: Robyn J <bobbinrobyn@users.noreply.github.com>
2025-11-06 05:25:19 +01:00
github-actions[bot]
0ccc0a3245 chore: bump version to 2.61.0 [skip ci] v2.61.0 2025-11-06 04:25:06 +00:00
Panos Vagenas
fa925741b6 fix: temporarily pin NuExtract to working revision (#2588)
* fix: temporarily pin NuExtract revision

NuExtract rev 489efed was causing MPS errors

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* Revise revision comment for NuExtract transformer

Updated revision comment for NU_EXTRACT_2B_TRANSFORMERS.

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* pass revision to model download

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-11-05 21:23:12 +01:00
peets
6a04e27352 feat(vlm): track generated tokens and stop reasons for VLM models (#2543)
* feat: add enum StopReason and use it in VlmPrediction

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* add vlm_inference time for api calls and track stop reason

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* fix: rename enum to VlmStopReason

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* Propagate partial success status if page reaches max tokens

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* feat: page with generation stopped by loop detector create partial success status

Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>

* Add hint for future improvement

Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>

* fix: remove vlm_stop_reason from extracted page data, add UNSPECIFIED state as VlmStopReason to avoid null value

Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>

---------

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
Co-authored-by: Peter El Hachem <peter.el.hachem@ibm.com>
2025-11-04 19:39:09 +01:00
정물결
1a5146abc9 fix(ocr): use PSM integer values directly instead of constructor (#2578)
* fix(ocr): use PSM integer values directly instead of constructor

- Use integer psm value directly instead of calling tesserocr.PSM()
- Fixed in both main_psm and script_readers initialization
- tesserocr.PSM is a class with integer constants, not an enum

Fixes #2576
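
The failure mode described above can be sketched without the OCR library. The `PSM` class below is a hypothetical stand-in, not the real `tesserocr.PSM`; it assumes only what the commit message states, namely that the class holds integer constants and is not an enum.

```python
class PSM:
    # Hypothetical stand-in: plain integer constants, no Enum base class.
    AUTO = 3
    SINGLE_BLOCK = 6


def psm_via_constructor(value: int):
    # Works for an Enum (lookup by value), fails for a plain constants class.
    return PSM(value)


def psm_direct(value: int) -> int:
    # The integer is already the value that gets passed on.
    return value


def try_constructor(value: int) -> str:
    # Report whether the constructor-style lookup is accepted.
    try:
        psm_via_constructor(value)
        return "ok"
    except TypeError:
        return "TypeError"
```

With a constants-only class, the constructor call raises a `TypeError`, so passing the integer through unchanged is the simpler route.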

* DCO Remediation Commit for mulgyeol <mulgyeoljung@gmail.com>

I, mulgyeol <mulgyeoljung@gmail.com>, hereby add my Signed-off-by to this commit: da63a17a3c

Signed-off-by: mulgyeol <mulgyeoljung@gmail.com>

---------

Signed-off-by: mulgyeol <mulgyeoljung@gmail.com>
2025-11-04 19:32:41 +01:00
github-actions[bot]
32a5aed5ea chore: bump version to 2.60.1 [skip ci] v2.60.1 2025-11-04 11:26:12 +00:00
Panos Vagenas
0e1b0bd816 chore: switch print statements to debug logging (#2569)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-11-04 11:32:39 +01:00
Johannes Damp
fb737d026e chore: fix malformed f-string (#2563)
* fix: incorrect f-string in docling.datamodel.document

* DCO Remediation Commit for Johannes Damp <jdamp@users.noreply.github.com>

I, Johannes Damp <jdamp@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 0f690a863a

Signed-off-by: Johannes Damp <jdamp@users.noreply.github.com>

---------

Signed-off-by: Johannes Damp <jdamp@users.noreply.github.com>
2025-11-04 11:01:26 +01:00
peets
8360aa5449 fix: extract response from api_image_request in picture description (#2571)
Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
Co-authored-by: Peter El Hachem <peter.el.hachem@ibm.com>
2025-11-04 08:39:15 +01:00
github-actions[bot]
3467b0a035 chore: bump version to 2.60.0 [skip ci] v2.60.0 2025-10-31 14:43:29 +00:00
Michele Dolfi
268d027c8f feat: Use threading in the standard pipeline and move old behavior to legacy (#2452)
* rename standard to legacy

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove old standard pipeline

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* move threaded to standard

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add backwards compatible threaded pipeline

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Updates for threaded pipeline to lower memory requirements

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* updating deps seems to remove the corrupted double-linked list error

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update pinning

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use main lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add more threadsafe blocks

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename batch_timeout_seconds

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-10-31 14:42:11 +01:00
Welteam
01577e92d1 docs: Update link to Open WebUI docs (#2549)
Fix dead link to Open WebUI docs

Signed-off-by: Welteam <8932313+Welteam@users.noreply.github.com>
2025-10-31 13:21:11 +01:00
Michele Dolfi
cb100437fa docs: Update installation options with extras and review FAQ (#2548)
* revise install docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add more FAQ

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-31 13:21:01 +01:00
Yasir Ali
741c44fa45 docs: fix typos (#2546)
docs: fix typos in enrichments.md ('analize' -> 'analyze', 'consise' -> 'concise')

Signed-off-by: Yasir Ali <engr23002@gmail.com>
2025-10-31 10:29:34 +01:00
Michele Dolfi
a51275d080 fix(pdf): threadsafe for pypdfium2 backend (#2527)
* add threadsafe test

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* test backend

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* test threaded pipeline

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add test_pypdfium_threaded_pipeline

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add more threadsafe blocks

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix threadsafe in pypdfium backend

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove unnecessary tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* restore clean test

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-30 17:58:39 +01:00
github-actions[bot]
d27fe92e01 chore: bump version to 2.59.0 [skip ci] v2.59.0 2025-10-30 13:05:56 +00:00
Michele Dolfi
97aa06bfbc docs: Add details and examples on optimal GPU setup (#2531)
* docs for GPU optimizations

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* improve time reporting and improve execution

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix standard pipeline

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* tune examples with batch size 64

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add benchmark results

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* improve docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* typo in excluded tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* explicit pipeline in table

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-30 13:22:05 +01:00
glypt
d9c90eb45e fix: xlsx cell parsing, now returning values instead of formulas (#2520)
* fix: xlsx doc parsing, now returning values instead of formulas

Signed-off-by: glypt <8trash-can8@protonmail.ch>

* fix: add test for better coverage of xlsx backend

Signed-off-by: glypt <8trash-can8@protonmail.ch>

* fix: add the total of ducks as a formula in the tests/data

This also adds the test that the value 310 is contained in the table.
Without the fix from the previous commit, it would return "B7+C7"

Signed-off-by: glypt <8trash-can8@protonmail.ch>
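
Why a reader can return either "B7+C7" or 310 for the same cell can be shown with a sketch that uses only the standard library. It assumes the usual SpreadsheetML layout, where a cell element carries the formula in `<f>` and the last computed result in `<v>`. The sheet fragment, the cell references, the operand values, and the `read_cells` helper are made up for illustration; this is not the docling xlsx backend.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# Illustrative sheet fragment; the operand values are invented.
SHEET_XML = """<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="7">
      <c r="B7"><v>120</v></c>
      <c r="C7"><v>190</v></c>
      <c r="D7"><f>B7+C7</f><v>310</v></c>
    </row>
  </sheetData>
</worksheet>"""


def read_cells(sheet_xml: str, prefer_values: bool = True) -> dict:
    # Map each cell reference to its cached value or, optionally, its formula.
    root = ET.fromstring(sheet_xml)
    cells = {}
    for cell in root.iterfind(".//m:c", NS):
        formula = cell.find("m:f", NS)
        value = cell.find("m:v", NS)
        if formula is not None and not prefer_values:
            cells[cell.get("r")] = formula.text
        else:
            cells[cell.get("r")] = value.text if value is not None else None
    return cells
```

Reading with `prefer_values=True` yields the cached result for the formula cell, which mirrors the behaviour the fix aims for.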

---------

Signed-off-by: glypt <8trash-can8@protonmail.ch>
2025-10-29 11:35:51 +01:00
peets
b6c892b505 feat(vlm): add num_tokens as attribute for VlmPrediction (#2489)
* feat: add num_tokens as attribute for VlmPrediction

* feat: implement tokens tracking for api_vlm

Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>

* DCO Remediation Commit for ElHachem02 <peterelhachem02@gmail.com>

I, ElHachem02 <peterelhachem02@gmail.com>, hereby add my Signed-off-by to this commit: 311287f562

Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>

* DCO Remediation Commit for ElHachem02 <peterelhachem02@gmail.com>

I, ElHachem02 <peterelhachem02@gmail.com>, hereby add my Signed-off-by to this commit: 311287f562

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* update return type

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* add time recorder for vlm inference and track generated token ids depending on config

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* update num_tokens to have None as value on exception

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* set default value of num_tokens to None

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

---------

Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
Signed-off-by: peets <100425207+ElHachem02@users.noreply.github.com>
Co-authored-by: Peter El Hachem <peter.el.hachem@ibm.com>
2025-10-28 17:18:44 +01:00
Michele Dolfi
cdffb47b9a feat: Support for Python 3.14 (#2530)
* fix dependencies for py314

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add metadata and CI tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add back gliner

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update error message about python 3.14 availability

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* skip tests which cannot run on py 3.14

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix lint

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove vllm from py 3.14 deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* safe import for vllm

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove torch.compile()

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update checkbox results after docling-core changes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* cannot run mlx example in CI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add test for rapidocr backends and skip onnxruntime on py3.14

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix other occurrences of torch.compile()

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* allow torch.compile for Python <3.14. proper support will be introduced with new torch releases

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
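
A version gate of the kind this step describes might look like the sketch below. The `maybe_compile` helper and its parameters are hypothetical and not taken from the docling code; the compile step is passed in as a plain callable so the example runs without torch.

```python
import sys
from typing import Callable, Tuple


def maybe_compile(
    fn: Callable,
    compile_fn: Callable[[Callable], Callable],
    version: Tuple[int, int] = sys.version_info[:2],
) -> Callable:
    # Only apply the compile step on interpreters older than 3.14.
    if version < (3, 14):
        return compile_fn(fn)
    # On 3.14 and later, fall back to the uncompiled callable.
    return fn
```

In practice `compile_fn` would be something like `torch.compile`, and the `version` argument exists only so the branch can be exercised on any interpreter.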

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-28 14:32:15 +01:00
Cesar Berrospi Ramis
9a6fdf936b docs: update opensearch notebook and backend documentation (#2519)
* docs(opensearch): update the example notebook RAG with OpenSearch

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs(uspto): remove direct usage of the backend class for conversion

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: remove direct usage of backends from documentation

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-10-27 10:02:50 +01:00
github-actions[bot]
10c1f06b74 chore: bump version to 2.58.0 [skip ci] v2.58.0 2025-10-22 11:31:29 +00:00
Michele Dolfi
bbe82a68d0 feat(pdf): Support for password-protected PDF documents (#2499)
* add test and example for PDF with password

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use docling-parse with new password feature

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add pdfbackendoptions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* generalize backend_options and add PdfBackendOptions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add pdf-password option

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update exception test

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix docs description

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-22 12:48:01 +02:00
Michele Dolfi
89820d01b5 perf: use docling-parse-v4 as default (#2503)
use docling-parse-v4 as default

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-21 17:55:43 +02:00