Commit Graph

748 Commits

Author SHA1 Message Date
Panos Vagenas
ac9fc585bb docs: add redirection from getting started page (#2640)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-11-17 14:13:51 +01:00
Cesar Berrospi Ramis
f5528623a7 docs(examples): remove deprecation warnings with export_to_dataframe (#2638)
fix: remove deprecation warnings with export_to_dataframe

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-17 12:48:41 +01:00
github-actions[bot]
d6ddf9f4cb chore: bump version to 2.62.0 [skip ci] v2.62.0 2025-11-17 11:34:08 +00:00
Peter W. J. Staar
3495b73de8 feat: add the Image backend (#2627)
* feat: add the Image backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the pre-commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Fixed single- versus multi-frame image formats

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fix: Proper usage of ImageDocumentBackend in the pipeline, deprecate old code.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Adapt tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: correct mets_gbs backend test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Make ImagePageBackend.get_bitmap_rects() yield

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-11-17 11:37:22 +01:00
Robyn Johnson
ae30373ee7 docs: combine Home and Getting Started pages (#2600)
* Update mkdocs.yml

Remove navigation.sections feature so that navigation menus will collapse & expand. They are collapsed by default.

* docs: add sign-off

DCO Remediation Commit for Robyn J <bobbinrobyn@users.noreply.github.com>

I, Robyn J <bobbinrobyn@users.noreply.github.com>, hereby add my Signed-off-by to this commit: b7d7441827

Signed-off-by: Robyn J <bobbinrobyn@users.noreply.github.com>

* docs: Combine Home and Getting Started page

Combine home and getting started pages, and rename the page "Documentation"

Signed-off-by: Robyn J <bobbinrobyn@users.noreply.github.com>

---------

Signed-off-by: Robyn J <bobbinrobyn@users.noreply.github.com>
2025-11-14 13:29:25 +01:00
Peter W. J. Staar
14b436d590 fix: correct the model-repo name (#2624)
* fix: correct the model-repo name

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated model-id

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-11-14 13:21:08 +01:00
Christoph Auer
4852d8b4f2 feat(experimental): Layout + VLM model with layout prompt (#2244)
* adding granite-docling preview

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the model specs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Add Layout+VLM pipeline with prompt injection, ApiVlmModel updates

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update layout injection, move to experimental

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Adjust defaults

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Map Layout+VLM pipeline to GraniteDocling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove base_prompt from layout injection prompt

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reinstate custom prompt

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* add demo_layout file that produces with vs without layout injection

Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* feat: wrap vlm_inference around process_images

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* feat: carry input prompt + number of input tokens

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* fix: adapt example to run on local test file

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* fix: example now expects single document

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* feat: add layout example to EXAMPLES_TO_SKIP

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* feat: address comments on git

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* feat: add inference wrapper for hf_transformers + carry input prompt

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* Feat: add track_input_prompt to ApiVlmOptions, and track input prompt as part of api vlm

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* fix: Ensure backward-compatible build_prompt by adding _internal_page ag

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Ensure backward-compatible build_prompt by adding _internal_page ag

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for demo

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Typing fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Restoring lost changes in vllm_model

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Restoring vlm_pipeline_api_model example

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: ElHachem02 <peterelhachem02@gmail.com>
2025-11-12 13:42:09 +01:00
Cesar Berrospi Ramis
054c4a634d fix(docx): parse page headers and footers (#2599)
* fix(docx): parse page headers and footers

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): rename _add_header with _add_heading

To avoid confusion, rename _add_header function name with _add_heading
since the function is about adding section headings.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): extend the page header and footer parsing to any content type

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): fix _add_header_footer function

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-10 16:10:12 +01:00
github-actions[bot]
463051b852 chore: bump version to 2.61.2 [skip ci] v2.61.2 2025-11-10 11:44:59 +00:00
Panos Vagenas
5c27567c41 fix: default to EasyOCR in Python 3.14 (#2605)
fix: default to EasyOCR in Python 3.14

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-11-10 12:09:00 +01:00
Peter W. J. Staar
06ae8ae29a chore: replace ds4sd with docling-project (#2596)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-11-07 11:25:56 +01:00
github-actions[bot]
c21327cd74 chore: bump version to 2.61.1 [skip ci] v2.61.1 2025-11-06 05:19:20 +00:00
Cesar Berrospi Ramis
ef623ffcee fix(docx): slow table parsing (#2553)
* chore(docx): remove unnecessary import

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): simplify parsing of simple tables

Simplify the parsing of tables with just text (no rich cells).
Move nested function group_cell_elements out of _handle_tables for readability.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): reuse method for finding inline pictures

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): format strikethrough text

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(docx): use fixtures to avoid converting same file multiple times

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(docx): remove unnecessary argument docx_obj in functions

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(docx): add test for rich table cells

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): small improvements in backend and its unit tests

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(docx): parse superscript and subscript formatted text

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-06 05:25:53 +01:00
Cesar Berrospi Ramis
0ba8d5d9e3 fix(html): slow table parsing (#2582)
* fix(html): simplify parsing of simple table cells

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(html): add test for rich table cells

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(html): ensure table cells with formatted text are parsed as RichTableCell

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(html): simplify process_rich_table_cells since only rich cells are processed

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(html): formatted cell runs should be parsed as text items respecting the order

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: pin latest docling-core and update uv.lock

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: upgrade dependencies on uv.lock

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-11-06 05:25:36 +01:00
Robyn Johnson
8da3d287ed docs: make navigation menus collapse and expand (#2573)
* Update mkdocs.yml

Remove navigation.sections feature so that navigation menus will collapse & expand. They are collapsed by default.

* docs: add sign-off

DCO Remediation Commit for Robyn J <bobbinrobyn@users.noreply.github.com>

I, Robyn J <bobbinrobyn@users.noreply.github.com>, hereby add my Signed-off-by to this commit: b7d7441827

Signed-off-by: Robyn J <bobbinrobyn@users.noreply.github.com>

---------

Signed-off-by: Robyn J <bobbinrobyn@users.noreply.github.com>
2025-11-06 05:25:19 +01:00
github-actions[bot]
0ccc0a3245 chore: bump version to 2.61.0 [skip ci] v2.61.0 2025-11-06 04:25:06 +00:00
Panos Vagenas
fa925741b6 fix: temporarily pin NuExtract to working revision (#2588)
* fix: temporarily pin NuExtract revision

NuExtract rev 489efed was causing MPS errors

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* Revise revision comment for NuExtract transformer

Updated revision comment for NU_EXTRACT_2B_TRANSFORMERS.

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* pass revision to model download

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-11-05 21:23:12 +01:00
peets
6a04e27352 feat(vlm): track generated tokens and stop reasons for VLM models (#2543)
* feat: add enum StopReason and use it in VlmPrediction

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* add vlm_inference time for api calls and track stop reason

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* fix: rename enum to VlmStopReason

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* Propagate partial success status if page reaches max tokens

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* feat: page with generation stopped by loop detector create partial success status

Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>

* Add hint for future improvement

Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>

* fix: remove vlm_stop_reason from extracted page data, add UNSPECIFIED state as VlmStopReason to avoid null value

Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>

---------

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
Co-authored-by: Peter El Hachem <peter.el.hachem@ibm.com>
2025-11-04 19:39:09 +01:00
정물결
1a5146abc9 fix(ocr): use PSM integer values directly instead of constructor (#2578)
* fix(ocr): use PSM integer values directly instead of constructor

- Use integer psm value directly instead of calling tesserocr.PSM()
- Fixed in both main_psm and script_readers initialization
- tesserocr.PSM is a class with integer constants, not an enum

Fixes #2576

* DCO Remediation Commit for mulgyeol <mulgyeoljung@gmail.com>

I, mulgyeol <mulgyeoljung@gmail.com>, hereby add my Signed-off-by to this commit: da63a17a3c

Signed-off-by: mulgyeol <mulgyeoljung@gmail.com>

---------

Signed-off-by: mulgyeol <mulgyeoljung@gmail.com>
2025-11-04 19:32:41 +01:00
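The PSM fix in commit 1a5146abc9 above rests on the difference between an enum and a class that only holds integer constants. The sketch below illustrates that difference with a stand-in class; `PSM`, `resolve_psm`, and `call_constructor` here are hypothetical names for illustration and are not the real tesserocr API or the code changed in the commit.

```python
class PSM:
    """Stand-in for a constants-only class (hypothetical; not the real tesserocr.PSM)."""

    AUTO = 3
    SINGLE_BLOCK = 6


def resolve_psm(psm: int) -> int:
    """Return the page segmentation mode as a plain integer.

    A constants-only class has no constructor that accepts a value, so the
    integer is validated and passed through instead of being wrapped.
    """
    valid = {value for name, value in vars(PSM).items() if name.isupper()}
    if psm not in valid:
        raise ValueError(f"unsupported PSM value: {psm}")
    return psm


def call_constructor(psm: int) -> bool:
    """Report whether PSM(psm) succeeds; it does not for a constants-only class."""
    try:
        PSM(psm)  # a plain class without __init__ arguments rejects this call
    except TypeError:
        return False
    return True
```

With an enum, `PSM(6)` would look up the member with value 6; with a plain class of constants, the same call is an attempt to construct an instance, which is why the integer is used directly.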
github-actions[bot]
32a5aed5ea chore: bump version to 2.60.1 [skip ci] v2.60.1 2025-11-04 11:26:12 +00:00
Panos Vagenas
0e1b0bd816 chore: switch print statements to debug logging (#2569)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-11-04 11:32:39 +01:00
Johannes Damp
fb737d026e chore: fix malformed f-string (#2563)
* fix: incorrect f-string in docling.datamodel.document

* DCO Remediation Commit for Johannes Damp <jdamp@users.noreply.github.com>

I, Johannes Damp <jdamp@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 0f690a863a

Signed-off-by: Johannes Damp <jdamp@users.noreply.github.com>

---------

Signed-off-by: Johannes Damp <jdamp@users.noreply.github.com>
2025-11-04 11:01:26 +01:00
peets
8360aa5449 fix: extract response from api_image_request in picture description (#2571)
Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
Co-authored-by: Peter El Hachem <peter.el.hachem@ibm.com>
2025-11-04 08:39:15 +01:00
github-actions[bot]
3467b0a035 chore: bump version to 2.60.0 [skip ci] v2.60.0 2025-10-31 14:43:29 +00:00
Michele Dolfi
268d027c8f feat: Use threading in the standard pipeline and move old behavior to legacy (#2452)
* rename standard to legacy

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove old standard pipeline

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* move threaded to standard

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add backwards compatible threaded pipeline

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Updates for threaded pipeline to lower memory requirements

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* updating deps seems to remove the corrupted double-linked list error

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update pinning

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use main lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add more threadsafe blocks

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename batch_timeout_seconds

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-10-31 14:42:11 +01:00
Welteam
01577e92d1 docs: Update link to Open WebUI docs (#2549)
Fix dead link to Open WebUI docs

Signed-off-by: Welteam <8932313+Welteam@users.noreply.github.com>
2025-10-31 13:21:11 +01:00
Michele Dolfi
cb100437fa docs: Update installation options with extras and review FAQ (#2548)
* revise install docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add more FAQ

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-31 13:21:01 +01:00
Yasir Ali
741c44fa45 docs: fix typos (#2546)
docs: fix typos in enrichments.md ('analize' -> 'analyze', 'consise' -> 'concise')

Signed-off-by: Yasir Ali <engr23002@gmail.com>
2025-10-31 10:29:34 +01:00
Michele Dolfi
a51275d080 fix(pdf): threadsafe for pypdfium2 backend (#2527)
* add threadsafe test

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* test backend

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* test threaded pipeline

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add test_pypdfium_threaded_pipeline

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add more threadsafe blocks

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix threadsafe in pypdfium backend

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove unnecessary tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* restore clean test

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-30 17:58:39 +01:00
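The thread-safety commits above (a51275d080, and "use main lock" in 268d027c8f) describe guarding backend calls with locks. A minimal sketch of that general pattern follows, assuming a single module-level lock serializes every call into a library that is not safe to use concurrently. `_backend_lock`, `Counter`, `bump_threadsafe`, and `run` are hypothetical names, and `Counter` is only a stand-in for the real backend.

```python
import threading

# Hypothetical module-level lock shared by all callers of the unsafe library.
_backend_lock = threading.Lock()


class Counter:
    """Stand-in for a library object that is not safe to call concurrently."""

    def __init__(self) -> None:
        self.value = 0

    def bump(self) -> None:
        # Read-modify-write: two threads interleaving here can lose an update.
        current = self.value
        self.value = current + 1


def bump_threadsafe(counter: Counter) -> None:
    """Serialize access so only one thread is inside the library at a time."""
    with _backend_lock:
        counter.bump()


def run(workers: int = 8, calls: int = 1000) -> int:
    """Call the guarded function from several threads and return the total."""
    counter = Counter()

    def work() -> None:
        for _ in range(calls):
            bump_threadsafe(counter)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value
```

A single coarse lock trades some parallelism for simplicity; whether the actual fix uses one lock or several is not stated in the commit message.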
github-actions[bot]
d27fe92e01 chore: bump version to 2.59.0 [skip ci] v2.59.0 2025-10-30 13:05:56 +00:00
Michele Dolfi
97aa06bfbc docs: Add details and examples on optimal GPU setup (#2531)
* docs for GPU optimizations

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* improve time reporting and improve execution

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix standard pipeline

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* tune examples with batch size 64

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add benchmark results

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* improve docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* typo in excluded tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* explicit pipeline in table

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-30 13:22:05 +01:00
glypt
d9c90eb45e fix: xlsx cell parsing, now returning values instead of formulas (#2520)
* fix: xlsx doc parsing, now returning values instead of formulas

Signed-off-by: glypt <8trash-can8@protonmail.ch>

* fix: add test for better coverage of xlsx backend

Signed-off-by: glypt <8trash-can8@protonmail.ch>

* fix: add the total of ducks as a formula in the tests/data

This also adds the test that the value 310 is contained in the table.
Without the fix from the previous commit, it would return "B7+C7"

Signed-off-by: glypt <8trash-can8@protonmail.ch>

---------

Signed-off-by: glypt <8trash-can8@protonmail.ch>
2025-10-29 11:35:51 +01:00
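Commit d9c90eb45e above distinguishes a cell's formula ("B7+C7") from its computed value (310). In the SpreadsheetML format a formula cell can carry both: an `<f>` element with the formula and a `<v>` element with the cached result. The sketch below reads either one from a hand-written cell fragment using only the standard library. The cell reference `D7`, the sample XML, and the `read_cell` helper are assumptions for illustration; the commit message does not say how the backend implements the fix.

```python
import xml.etree.ElementTree as ET

# Namespace used by worksheet XML inside .xlsx files.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# Hand-written sample cell (assumed reference D7) with a formula and a cached value.
SAMPLE_CELL = (
    '<c xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" r="D7">'
    "<f>B7+C7</f><v>310</v></c>"
)


def read_cell(xml_text, prefer_value=True):
    """Return the cached value of a cell, or its formula text if requested."""
    cell = ET.fromstring(xml_text)
    formula = cell.find("m:f", NS)
    value = cell.find("m:v", NS)
    if prefer_value and value is not None:
        return value.text
    if formula is not None:
        return formula.text
    return None if value is None else value.text
```

Reading with `prefer_value=True` corresponds to the behavior after the fix (values), and `prefer_value=False` to the behavior before it (formulas).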
peets
b6c892b505 feat(vlm): add num_tokens as attribute for VlmPrediction (#2489)
* feat: add num_tokens as attribute for VlmPrediction

* feat: implement tokens tracking for api_vlm

Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>

* DCO Remediation Commit for ElHachem02 <peterelhachem02@gmail.com>

I, ElHachem02 <peterelhachem02@gmail.com>, hereby add my Signed-off-by to this commit: 311287f562

Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>

* DCO Remediation Commit for ElHachem02 <peterelhachem02@gmail.com>

I, ElHachem02 <peterelhachem02@gmail.com>, hereby add my Signed-off-by to this commit: 311287f562

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* update return type

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* add time recorder for vlm inference and track generated token ids depending on config

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* update num_tokens to have None as value on exception

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* set default value of num_tokens to None

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

---------

Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
Signed-off-by: peets <100425207+ElHachem02@users.noreply.github.com>
Co-authored-by: Peter El Hachem <peter.el.hachem@ibm.com>
2025-10-28 17:18:44 +01:00
Michele Dolfi
cdffb47b9a feat: Support for Python 3.14 (#2530)
* fix dependencies for py314

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add metadata and CI tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add back gliner

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update error message about python 3.14 availability

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* skip tests which cannot run on py 3.14

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix lint

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove vllm from py 3.14 deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* safe import for vllm

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove torch.compile()

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update checkbox results after docling-core changes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* cannot run mlx example in CI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add test for rapidocr backends and skip onnxruntime on py3.14

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix other occurrences of torch.compile()

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* allow torch.compile for Python <3.14. proper support will be introduced with new torch releases

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-28 14:32:15 +01:00
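The last step of commit cdffb47b9a above allows `torch.compile` only on Python versions below 3.14. A version gate of that kind might look like the sketch below. `maybe_compile` and its `compile_fn` parameter are hypothetical: `compile_fn` stands in for `torch.compile` so the example runs without torch installed, and the actual condition used in the repository may differ.

```python
import sys


def maybe_compile(model, compile_fn, version_info=sys.version_info):
    """Apply compile_fn to model only on interpreters older than Python 3.14.

    compile_fn is a stand-in for an optimizer such as torch.compile; on newer
    interpreters the model is returned unchanged.
    """
    if tuple(version_info[:2]) >= (3, 14):
        return model
    return compile_fn(model)
```

Passing `version_info` explicitly makes both branches easy to exercise, for example `maybe_compile(model, fn, (3, 13))` versus `maybe_compile(model, fn, (3, 14))`.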
Cesar Berrospi Ramis
9a6fdf936b docs: update opensearch notebook and backend documentation (#2519)
* docs(opensearch): update the example notebook RAG with OpenSearch

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs(uspto): remove direct usage of the backend class for conversion

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: remove direct usage of backends from documentation

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-10-27 10:02:50 +01:00
github-actions[bot]
10c1f06b74 chore: bump version to 2.58.0 [skip ci] v2.58.0 2025-10-22 11:31:29 +00:00
Michele Dolfi
bbe82a68d0 feat(pdf): Support for password-protected PDF documents (#2499)
* add test and example for PDF with password

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use docling-parse with new password feature

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add pdfbackendoptions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* generalize backend_options and add PdfBackendOptions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add pdf-password option

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update exception test

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix docs description

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-22 12:48:01 +02:00
Michele Dolfi
89820d01b5 perf: use docling-parse-v4 as default (#2503)
use docling-parse-v4 as default

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-21 17:55:43 +02:00
McGuireMark
86556d8367 docs: fix typo in mcp.md (#2502)
Update mcp.md

Typo fix

Signed-off-by: McGuireMark <mark.mcguire@nimblegravity.com>
2025-10-21 17:31:28 +02:00
Cesar Berrospi Ramis
4227fcc3e1 fix(markdown): set the correct discriminator in md backend options (#2501)
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-10-21 14:30:48 +02:00
Legoshi
a30e6a7614 feat(backend): add generic options support and HTML image handling modes (#2011)
* feat: add backend options support to document backends

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* feat: enhance document backends with generic backend options and improve HTML image handling

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* Refactor tests for declarativebackend

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(HTML): improve image caption handling and ensure backend options are set correctly

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix: enhance HTML backend image handling and add support for local file paths

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: Add ground truth data for test data

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(HTML): skip loading SVG files in image data handling

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(html): simplify backend options and address gaps

Backend options for DeclarativeDocumentBackend classes and only when necessary.
Refactor caption parsing in 'img' elements and remove dummy text.
Replace deprecated annotations from Typing library with native types.
Replace typing annotations according to pydantic guidelines.
Some documentation with pydantic annotations.
Fix diff issue with test files.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(html): add tests and fix bugs

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(html): refactor backend options

Move backend option classes to its own module within datamodel package.
Rename 'source_location' with 'source_uri' in HTMLBackendOptions.
Rename 'image_fetch' with 'fetch_images' in HTMLBackendOptions.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(markdown): create a class for the markdown backend options

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-10-21 12:52:17 +02:00
Richard (Huangrui) Chu
b66624bfff fix(xlsx): speed up by detecting the true last non-empty row/column (#2404)
* Update msexcel_backend.py

Fix #2307, Follow the instruction of https://github.com/docling-project/docling/issues/2307#issuecomment-3327248503.

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>

* Update msexcel_backend.py

Fix error

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>

* Fix linting issues

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>

* Add test files and data (Signed-off-by: Huangrui Chu <huangrui.chu.1999@gmail.com>)

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>

* resolve conflict with test_backend_msexcel; update the boundary

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>

* chore(xlsx): use a dataclass to represent a bounding rectangle in worksheets

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(xlsx): increase parsing speed by iterating on 'sheet._cells'

Increase the parsing speed of the spreadsheet backend by iterating on 'sheets._cells'
since this is proportional to the number of created cells.
Rename test file to align it to other test files.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-10-21 08:08:20 +02:00
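Commit b66624bfff above speeds up spreadsheet parsing by finding the true last non-empty row and column, representing the bounds as a dataclass, and iterating only over cells that were actually created. The sketch below shows that idea on a plain dictionary keyed by `(row, column)`. `DataRegion` and `find_data_region` are hypothetical names, and the dictionary is a stand-in for a worksheet's cell store rather than the openpyxl object itself.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class DataRegion:
    """Bounding rectangle of the non-empty cells (1-based, inclusive)."""

    min_row: int
    max_row: int
    min_col: int
    max_col: int


def find_data_region(cells: Dict[Tuple[int, int], object]) -> Optional[DataRegion]:
    """Compute the bounds by visiting only the cells that exist.

    The cost is proportional to the number of created cells, not to the
    nominal sheet dimensions, which may include long runs of empty rows.
    """
    filled = [key for key, value in cells.items() if value not in (None, "")]
    if not filled:
        return None
    rows = [row for row, _ in filled]
    cols = [col for _, col in filled]
    return DataRegion(min(rows), max(rows), min(cols), max(cols))
```

A sheet with a stray formatted but empty cell far down the page then no longer stretches the region that has to be scanned.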
Ken Steele
657ce8b01c feat(ASR): MLX Whisper Support for Apple Silicon (#2366)
* add mlx-whisper support

* added mlx-whisper example and test. update docling cli to use MLX automatically if present.

* fix pre-commit checks and added proper type safety

* fixed linter issue

* DCO Remediation Commit for Ken Steele <ksteele@gmail.com>

I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: a979a680e1dc2fee8461401335cfb5dda8cfdd98
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 9827068382ca946fe1387ed83f747ae509fcf229
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: ebbeb45c7dc266260e1fad6bdb54a7041f8aeed4
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 2f6fd3cf46c8ca0bb98810191578278f1df87aa3

Signed-off-by: Ken Steele <ksteele@gmail.com>

* fix unit tests and code coverage for CI

* DCO Remediation Commit for Ken Steele <ksteele@gmail.com>

I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 5e61bf11139a2133978db2c8d306be6289aed732

Signed-off-by: Ken Steele <ksteele@gmail.com>

* fix CI example test - mlx_whisper_example.py defaults to tests/data/audio/sample_10s.mp3 if no args specified.

Signed-off-by: Ken Steele <ksteele@gmail.com>

* refactor: centralize audio file extensions and MIME types in base_models.py

- Move audio file extensions from CLI hardcoded set to FormatToExtensions[InputFormat.AUDIO]
- Add support for additional audio formats: m4a, aac, ogg, flac, mp4, avi, mov
- Update FormatToMimeType mapping to include MIME types for all audio formats
- Update CLI auto-detection to use centralized FormatToExtensions mapping
- Add comprehensive tests for audio file auto-detection and pipeline selection
- Ensure explicit pipeline choices are not overridden by auto-detection

Fixes issue where only .mp3 and .wav files were processed as audio despite
CLI auto-detection working for all formats. The document converter now
properly recognizes all audio formats through MIME type detection.

Addresses review comments:
- Centralizes audio extensions in base_models.py as suggested
- Maintains existing auto-detection behavior while using centralized data
- Adds proper test coverage for the audio detection functionality

All examples and tests pass with the new centralized approach.
All audio formats (mp3, wav, m4a, aac, ogg, flac, mp4, avi, mov) now work correctly.

Signed-off-by: Ken Steele <ksteele@gmail.com>

* feat: address reviewer feedback - improve CLI auto-detection and add explicit model options

Review feedback addressed:
1. Fix CLI auto-detection to only switch to ASR pipeline when ALL files are audio
   - Previously switched if ANY file was audio, now requires ALL files to be audio
   - Added warning for mixed file types with guidance to use --pipeline asr

2. Add explicit WHISPER_X_MLX and WHISPER_X_NATIVE model options
   - Users can now force specific implementations if desired
   - Auto-selecting models (WHISPER_BASE, etc.) still choose best for hardware
   - Added 12 new explicit model options: _MLX and _NATIVE variants for each size

CLI now supports:
- Auto-selecting: whisper_tiny, whisper_base, etc. (choose best for hardware)
- Explicit MLX: whisper_tiny_mlx, whisper_base_mlx, etc. (force MLX)
- Explicit Native: whisper_tiny_native, whisper_base_native, etc. (force native)

Addresses reviewer comments from @dolfim-ibm

Signed-off-by: Ken Steele <ksteele@gmail.com>

* DCO Remediation Commit for Ken Steele <ksteele@gmail.com>

I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: c60e72d2b5
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 94803317a3
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 21905e8ace
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 96c669d155
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 8371c060ea

Signed-off-by: Ken Steele <ksteele@gmail.com>

* test(asr): add coverage for MLX options, pipeline helpers, and VLM prompts

- tests/test_asr_mlx_whisper.py: verify explicit MLX options (framework, repo ids)
- tests/test_asr_pipeline.py: cover _has_text/_determine_status and backend support with proper InputDocument/NoOpBackend wiring
- tests/test_interfaces.py: add BaseVlmPageModel.formulate_prompt tests (RAW/NONE/CHAT, invalid style), with minimal InlineVlmOptions scaffold

Improves reliability of ASR and VLM components by validating configuration paths and helper logic.

Signed-off-by: Ken Steele <ksteele@gmail.com>

* test(asr): broaden coverage for model selection, pipeline flows, and VLM prompts

- tests/test_asr_mlx_whisper.py
  - Add MLX/native selector coverage across all Whisper sizes
  - Validate repo_id choices under MLX and Native paths
  - Cover fallback path when MPS unavailable and mlx_whisper missing

- tests/test_asr_pipeline.py
  - Relax silent-audio assertion to accept PARTIAL_SUCCESS or SUCCESS
  - Force CPU native path in helper tests to avoid torch in device selection
  - Add language handling tests for native/MLX transcribe
  - Cover native run success (BytesIO) and failure (exception) branches
  - Cover MLX run success/failure branches with mocked transcribe
  - Add init path coverage with artifacts_path

- tests/test_interfaces.py
  - Add focused VLM prompt tests (NONE/CHAT variants)

Result: all tests passing with significantly improved coverage for ASR model selectors, pipeline execution paths, and VLM prompt formulation.

Signed-off-by: Ken Steele <ksteele@gmail.com>

* simplify ASR model settings (no pipeline detection needed)

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* clean up disk space in runners

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Ken Steele <ksteele@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-21 08:05:59 +02:00
Michele Dolfi
a5af082d82 chore: fix parsing of release body message (#2498)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-20 13:41:35 +02:00
Michele Dolfi
5be856fbc0 chore: add action posting to discord (#2486)
* add action posting to discord

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* test on push

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* with icon

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove testing

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-17 16:31:57 +02:00
Michele Dolfi
dd03b53117 docs: discord badge with join link (#2473)
* add discord link

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Add Discord link to social section in mkdocs.yml

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* Add Discord link to getting started documentation

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-10-16 10:13:50 +02:00
Michele Dolfi
1762bb8762 chore: update lock (#2468)
update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-15 20:35:49 +02:00
github-actions[bot]
ae61d640c1 chore: bump version to 2.57.0 [skip ci] v2.57.0 2025-10-15 09:20:31 +00:00
Rafael Teixeira de Lima
16829939cf feat(docx): Process drawingml objects in docx (#2453)
* Export of DrawingML figures into docling document

* Adding libreoffice env var and libreoffice to checks image

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* DCO Remediation Commit for Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

I, Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>, hereby add my Signed-off-by to this commit: 9518fffcad

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Enforcing apt-get update

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Only display drawingml warning once per document

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* add util to test libreoffice and exclude files from test when not found

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* check libreoffice only once

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Only initialise converter if needed

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-15 10:58:08 +02:00
Peter W. J. Staar
3e6da2c62d docs: Example on PII obfuscation (#2459)
* added example on PII obfuscation

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatting code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* add in index and fix heading formatting

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add GLINER to PII

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* final commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-14 15:39:16 +02:00