Commit Graph

774 Commits

Author SHA1 Message Date
Michele Dolfi
268d027c8f feat: Use threading in the standard pipeline and move old behavior to legacy (#2452)
* rename standard to legacy

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove old standard pipeline

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* move threaded to standard

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add backwards compatible threaded pipeline

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Updates for threaded pipeline to lower memory requirements

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* updating deps seems to remove the corrupted double-linked list error

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update pinning

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use main lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add more threadsafe blocks

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename batch_timeout_seconds

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-10-31 14:42:11 +01:00
Welteam
01577e92d1 docs: Update link to Open WebUI docs (#2549)
Fix dead link to Open WebUI docs

Signed-off-by: Welteam <8932313+Welteam@users.noreply.github.com>
2025-10-31 13:21:11 +01:00
Michele Dolfi
cb100437fa docs: Update installation options with extras and review FAQ (#2548)
* revise install docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add more FAQ

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-31 13:21:01 +01:00
Yasir Ali
741c44fa45 docs: fix typos (#2546)
docs: fix typos in enrichments.md ('analize' -> 'analyze', 'consise' -> 'concise')

Signed-off-by: Yasir Ali <engr23002@gmail.com>
2025-10-31 10:29:34 +01:00
Michele Dolfi
a51275d080 fix(pdf): threadsafe for pypdfium2 backend (#2527)
* add threadsafe test

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* test backend

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* test threaded pipeline

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add test_pypdfium_threaded_pipeline

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add more threadsafe blocks

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix threadsafe in pypdfium backend

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove unnecessary tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* restore clean test

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-30 17:58:39 +01:00
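The "threadsafe blocks" mentioned in a51275d080 can be pictured as a shared lock around every call into a parser that is not thread-safe. The following is only a minimal sketch of that general pattern, not the code from the commit: `FakePdfBackend`, `render_threadsafe`, `render_all`, and `_backend_lock` are hypothetical names, and the fake backend merely records how many calls overlap.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical module-level lock shared by every call into the parser.
_backend_lock = threading.Lock()


class FakePdfBackend:
    """Stand-in for a parser that is not thread-safe (not a real pypdfium2 class)."""

    def __init__(self):
        self.active_calls = 0
        self.max_concurrent = 0

    def render_page(self, page_no):
        # Track how many calls are inside the backend at the same time.
        self.active_calls += 1
        self.max_concurrent = max(self.max_concurrent, self.active_calls)
        result = f"page-{page_no}"
        self.active_calls -= 1
        return result


def render_threadsafe(backend, page_no):
    # Only one thread at a time may enter the backend.
    with _backend_lock:
        return backend.render_page(page_no)


def render_all(backend, page_numbers, workers=4):
    # pool.map returns results in input order, even with several workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda n: render_threadsafe(backend, n), page_numbers))
```

A single coarse lock serializes the parser calls while the rest of the pipeline can still run in parallel; finer-grained locking would be a separate design decision.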
github-actions[bot]
d27fe92e01 chore: bump version to 2.59.0 [skip ci] v2.59.0 2025-10-30 13:05:56 +00:00
Michele Dolfi
97aa06bfbc docs: Add details and examples on optimal GPU setup (#2531)
* docs for GPU optimizations

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* improve time reporting and improve execution

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix standard pipeline

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* tune examples with batch size 64

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add benchmark results

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* improve docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* typo in excluded tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* explicit pipeline in table

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-30 13:22:05 +01:00
glypt
d9c90eb45e fix: xlsx cell parsing, now returning values instead of formulas (#2520)
* fix: xlsx doc parsing, now returning values instead of formulas

Signed-off-by: glypt <8trash-can8@protonmail.ch>

* fix: add test for better coverage of xlsx backend

Signed-off-by: glypt <8trash-can8@protonmail.ch>

* fix: add the total of ducks as a formula in the tests/data

This also adds the test that the value 310 is contained in the table.
Without the fix from the previous commit, it would return "B7+C7"

Signed-off-by: glypt <8trash-can8@protonmail.ch>

---------

Signed-off-by: glypt <8trash-can8@protonmail.ch>
2025-10-29 11:35:51 +01:00
peets
b6c892b505 feat(vlm): add num_tokens as attribute for VlmPrediction (#2489)
* feat: add num_tokens as attribute for VlmPrediction

* feat: implement tokens tracking for api_vlm

Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>

* DCO Remediation Commit for ElHachem02 <peterelhachem02@gmail.com>

I, ElHachem02 <peterelhachem02@gmail.com>, hereby add my Signed-off-by to this commit: 311287f562

Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>

* DCO Remediation Commit for ElHachem02 <peterelhachem02@gmail.com>

I, ElHachem02 <peterelhachem02@gmail.com>, hereby add my Signed-off-by to this commit: 311287f562

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* update return type

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* add time recorder for vlm inference and track generated token ids depending on config

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* update num_tokens to have None as value on exception

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

* set default value of num_tokens to None

Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>

---------

Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
Signed-off-by: peets <100425207+ElHachem02@users.noreply.github.com>
Co-authored-by: Peter El Hachem <peter.el.hachem@ibm.com>
2025-10-28 17:18:44 +01:00
Michele Dolfi
cdffb47b9a feat: Support for Python 3.14 (#2530)
* fix dependencies for py314

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add metadata and CI tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add back gliner

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update error message about python 3.14 availability

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* skip tests which cannot run on py 3.14

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix lint

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove vllm from py 3.14 deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* safe import for vllm

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove torch.compile()

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update checkbox results after docling-core changes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* cannot run mlx example in CI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add test for rapidocr backends and skip onnxruntime on py3.14

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix other occurrences of torch.compile()

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* allow torch.compile for Python <3.14. proper support will be introduced with new torch releases

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-28 14:32:15 +01:00
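The last bullet of cdffb47b9a describes allowing torch.compile only for Python versions below 3.14. A version gate of that kind might look like the sketch below. It is an illustration under assumptions: `maybe_compile` is a hypothetical helper, and `compile_fn` stands in for whatever compile function the caller passes (for example `torch.compile`), so the snippet itself does not import torch.

```python
import sys


def maybe_compile(model, compile_fn, version_info=None):
    """Apply compile_fn only on interpreters older than Python 3.14.

    version_info defaults to the running interpreter; it is a parameter
    only so that the gate can be exercised without switching Python versions.
    """
    if version_info is None:
        version_info = sys.version_info
    if version_info < (3, 14):
        return compile_fn(model)
    # On 3.14 and later, return the model unchanged (eager execution).
    return model
```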
Cesar Berrospi Ramis
9a6fdf936b docs: update opensearch notebook and backend documentation (#2519)
* docs(opensearch): update the example notebook RAG with OpenSearch

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs(uspto): remove direct usage of the backend class for conversion

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: remove direct usage of backends from documentation

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-10-27 10:02:50 +01:00
github-actions[bot]
10c1f06b74 chore: bump version to 2.58.0 [skip ci] v2.58.0 2025-10-22 11:31:29 +00:00
Michele Dolfi
bbe82a68d0 feat(pdf): Support for password-protected PDF documents (#2499)
* add test and example for PDF with password

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use docling-parse with new password feature

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add pdfbackendoptions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* generalize backend_options and add PdfBackendOptions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add pdf-password option

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update exception test

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix docs description

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-22 12:48:01 +02:00
Michele Dolfi
89820d01b5 perf: use docling-parse-v4 as default (#2503)
use docling-parse-v4 as default

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-21 17:55:43 +02:00
McGuireMark
86556d8367 docs: fix typo in mcp.md (#2502)
Update mcp.md

Typo fix

Signed-off-by: McGuireMark <mark.mcguire@nimblegravity.com>
2025-10-21 17:31:28 +02:00
Cesar Berrospi Ramis
4227fcc3e1 fix(markdown): set the correct discriminator in md backend options (#2501)
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-10-21 14:30:48 +02:00
Legoshi
a30e6a7614 feat(backend): add generic options support and HTML image handling modes (#2011)
* feat: add backend options support to document backends

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* feat: enhance document backends with generic backend options and improve HTML image handling

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* Refactor tests for declarativebackend

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(HTML): improve image caption handling and ensure backend options are set correctly

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix: enhance HTML backend image handling and add support for local file paths

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: Add ground truth data for test data

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(HTML): skip loading SVG files in image data handling

Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(html): simplify backend options and address gaps

Backend options for DeclarativeDocumentBackend classes and only when necessary.
Refactor caption parsing in 'img' elements and remove dummy text.
Replace deprecated annotations from Typing library with native types.
Replace typing annotations according to pydantic guidelines.
Some documentation with pydantic annotations.
Fix diff issue with test files.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(html): add tests and fix bugs

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(html): refactor backend options

Move backend option classes to their own module within datamodel package.
Rename 'source_location' with 'source_uri' in HTMLBackendOptions.
Rename 'image_fetch' with 'fetch_images' in HTMLBackendOptions.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* refactor(markdown): create a class for the markdown backend options

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-10-21 12:52:17 +02:00
Richard (Huangrui) Chu
b66624bfff fix(xlsx): speed up by detecting the true last non-empty row/column (#2404)
* Update msexcel_backend.py

Fix #2307, Follow the instruction of https://github.com/docling-project/docling/issues/2307#issuecomment-3327248503.

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>

* Update msexcel_backend.py

Fix error

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>

* Fix linting issues

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>

* Add test files and data (Signed-off-by: Huangrui Chu <huangrui.chu.1999@gmail.com>)

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>

* resolve conflict with test_backend_msexecl; update the boundary

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>

* chore(xlsx): use a dataclass to represent a bounding rectangle in worksheets

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(xlsx): increase parsing speed by iterating on 'sheet._cells'

Increase the parsing speed of the spreadsheet backend by iterating on 'sheets._cells'
since this is proportional to the number of created cells.
Rename test file to align it to other test files.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-10-21 08:08:20 +02:00
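The idea in b66624bfff, finding the true last non-empty row and column by looking only at cells that were actually created, can be sketched without any spreadsheet library. This is an assumption-laden illustration rather than the backend's code: a plain dict keyed by `(row, column)` stands in for the worksheet's cell store, and `DataRegion` and `find_data_region` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRegion:
    """Hypothetical bounding rectangle (1-based, inclusive) of the non-empty cells."""

    min_row: int
    max_row: int
    min_col: int
    max_col: int


def find_data_region(cells):
    """Return the bounding rectangle of non-empty cells, or None if there are none.

    cells maps (row, col) to a value; None and the empty string count as empty.
    The work is proportional to the number of stored cells, not to the
    nominal sheet dimensions.
    """
    non_empty = [pos for pos, value in cells.items() if value not in (None, "")]
    if not non_empty:
        return None
    rows = [row for row, _ in non_empty]
    cols = [col for _, col in non_empty]
    return DataRegion(min(rows), max(rows), min(cols), max(cols))
```

In this sketch, a sheet whose formatting extends to row 1000 but whose data ends at row 2 yields a region ending at row 2, because the empty cell at row 1000 is ignored.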
Ken Steele
657ce8b01c feat(ASR): MLX Whisper Support for Apple Silicon (#2366)
* add mlx-whisper support

* added mlx-whisper example and test. update docling cli to use MLX automatically if present.

* fix pre-commit checks and added proper type safety

* fixed linter issue

* DCO Remediation Commit for Ken Steele <ksteele@gmail.com>

I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: a979a680e1dc2fee8461401335cfb5dda8cfdd98
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 9827068382ca946fe1387ed83f747ae509fcf229
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: ebbeb45c7dc266260e1fad6bdb54a7041f8aeed4
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 2f6fd3cf46c8ca0bb98810191578278f1df87aa3

Signed-off-by: Ken Steele <ksteele@gmail.com>

* fix unit tests and code coverage for CI

* DCO Remediation Commit for Ken Steele <ksteele@gmail.com>

I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 5e61bf11139a2133978db2c8d306be6289aed732

Signed-off-by: Ken Steele <ksteele@gmail.com>

* fix CI example test - mlx_whisper_example.py defaults to tests/data/audio/sample_10s.mp3 if no args specified.

Signed-off-by: Ken Steele <ksteele@gmail.com>

* refactor: centralize audio file extensions and MIME types in base_models.py

- Move audio file extensions from CLI hardcoded set to FormatToExtensions[InputFormat.AUDIO]
- Add support for additional audio formats: m4a, aac, ogg, flac, mp4, avi, mov
- Update FormatToMimeType mapping to include MIME types for all audio formats
- Update CLI auto-detection to use centralized FormatToExtensions mapping
- Add comprehensive tests for audio file auto-detection and pipeline selection
- Ensure explicit pipeline choices are not overridden by auto-detection

Fixes issue where only .mp3 and .wav files were processed as audio despite
CLI auto-detection working for all formats. The document converter now
properly recognizes all audio formats through MIME type detection.

Addresses review comments:
- Centralizes audio extensions in base_models.py as suggested
- Maintains existing auto-detection behavior while using centralized data
- Adds proper test coverage for the audio detection functionality

All examples and tests pass with the new centralized approach.
All audio formats (mp3, wav, m4a, aac, ogg, flac, mp4, avi, mov) now work correctly.

Signed-off-by: Ken Steele <ksteele@gmail.com>

* feat: address reviewer feedback - improve CLI auto-detection and add explicit model options

Review feedback addressed:
1. Fix CLI auto-detection to only switch to ASR pipeline when ALL files are audio
   - Previously switched if ANY file was audio, now requires ALL files to be audio
   - Added warning for mixed file types with guidance to use --pipeline asr

2. Add explicit WHISPER_X_MLX and WHISPER_X_NATIVE model options
   - Users can now force specific implementations if desired
   - Auto-selecting models (WHISPER_BASE, etc.) still choose best for hardware
   - Added 12 new explicit model options: _MLX and _NATIVE variants for each size

CLI now supports:
- Auto-selecting: whisper_tiny, whisper_base, etc. (choose best for hardware)
- Explicit MLX: whisper_tiny_mlx, whisper_base_mlx, etc. (force MLX)
- Explicit Native: whisper_tiny_native, whisper_base_native, etc. (force native)

Addresses reviewer comments from @dolfim-ibm

Signed-off-by: Ken Steele <ksteele@gmail.com>

* DCO Remediation Commit for Ken Steele <ksteele@gmail.com>

I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: c60e72d2b5
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 94803317a3
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 21905e8ace
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 96c669d155
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 8371c060ea

Signed-off-by: Ken Steele <ksteele@gmail.com>

* test(asr): add coverage for MLX options, pipeline helpers, and VLM prompts

- tests/test_asr_mlx_whisper.py: verify explicit MLX options (framework, repo ids)
- tests/test_asr_pipeline.py: cover _has_text/_determine_status and backend support with proper InputDocument/NoOpBackend wiring
- tests/test_interfaces.py: add BaseVlmPageModel.formulate_prompt tests (RAW/NONE/CHAT, invalid style), with minimal InlineVlmOptions scaffold

Improves reliability of ASR and VLM components by validating configuration paths and helper logic.

Signed-off-by: Ken Steele <ksteele@gmail.com>

* test(asr): broaden coverage for model selection, pipeline flows, and VLM prompts

- tests/test_asr_mlx_whisper.py
  - Add MLX/native selector coverage across all Whisper sizes
  - Validate repo_id choices under MLX and Native paths
  - Cover fallback path when MPS unavailable and mlx_whisper missing

- tests/test_asr_pipeline.py
  - Relax silent-audio assertion to accept PARTIAL_SUCCESS or SUCCESS
  - Force CPU native path in helper tests to avoid torch in device selection
  - Add language handling tests for native/MLX transcribe
  - Cover native run success (BytesIO) and failure (exception) branches
  - Cover MLX run success/failure branches with mocked transcribe
  - Add init path coverage with artifacts_path

- tests/test_interfaces.py
  - Add focused VLM prompt tests (NONE/CHAT variants)

Result: all tests passing with significantly improved coverage for ASR model selectors, pipeline execution paths, and VLM prompt formulation.

Signed-off-by: Ken Steele <ksteele@gmail.com>

* simplify ASR model settings (no pipeline detection needed)

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* clean up disk space in runners

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Ken Steele <ksteele@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-21 08:05:59 +02:00
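The auto-detection rule described in 657ce8b01c (switch to the ASR pipeline only when all inputs are audio, warn on mixed inputs, never override an explicit choice) can be sketched as follows. This is not the CLI's implementation: `AUDIO_EXTENSIONS`, `is_audio`, and `choose_pipeline` are hypothetical names, the extension set is copied from the commit message, and `"standard"` as the fallback pipeline name is an assumption.

```python
from pathlib import Path

# Extensions listed in the commit message; per that message, the real mapping
# is centralized in base_models.py.
AUDIO_EXTENSIONS = {"mp3", "wav", "m4a", "aac", "ogg", "flac", "mp4", "avi", "mov"}


def is_audio(path):
    """Case-insensitive check of the file extension against the audio set."""
    return Path(path).suffix.lower().lstrip(".") in AUDIO_EXTENSIONS


def choose_pipeline(paths, explicit=None, default="standard"):
    """Return (pipeline_name, warning_or_None) for a list of input paths."""
    if explicit is not None:
        # An explicit pipeline choice is never overridden by auto-detection.
        return explicit, None
    flags = [is_audio(p) for p in paths]
    if flags and all(flags):
        return "asr", None
    if any(flags):
        # Mixed inputs: keep the default and tell the user how to force ASR.
        return default, "Mixed file types detected; use --pipeline asr to force ASR."
    return default, None
```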
Michele Dolfi
a5af082d82 chore: fix parsing of release body message (#2498)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-20 13:41:35 +02:00
Michele Dolfi
5be856fbc0 chore: add action posting to discord (#2486)
* add action posting to discord

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* test on push

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* with icon

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove testing

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-17 16:31:57 +02:00
Michele Dolfi
dd03b53117 docs: discord badge with join link (#2473)
* add discord link

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Add Discord link to social section in mkdocs.yml

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* Add Discord link to getting started documentation

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-10-16 10:13:50 +02:00
Michele Dolfi
1762bb8762 chore: update lock (#2468)
update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-15 20:35:49 +02:00
github-actions[bot]
ae61d640c1 chore: bump version to 2.57.0 [skip ci] v2.57.0 2025-10-15 09:20:31 +00:00
Rafael Teixeira de Lima
16829939cf feat(docx): Process drawingml objects in docx (#2453)
* Export of DrawingML figures into docling document

* Adding libreoffice env var and libreoffice to checks image

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* DCO Remediation Commit for Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

I, Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>, hereby add my Signed-off-by to this commit: 9518fffcad

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Enforcing apt get update

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Only display drawingml warning once per document

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* add util to test libreoffice and exclude files from test when not found

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* check libreoffice only once

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Only initialise converter if needed

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-15 10:58:08 +02:00
Peter W. J. Staar
3e6da2c62d docs: Example on PII obfuscation (#2459)
* added example on PII obfuscation

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatting code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* add in index and fix heading formatting

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add GLINER to PII

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* final commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-14 15:39:16 +02:00
Christoph Auer
cd7f7ba145 fix: Use proper page concatenation in VLM pipeline MD/HTML conversion (#2458)
* Use proper page concatenation in VLM pipeline MD/HTML conversion

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-10-14 14:12:26 +02:00
github-actions[bot]
3687d865f8 chore: bump version to 2.56.1 [skip ci] v2.56.1 2025-10-13 16:30:04 +00:00
Michele Dolfi
688a7dfd38 fix: avoid downloading easyocr models by default (#2454)
avoid downloading easyocr models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-13 17:58:06 +02:00
github-actions[bot]
10165dda8a chore: bump version to 2.56.0 [skip ci] v2.56.0 2025-10-13 09:19:06 +00:00
Animesh
db985bb159 fix(asr): Implement robust status check in AsrPipeline (#2442)
* test: Add failing test case for silent audio file

* fix: Implement robust status check in AsrPipeline

* DCO Remediation Commit for mastermaxx03 <srivastavaanimesh22@gmail.com>

I, mastermaxx03 <srivastavaanimesh22@gmail.com>, hereby add my Signed-off-by to this commit: 5fc4d512b330bb0cd347da4cbcca0fbe9687898a
I, mastermaxx03 <srivastavaanimesh22@gmail.com>, hereby add my Signed-off-by to this commit: 31a4e9a5f1

Signed-off-by: mastermaxx03 <srivastavaanimesh22@gmail.com>

* DCO Remediation Commit for mastermaxx03 <srivastavaanimesh22@gmail.com>

I, mastermaxx03 <srivastavaanimesh22@gmail.com>, hereby add my Signed-off-by to this commit: 5fc4d512b3
I, mastermaxx03 <srivastavaanimesh22@gmail.com>, hereby add my Signed-off-by to this commit: 31a4e9a5f1

Signed-off-by: mastermaxx03 <srivastavaanimesh22@gmail.com>

* DCO Remediation Commit for mastermaxx03 <srivastavaanimesh22@gmail.com>

I, mastermaxx03 <srivastavaanimesh22@gmail.com>, hereby add my Signed-off-by to this commit: 5fc4d512b3
I, mastermaxx03 <srivastavaanimesh22@gmail.com>, hereby add my Signed-off-by to this commit: 31a4e9a5f1

Signed-off-by: mastermaxx03 <srivastavaanimesh22@gmail.com>

---------

Signed-off-by: mastermaxx03 <srivastavaanimesh22@gmail.com>
2025-10-13 09:51:31 +02:00
Jeremy Chen
90200443bc docs: Remove deprecated call in custom_convert.py (#2447)
Update custom_convert.py

export_to_document_tokens is deprecated so change it to export_to_doctags

Signed-off-by: Jeremy Chen <github@jeremychen.email>
2025-10-13 09:30:02 +02:00
Imad Saddik
2a0f56390a docs: fixed a few typos (#2441)
Signed-off-by: Imad Saddik <79410781+ImadSaddik@users.noreply.github.com>
2025-10-13 09:04:50 +02:00
Michele Dolfi
f7244a4333 feat: AutoOCR model selecting the best OCR model available and deprecating the usage of EasyOCR (#2391)
* add auto ocr model

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Apply suggestions from code review

Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

* add final log warning

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* propagate default options

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* allow rapidocr models download

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove modelscope

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2025-10-10 16:11:39 +02:00
Cesar Berrospi Ramis
cce18b2ff7 fix: deal with chartsheets in workbooks (#2433)
* fix(xlsx): deal with chartsheets in workbooks

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(xlsx): align test file names

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-10-10 15:06:38 +02:00
Bruno Pio
f11f8c0a81 feat: Add Tesseract PSM options support (#2411)
* feat: Add Tesseract PSM options support

Signed-off-by: Bruno Pio <913963+blap@users.noreply.github.com>

* apply formatting

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add tesseract_cli in checks

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Bruno Pio <913963+blap@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-10 14:44:30 +02:00
Victor Moreli
ee5501320e fix: skip temporary docx files (#2413)
fix: CLI detects docx temporary files and breaks

Signed-off-by: Victor Moreli <victormoreli64@gmail.com>
2025-10-10 09:39:26 +02:00
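One plausible way to skip the temporary docx files mentioned in ee5501320e is to filter on file name before conversion. The sketch below assumes the temporary files in question are the owner/lock files that Word writes next to an open document, which are named with a `~$` prefix; the commit message does not say how the fix detects them, and `is_office_temp_file` and `filter_inputs` are hypothetical helpers.

```python
from pathlib import Path


def is_office_temp_file(path):
    """True for Word-style owner/lock files such as '~$report.docx' (assumed pattern)."""
    return Path(path).name.startswith("~$")


def filter_inputs(paths):
    """Drop temporary files from the inputs, preserving the original order."""
    return [p for p in paths if not is_office_temp_file(p)]
```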
pixiake
b5f7fef29b fix: AsrPipeline to handle absolute paths and BytesIO streams correctly (#2407)
Fix AsrPipeline to handle absolute paths and BytesIO streams correctly

Signed-off-by: pixiake <guofeng@spader-ai.com>
Co-authored-by: pixiake <guofeng@spader-ai.com>
2025-10-10 09:37:15 +02:00
Utsav Talwar
f2854b2e1d docs: Add MongoDB + VoyageAI (#2382)
Signed-off-by: Utsav Talwar <114057324+utsavMongoDB@users.noreply.github.com>
Co-authored-by: Utsav Talwar <114057324+utsavMongoDB@users.noreply.github.com>
2025-10-07 14:36:19 -04:00
Michele Dolfi
0610d01afa fix: enrichment of documents without pages metadata (pptx and xlsx) (#2401)
fix logic for pptx and xlsx

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-07 18:28:51 +02:00
Maxim Lysak
9705f4020c fix: Proper heading support in rich tables for HTML backend (#2394)
* Fix for the proper headers support in rich tables in HTML

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* cleaning up

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Compatibility with older Python versions

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixing Furniture before the first heading rule

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added minimalistic test case

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* added html for the test

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-10-07 15:57:32 +02:00
Utsav Talwar
8a4b946a1a docs: add RAG example with MongoDB Atlas Vector Search and VoyageAI embeddings (#2341)
* Add MongoDB RAG example

* Update MongoDB RAG Example

* Update MongoDB RAG Example

* Update MongoDB RAG Example

* DCO Remediation Commit for utsavMongoDB <utsav.talwar@mongodb.com>

I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: fbdbf53aa8
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: 9b3065ba2b
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: 1983f9db35
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: 0522aa105d
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: f5a67e8012

Signed-off-by: utsavMongoDB <utsav.talwar@mongodb.com>

* DCO Remediation Commit for utsavMongoDB <utsav.talwar@mongodb.com>

I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: fbdbf53aa8
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: 9b3065ba2b
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: 1983f9db35
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: 0522aa105d
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: f5a67e8012

Signed-off-by: utsavMongoDB <utsav.talwar@mongodb.com>

* docs: Add example with MongoDB

* DCO Remediation Commit for utsavMongoDB <utsav.talwar@mongodb.com>

I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: bb245a31ed
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: 25436e543c

Signed-off-by: utsavMongoDB <utsav.talwar@mongodb.com>

* DCO Remediation Commit for utsavMongoDB <utsav.talwar@mongodb.com>

I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: bb245a31ed
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: 25436e543c

Signed-off-by: utsavMongoDB <utsav.talwar@mongodb.com>

* DCO Remediation Commit for utsavMongoDB <utsav.talwar@mongodb.com>

I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: bb245a31ed
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: 25436e543c

Signed-off-by: utsavMongoDB <utsav.talwar@mongodb.com>

---------

Signed-off-by: utsavMongoDB <utsav.talwar@mongodb.com>
Signed-off-by: Utsav Talwar <114057324+utsavMongoDB@users.noreply.github.com>
2025-10-03 13:29:43 +02:00
github-actions[bot]
22515b546a chore: bump version to 2.55.1 [skip ci] v2.55.1 2025-10-03 10:26:26 +00:00
Rui Dias Gomes
68230fe7e5 ci: split workflow to speedup CI runtime (#2313)
* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* enable test_e2e_pdfs_conversions

Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Rui Dias Gomes <66125272+rmdg88@users.noreply.github.com>

* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* fix conflict files

Signed-off-by: rmdg88 <rmdg88@gmail.com>

---------

Signed-off-by: rmdg88 <rmdg88@gmail.com>
Signed-off-by: Rui Dias Gomes <66125272+rmdg88@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-03 11:16:38 +02:00
Matvei Smirnov
ee73ffae15 fix(markdown): Setext heading support (#2359)
Signed-off-by: Matvei Smirnov <vdalekesmirnov@gmail.com>
Co-authored-by: Matvei Smirnov <matvei.smirnov@vkteam.ru>
2025-10-03 10:32:53 +02:00
Hakeem Abbas
246de77d8c fix(docs): fixed the color scheme (#2371)
* fix(docs): fixed the color scheme

Signed-off-by: Hakeem Abbas <hakeemsyd@gmail.com>

* fix(docs): colors background

Signed-off-by: Hakeem Abbas <hakeemsyd@gmail.com>

---------

Signed-off-by: Hakeem Abbas <hakeemsyd@gmail.com>
2025-10-03 10:20:44 +02:00
Michele Dolfi
a975a790c9 docs: example using Hashicorp Vault PII transform (#2373)
docs: add example using Hashicorp Vault PII transform

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-03 09:53:29 +02:00
Michele Dolfi
9505202e38 ci: update docling-parse and remove pages.json (#2372)
* update docling-parse and remove pages.json

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* ocr gt

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-03 09:53:13 +02:00
Christoph Auer
ca2be7ff3a fix: Empty table handling (#2365)
* add table raw cells when no table structure model was used

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Add RichTableCell instance for tables with missing structure.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* update test results

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-02 19:35:16 +02:00
Lucas Morin
e6c3b05e63 docs: Jobkit and connectors (#2357)
* feat: create documentation for docling-jobkit

Signed-off-by: Lucas Morin <lucas.morin222@gmail.com>

* small text fixes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Lucas Morin <lucas.morin222@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-02 13:46:56 +02:00