* feat: add enum StopReason and use it in VlmPrediction
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
* add vlm_inference time for api calls and track stop reason
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
* fix: rename enum to VlmStopReason
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
* Propagate partial success status if page reaches max tokens
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
* feat: page with generation stopped by loop detector creates partial success status
Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
* Add hint for future improvement
Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
* fix: remove vlm_stop_reason from extracted page data, add UNSPECIFIED state as VlmStopReason to avoid null value
Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
---------
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
Co-authored-by: Peter El Hachem <peter.el.hachem@ibm.com>
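The stop-reason tracking described in the commits above could look roughly like the following. This is a minimal sketch, not the actual docling code: only the names VlmStopReason, VlmPrediction, and the UNSPECIFIED default come from the messages; the other enum members, the dataclass fields, and the page_status helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class VlmStopReason(str, Enum):
    # Member names other than UNSPECIFIED are hypothetical placeholders.
    END_OF_SEQUENCE = "end_of_sequence"
    MAX_TOKENS = "max_tokens"
    LOOP_DETECTED = "loop_detected"
    UNSPECIFIED = "unspecified"  # default, so the field is never None


@dataclass
class VlmPrediction:
    # Simplified stand-in for the real prediction model.
    text: str = ""
    stop_reason: VlmStopReason = VlmStopReason.UNSPECIFIED


def page_status(prediction: VlmPrediction) -> str:
    """Hypothetical helper: map a stop reason to a page-level status."""
    truncated = {VlmStopReason.MAX_TOKENS, VlmStopReason.LOOP_DETECTED}
    return "partial_success" if prediction.stop_reason in truncated else "success"
```

Defaulting to UNSPECIFIED rather than None means callers can compare against enum members without a null check.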
* fix(ocr): use PSM integer values directly instead of constructor
- Use integer psm value directly instead of calling tesserocr.PSM()
- Fixed in both main_psm and script_readers initialization
- tesserocr.PSM is a class with integer constants, not an enum
Fixes #2576
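To illustrate why calling the constructor fails, here is a small sketch that does not import tesserocr. PsmConstants and PsmEnum are hypothetical stand-ins: the first mimics a class that only holds integer constants, the second is a real enum shown for contrast. The values 3 and 6 match Tesseract's PSM_AUTO and PSM_SINGLE_BLOCK.

```python
from enum import IntEnum


class PsmConstants:
    """Hypothetical stand-in for a class that only holds integer constants."""
    AUTO = 3
    SINGLE_BLOCK = 6


class PsmEnum(IntEnum):
    """A real enum, shown only for contrast."""
    AUTO = 3
    SINGLE_BLOCK = 6


def resolve_psm(psm: int) -> int:
    """Pass the integer through unchanged, as the fix describes."""
    return psm


# Calling an enum with a value looks up the matching member.
enum_member = PsmEnum(3)

# Calling a plain constants class tries to construct an instance and fails.
try:
    PsmConstants(3)
    constructor_failed = False
except TypeError:
    constructor_failed = True
```

With a constants-only class, the configured integer is already the value the API expects, so no conversion step is needed.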
* DCO Remediation Commit for mulgyeol <mulgyeoljung@gmail.com>
I, mulgyeol <mulgyeoljung@gmail.com>, hereby add my Signed-off-by to this commit: da63a17a3c
Signed-off-by: mulgyeol <mulgyeoljung@gmail.com>
---------
Signed-off-by: mulgyeol <mulgyeoljung@gmail.com>
* fix: xlsx doc parsing, now returning values instead of formulas
Signed-off-by: glypt <8trash-can8@protonmail.ch>
* fix: add test for better coverage of xlsx backend
Signed-off-by: glypt <8trash-can8@protonmail.ch>
* fix: add the total of ducks as a formula in the tests/data
This also adds the test that the value 310 is contained in the table.
Without the fix from the previous commit, it would return "B7+C7"
Signed-off-by: glypt <8trash-can8@protonmail.ch>
---------
Signed-off-by: glypt <8trash-can8@protonmail.ch>
* docs(opensearch): update the example notebook RAG with OpenSearch
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* docs(uspto): remove direct usage of the backend class for conversion
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* docs: remove direct usage of backends from documentation
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: add backend options support to document backends
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: enhance document backends with generic backend options and improve HTML image handling
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* Refactor tests for declarativebackend
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(HTML): improve image caption handling and ensure backend options are set correctly
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix: enhance HTML backend image handling and add support for local file paths
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore: Add ground truth data for test data
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(HTML): skip loading SVG files in image data handling
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(html): simplify backend options and address gaps
Restrict backend options to DeclarativeDocumentBackend classes, and only when necessary.
Refactor caption parsing in 'img' elements and remove dummy text.
Replace deprecated annotations from the typing library with native types.
Replace typing annotations according to pydantic guidelines.
Add some documentation with pydantic annotations.
Fix diff issue with test files.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(html): add tests and fix bugs
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(html): refactor backend options
Move backend option classes to their own module within the datamodel package.
Rename 'source_location' to 'source_uri' in HTMLBackendOptions.
Rename 'image_fetch' to 'fetch_images' in HTMLBackendOptions.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(markdown): create a class for the markdown backend options
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* add mlx-whisper support
* added mlx-whisper example and test. update docling cli to use MLX automatically if present.
* fix pre-commit checks and added proper type safety
* fixed linter issue
* DCO Remediation Commit for Ken Steele <ksteele@gmail.com>
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: a979a680e1dc2fee8461401335cfb5dda8cfdd98
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 9827068382ca946fe1387ed83f747ae509fcf229
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: ebbeb45c7dc266260e1fad6bdb54a7041f8aeed4
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 2f6fd3cf46c8ca0bb98810191578278f1df87aa3
Signed-off-by: Ken Steele <ksteele@gmail.com>
* fix unit tests and code coverage for CI
* DCO Remediation Commit for Ken Steele <ksteele@gmail.com>
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 5e61bf11139a2133978db2c8d306be6289aed732
Signed-off-by: Ken Steele <ksteele@gmail.com>
* fix CI example test - mlx_whisper_example.py defaults to tests/data/audio/sample_10s.mp3 if no args specified.
Signed-off-by: Ken Steele <ksteele@gmail.com>
* refactor: centralize audio file extensions and MIME types in base_models.py
- Move audio file extensions from CLI hardcoded set to FormatToExtensions[InputFormat.AUDIO]
- Add support for additional audio formats: m4a, aac, ogg, flac, mp4, avi, mov
- Update FormatToMimeType mapping to include MIME types for all audio formats
- Update CLI auto-detection to use centralized FormatToExtensions mapping
- Add comprehensive tests for audio file auto-detection and pipeline selection
- Ensure explicit pipeline choices are not overridden by auto-detection
Fixes issue where only .mp3 and .wav files were processed as audio despite
CLI auto-detection working for all formats. The document converter now
properly recognizes all audio formats through MIME type detection.
Addresses review comments:
- Centralizes audio extensions in base_models.py as suggested
- Maintains existing auto-detection behavior while using centralized data
- Adds proper test coverage for the audio detection functionality
All examples and tests pass with the new centralized approach.
All audio formats (mp3, wav, m4a, aac, ogg, flac, mp4, avi, mov) now work correctly.
Signed-off-by: Ken Steele <ksteele@gmail.com>
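The centralization described in this commit can be sketched as a single mapping that every caller consults. This is an illustrative sketch rather than the contents of base_models.py: the reduced InputFormat enum, the FORMAT_TO_EXTENSIONS name, and the guess_format helper are assumptions; the audio extension list is the one given in the message.

```python
from enum import Enum
from pathlib import Path
from typing import Optional


class InputFormat(str, Enum):
    # Only the members needed for this sketch.
    AUDIO = "audio"
    PDF = "pdf"


# Single source of truth for extensions (hypothetical name and shape).
FORMAT_TO_EXTENSIONS = {
    InputFormat.AUDIO: ["mp3", "wav", "m4a", "aac", "ogg", "flac", "mp4", "avi", "mov"],
    InputFormat.PDF: ["pdf"],
}


def guess_format(path: str) -> Optional[InputFormat]:
    """Hypothetical helper: look up a format by file extension."""
    ext = Path(path).suffix.lower().lstrip(".")
    for fmt, extensions in FORMAT_TO_EXTENSIONS.items():
        if ext in extensions:
            return fmt
    return None
```

Because both the CLI and the converter would read the same mapping, adding a format in one place keeps the two in sync.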
* feat: address reviewer feedback - improve CLI auto-detection and add explicit model options
Review feedback addressed:
1. Fix CLI auto-detection to only switch to ASR pipeline when ALL files are audio
- Previously switched if ANY file was audio, now requires ALL files to be audio
- Added warning for mixed file types with guidance to use --pipeline asr
2. Add explicit WHISPER_X_MLX and WHISPER_X_NATIVE model options
- Users can now force specific implementations if desired
- Auto-selecting models (WHISPER_BASE, etc.) still choose best for hardware
- Added 12 new explicit model options: _MLX and _NATIVE variants for each size
CLI now supports:
- Auto-selecting: whisper_tiny, whisper_base, etc. (choose best for hardware)
- Explicit MLX: whisper_tiny_mlx, whisper_base_mlx, etc. (force MLX)
- Explicit Native: whisper_tiny_native, whisper_base_native, etc. (force native)
Addresses reviewer comments from @dolfim-ibm
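The auto-detection rule from item 1 can be sketched as follows. This is a hedged illustration, not the CLI code: choose_pipeline, is_audio, AUDIO_EXTENSIONS, and the "standard" pipeline name are assumptions; the ALL-files rule, the mixed-type warning, and the --pipeline asr flag come from the message.

```python
from pathlib import Path

# Extensions taken from the audio-format list earlier in this log;
# the constant name is hypothetical.
AUDIO_EXTENSIONS = {"mp3", "wav", "m4a", "aac", "ogg", "flac", "mp4", "avi", "mov"}


def is_audio(path):
    """Return True if the file extension is a known audio extension."""
    return Path(path).suffix.lower().lstrip(".") in AUDIO_EXTENSIONS


def choose_pipeline(paths, explicit=None):
    """Return (pipeline, warning). An explicit choice is never overridden."""
    if explicit is not None:
        return explicit, None
    flags = [is_audio(p) for p in paths]
    if flags and all(flags):
        # Switch to ASR only when every input file is audio.
        return "asr", None
    if any(flags):
        # Mixed inputs: keep the default pipeline and warn the user.
        return "standard", "Mixed file types; use --pipeline asr to force ASR."
    return "standard", None
```

Requiring all inputs to be audio avoids sending documents through the ASR pipeline just because one audio file is in the batch.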
Signed-off-by: Ken Steele <ksteele@gmail.com>
* DCO Remediation Commit for Ken Steele <ksteele@gmail.com>
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: c60e72d2b5
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 94803317a3
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 21905e8ace
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 96c669d155
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 8371c060ea
Signed-off-by: Ken Steele <ksteele@gmail.com>
* test(asr): add coverage for MLX options, pipeline helpers, and VLM prompts
- tests/test_asr_mlx_whisper.py: verify explicit MLX options (framework, repo ids)
- tests/test_asr_pipeline.py: cover _has_text/_determine_status and backend support with proper InputDocument/NoOpBackend wiring
- tests/test_interfaces.py: add BaseVlmPageModel.formulate_prompt tests (RAW/NONE/CHAT, invalid style), with minimal InlineVlmOptions scaffold
Improves reliability of ASR and VLM components by validating configuration paths and helper logic.
Signed-off-by: Ken Steele <ksteele@gmail.com>
* test(asr): broaden coverage for model selection, pipeline flows, and VLM prompts
- tests/test_asr_mlx_whisper.py
- Add MLX/native selector coverage across all Whisper sizes
- Validate repo_id choices under MLX and Native paths
- Cover fallback path when MPS unavailable and mlx_whisper missing
- tests/test_asr_pipeline.py
- Relax silent-audio assertion to accept PARTIAL_SUCCESS or SUCCESS
- Force CPU native path in helper tests to avoid torch in device selection
- Add language handling tests for native/MLX transcribe
- Cover native run success (BytesIO) and failure (exception) branches
- Cover MLX run success/failure branches with mocked transcribe
- Add init path coverage with artifacts_path
- tests/test_interfaces.py
- Add focused VLM prompt tests (NONE/CHAT variants)
Result: all tests passing with significantly improved coverage for ASR model selectors, pipeline execution paths, and VLM prompt formulation.
Signed-off-by: Ken Steele <ksteele@gmail.com>
* simplify ASR model settings (no pipeline detection needed)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* clean up disk space in runners
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Ken Steele <ksteele@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
* Export of DrawingML figures into docling document
* Adding libreoffice env var and libreoffice to checks image
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* DCO Remediation Commit for Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
I, Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>, hereby add my Signed-off-by to this commit: 9518fffcad
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Enforcing apt-get update
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Only display drawingml warning once per document
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* add util to test libreoffice and exclude files from test when not found
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* check libreoffice only once
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Only initialise converter if needed
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
* Use proper page concatenation in VLM pipeline MD/HTML conversion
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Update custom_convert.py
export_to_document_tokens is deprecated, so replace it with export_to_doctags
Signed-off-by: Jeremy Chen <github@jeremychen.email>
* fix(xlsx): deal with chartsheets in workbooks
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(xlsx): align test file names
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* Fix header support for rich tables in HTML
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* cleaning up
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Compatibility with older Python versions
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fixing Furniture before the first heading rule
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added minimalistic test case
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* added html for the test
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>