Commit Graph

618 Commits

Author SHA1 Message Date
Christoph Auer
4a107f4f57 Adjust example instantiation of multi-stage VLM pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-18 14:36:42 +02:00
Christoph Auer
3d07f1c78e Cleanup hf_transformers_model batching impl
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-18 13:37:46 +02:00
Christoph Auer
fead482e92 Merge from main, include decode_response
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-18 11:29:15 +02:00
Christoph Auer
e372cfe01a Small fixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-18 11:12:02 +02:00
krrome
9687297262 feat(html): Support in-line anchor tags in HTML texts (#1659)
* re-implement links for html backend.

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

* fix inline groups in list items. write specific test for find_parent_annotation of _extract_text_and_hyperlink_recursively.

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

* implement hack for images.

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

---------

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
2025-08-18 09:57:16 +02:00
Eric Deandrea
76c1fbd6e8 docs: Add docling Quarkus integration (#2083)
* Add docling Quarkus integration

* DCO Remediation Commit for Eric Deandrea <eric.deandrea@ibm.com>

I, Eric Deandrea <eric.deandrea@ibm.com>, hereby add my Signed-off-by to this commit: 86aa0b80f4

Signed-off-by: Eric Deandrea <eric.deandrea@ibm.com>

---------

Signed-off-by: Eric Deandrea <eric.deandrea@ibm.com>
2025-08-18 06:55:51 +02:00
Christoph Auer
f42676aab9 Implement proper batch inference for HuggingFaceTransformersVlmModel
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-15 17:56:14 +02:00
Christoph Auer
1aa522792a Tweak defaults
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-15 14:49:34 +02:00
Christoph Auer
16fea9cd8b Add VLLM backend support, optimize process_images
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-15 13:30:39 +02:00
Christoph Auer
18b1a43744 Fix KeyboardInterrupt behaviour
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-14 21:11:40 +02:00
Christoph Auer
52b54b21c3 Remove prints
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-14 20:48:34 +02:00
Christoph Auer
c4de11bdb3 Add VLM task interpreters
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-14 20:48:10 +02:00
Christoph Auer
c8737f71da Add VLM task interpreters
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-14 20:44:23 +02:00
Christoph Auer
78c13e1dad Add multithreaded VLM pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-14 19:26:58 +02:00
Christoph Auer
126944c7ee Prepare existing code for use with new multi-stage VLM pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-13 14:02:19 +02:00
Shkarupa Alex
5f050f94e1 feat(vlm): Ability to preprocess VLM response (#1907)
* Add ability to preprocess VLM response

Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>

* Move response decoding to vlm options (requires inheritance to override). Per-page prompt formulation also moved to vlm options to keep api consistent.

Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>

---------

Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>
2025-08-12 15:20:24 +02:00
github-actions[bot]
ccfee05847 chore: bump version to 2.44.0 [skip ci] v2.44.0
2025-08-12 09:51:35 +00:00
Peter W. J. Staar
b09033cb73 feat: add convert_string to document-converter (#2069)
* feat: add convert_string to document-converter

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fix unsupported operand type(s) for |: type and NoneType

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added tests for convert_string

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-08-12 11:02:38 +02:00
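
The `unsupported operand type(s) for |` error mentioned in the commit above is what Python 3.9 typically raises when an annotation such as `str | None` is evaluated at runtime, because the `|` union syntax for types only arrived in Python 3.10. A minimal sketch of the usual workaround follows; `label_for` is a hypothetical helper for illustration and is not taken from the docling code.

```python
from typing import Optional


# Hypothetical helper, not part of docling.
# Optional[str] is accepted on Python 3.9, whereas evaluating the
# annotation `str | None` at runtime needs Python 3.10 or newer.
def label_for(name: Optional[str] = None) -> str:
    """Return the given name, or a fallback label when no name is set."""
    return "unnamed" if name is None else name
```

Calling `label_for()` falls back to the default label, while `label_for("report.md")` returns the name unchanged.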
Panos Vagenas
e2cca931be docs: add Langflow integration (#2068)
* docs: add langflow integration

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* fix link

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-08-11 16:03:29 +02:00
Maroun Touma
ed56f2de5d fix(html): Parse rowspan and colspan when they include non-numerical values (#2048)
* use re to stop at first non-digit

Signed-off-by: Maroun Touma <touma@us.ibm.com>

* Allow a digit in first place followed by non-numerical values

Signed-off-by: Maroun Touma <touma@us.ibm.com>

* refactor to match type checker

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Maroun Touma <touma@us.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-08-11 13:53:29 +02:00
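
The approach described in the commit above (use `re` to stop at the first non-digit) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code from the pull request: the function name `parse_span` and the fallback value of 1 are hypothetical.

```python
import re
from typing import Optional


def parse_span(value: Optional[str], default: int = 1) -> int:
    """Read the leading integer of a span attribute value.

    Hypothetical helper: anything after the first run of digits is
    ignored, and values without a leading digit fall back to `default`.
    """
    if value is None:
        return default
    # re.match anchors at the start, so this captures only the leading digits.
    match = re.match(r"\s*(\d+)", value)
    if match is None:
        return default
    return int(match.group(1))
```

With this sketch, a value such as `"2px"` or `"3;"` yields the leading number, and a value with no leading digit, such as `"abc"`, yields the fallback.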
Thomas Vitale
bfda6d34d8 docs: Add Arconia integration (#2061)
Signed-off-by: Thomas Vitale <ThomasVitale@users.noreply.github.com>
2025-08-08 09:35:47 +02:00
Michele Dolfi
c5f49dc2db chore: upgrade locked dependencies (#2024)
lock new deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-07-31 16:05:27 +02:00
TwoLeaves
0130e3ae96 fix: support new mlx-vlm module (#2001)
* fix stream_generate import statement

Signed-off-by: TwoLeaves <ohneherren@gmail.com>

* pin new mlx-vlm

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: TwoLeaves <ohneherren@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-07-31 14:13:17 +02:00
Michele Dolfi
2eb760d060 fix: extend error reporting when verbose logging is enabled (#2017)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-07-30 11:23:26 +02:00
Cesar Berrospi Ramis
86f70128aa fix(HTML): replace non-standard Unicode characters (#2006)
chore(HTML): replace non-standard Unicode characters for better downstream tasks

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-07-29 11:05:35 +02:00
github-actions[bot]
aae42b37a8 chore: bump version to 2.43.0 [skip ci] v2.43.0
2025-07-28 09:45:53 +00:00
Christoph Auer
aed772ab33 feat: Threaded PDF pipeline (#1951)
* Initial async pdf pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* UpstreamAwareQueue

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Refactoring into async pipeline primitives and graph

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanups and safety improvements

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Better threaded PDF pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pin docling-ibm-models

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove unused args

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Revise pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Unload doc backend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Revert "Unload doc backend"

This reverts commit 01066f0b6e.

* Remove redundant method

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update threaded test

Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>

* Stop accumulating docs in test run

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix: don't starve on docs with > max_queue_size pages

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix: don't starve on docs with > max_queue_size pages

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* DCO Remediation Commit for Christoph Auer <cau@zurich.ibm.com>

I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: fa71cde950
I, Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>, hereby add my Signed-off-by to this commit: d66da87d96

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix: python3.9 compat

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Option to enable threadpool with doc_batch_concurrency setting

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Clean up unused code

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix settings defaults expectations

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Use released docling-ibm-models

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove ignores for typing/linting

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
2025-07-26 11:49:37 +02:00
Cesar Berrospi Ramis
aec29a7315 fix(markdown): ensure correct parsing of nested lists (#1995)
* fix(markdown): ensure correct parsing of nested lists

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: update dependencies in uv.lock file

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-07-25 15:17:57 +02:00
Christoph Auer
1985841a19 ci: Fixes for test GT (#1992)
Fixes for test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-25 12:28:06 +02:00
Cesar Berrospi Ramis
945721a15d fix(HTML): remove an unnecessary print command (#1988)
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-07-25 08:45:15 +02:00
github-actions[bot]
8227841c1b chore: bump version to 2.42.2 [skip ci] v2.42.2
2025-07-24 10:21:10 +00:00
Cesar Berrospi Ramis
5132f061a8 fix(HTML): concatenation of child strings in table cells and list items (#1981)
fix(HTML): ensure correct concatenation of child strings in table cells and list items

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-07-24 11:19:25 +02:00
Michele Dolfi
7b5f86098d docs: add chat with dosu (#1984)
add chat with dosu

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-07-24 11:07:36 +02:00
Rafael Teixeira de Lima
0b83609531 fix(docx): Adding plain latex equations to table cells (#1986)
* Adding plain latex equations to table cells

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Adding test files

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-07-24 11:02:24 +02:00
Copilot
98e2fcff63 fix: Preserve PARTIAL_SUCCESS status when document timeout hits (#1975)
* Initial plan

* Initial investigation: analyze ReadingOrderModel timeout issue

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Complete timeout fix validation with tests and documentation

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Fix timeout status preservation issue by extending _determine_status method

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Fix the PARTIAL_SUCCESS case in _determine_status properly

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-23 13:50:40 +02:00
Copilot
8d50a59d48 fix: multi-page image support (tiff) (#1928)
* Initial plan

* Fix multi-page TIFF image support

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* add RGB conversion

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove pointless test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add multi-page TIFF test data and verification tests

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Revert "Add multi-page TIFF test data and verification tests"

This reverts commit 130a10e2d9.

* Proper test for 2 page tiff file

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* DCO Remediation Commit for copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 420df478f3
I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: c1d722725f
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 6aa85cc933
I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 130a10e2d9
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: d571f36299
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 2aab66288b

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Proper test for 2 page tiff file (2)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-23 09:55:40 +02:00
github-actions[bot]
ec971bbe68 chore: bump version to 2.42.1 [skip ci] v2.42.1
2025-07-22 16:45:48 +00:00
Christoph Auer
67441ca418 fix: Keep formula clusters also when empty (#1970)
Keep formula clusters also when empty

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-22 17:02:12 +02:00
Michele Dolfi
90a7cc4bdd docs: enrich existing DoclingDocument (#1969)
add example for enriching an existing doclingdocument

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-07-22 16:20:15 +02:00
Cesar Berrospi Ramis
a069b1175b refactor(HTML): handle text from styled html (#1960)
* A new HTML backend that handles styled HTML (ignores it) as well as images.

Images are parsed as placeholders with a caption, if it exists.

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: vaaale <2428222+vaaale@users.noreply.github.com>
Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>

* tests(HTML): re-enable test_ordered_lists

Re-enable test_ordered_lists regression test for the HTML backend since
docling-core now supports ordered lists with custom start value.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>
Co-authored-by: Alexander Vaagan <2428222+vaaale@users.noreply.github.com>
2025-07-22 13:16:31 +02:00
Fabiano Franz
5d98bcea1b docs: add documentation for confidence scores (#1912)
* docs: add documentation for confidence scores

Signed-off-by: Fabiano Franz <contact@fabianofranz.com>

* Increase focus on confidence grades, scores are informational only

Signed-off-by: Fabiano Franz <contact@fabianofranz.com>

* Update confidence_scores.md

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

---------

Signed-off-by: Fabiano Franz <contact@fabianofranz.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2025-07-21 10:16:17 +02:00
github-actions[bot]
7561be537a chore: bump version to 2.42.0 [skip ci] v2.42.0
2025-07-18 15:34:59 +00:00
Christoph Auer
cca05c45ea fix: Safe pipeline init, use device_map in transformers models (#1917)
* Use device_map for transformer models

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add accelerate

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Relax accelerate min version

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make pipeline cache+init thread-safe

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-18 15:14:36 +02:00
Cesar Berrospi Ramis
e1e3053695 fix: fix HTML table parser and JATS backend bugs (#1948)
Fix a bug in parsing HTML tables in HTML backend.
Fix a bug in test file that prevented JATS backend tests.
Ensure that the JATS backend creates headings with the right level.
Remove unnecessary data files for testing JATS backend.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-07-16 10:49:24 +02:00
stephencox-ict
d6d2dbe2f9 docs: Fix typos (#1943)
Fix typos

Signed-off-by: stephencox-ict <scox@ict.co>
2025-07-15 09:51:56 +02:00
Christoph Auer
a436be7367 feat: Add option to control empty clusters in layout postprocessing (#1940)
Add option to control empty clusters in layout postprocessing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-14 18:32:01 +02:00
Copilot
95e70962f1 fix: KeyError: 'fPr' when processing latex fractions in DOCX files (#1926)
* Initial plan

* Initial analysis and fix for KeyError: 'fPr' in OMML fraction processing

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Add comprehensive test for OMML fraction fPr fix

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Use debug logging, remove unnecessary test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-11 09:52:14 +02:00
Copilot
c5fb353f10 fix: Change granite vision model URL from preview to stable version (#1925)
* Initial plan

* Fix granite vision model URL from preview to stable version

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Update to granite vision 3.3

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Update to granite vision 3.3 (2)

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

---------

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>
2025-07-11 08:46:03 +02:00
github-actions[bot]
6c4bf9d087 chore: bump version to 2.41.0 [skip ci] v2.41.0
2025-07-10 14:25:05 +00:00
Christoph Auer
cc6193b3b9 test: Update tests to use default PDF backend (DPv4) (#1923)
* Update tests to use default PDF backend (DPv4)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* OCR tests use DPv1 until rotation bugs are fixed

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-10 15:16:56 +02:00