Commit Graph

610 Commits

Author SHA1 Message Date
Christoph Auer
ba50258f70
Merge 425f38a5aa into 98e2fcff63 2025-07-23 14:03:36 +00:00
Christoph Auer
425f38a5aa Clean up unused code
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-23 16:03:25 +02:00
Christoph Auer
de0d9b50a2 Option to enable threadpool with doc_batch_concurrency setting
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-23 15:52:12 +02:00
Christoph Auer
7b4db1940d Merge branch 'main' of github.com:DS4SD/docling into cau/async-pipeline-and-converter 2025-07-23 15:07:10 +02:00
Copilot
98e2fcff63
fix: Preserve PARTIAL_SUCCESS status when document timeout hits (#1975)
* Initial plan

* Initial investigation: analyze ReadingOrderModel timeout issue

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Complete timeout fix validation with tests and documentation

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Fix timeout status preservation issue by extending _determine_status method

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Fix the PARTIAL_SUCCESS case in _determine_status properly

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-23 13:50:40 +02:00
Copilot
8d50a59d48
fix: multi-page image support (tiff) (#1928)
* Initial plan

* Fix multi-page TIFF image support

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* add RGB conversion

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove pointless test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add multi-page TIFF test data and verification tests

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Revert "Add multi-page TIFF test data and verification tests"

This reverts commit 130a10e2d9.

* Proper test for 2 page tiff file

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* DCO Remediation Commit for copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 420df478f3
I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: c1d722725f
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 6aa85cc933
I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 130a10e2d9
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: d571f36299
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 2aab66288b

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Proper test for 2 page tiff file (2)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-23 09:55:40 +02:00
github-actions[bot]
ec971bbe68 chore: bump version to 2.42.1 [skip ci] 2025-07-22 16:45:48 +00:00
Christoph Auer
67441ca418
fix: Keep formula clusters also when empty (#1970)
Keep formula clusters also when empty

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-22 17:02:12 +02:00
Michele Dolfi
90a7cc4bdd
docs: enrich existing DoclingDocument (#1969)
add example for enriching an existing doclingdocument

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-07-22 16:20:15 +02:00
Cesar Berrospi Ramis
a069b1175b
refactor(HTML): handle text from styled html (#1960)
* A new HTML backend that handles styled HTML (ignores it) as well as images.

Images are parsed as placeholders with a caption, if it exists.

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: vaaale <2428222+vaaale@users.noreply.github.com>
Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>

* tests(HTML): re-enable test_ordered_lists

Re-enable test_ordered_lists regression test for the HTML backend since
docling-core now supports ordered lists with custom start value.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>
Co-authored-by: Alexander Vaagan <2428222+vaaale@users.noreply.github.com>
2025-07-22 13:16:31 +02:00
Fabiano Franz
5d98bcea1b
docs: add documentation for confidence scores (#1912)
* docs: add documentation for confidence scores

Signed-off-by: Fabiano Franz <contact@fabianofranz.com>

* Increase focus on confidence grades, scores are informational only

Signed-off-by: Fabiano Franz <contact@fabianofranz.com>

* Update confidence_scores.md

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

---------

Signed-off-by: Fabiano Franz <contact@fabianofranz.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2025-07-21 10:16:17 +02:00
Christoph Auer
c33cc217cd Merge from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-19 17:28:13 +02:00
Christoph Auer
558ea957a8 Fix: python3.9 compat
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-19 17:17:01 +02:00
Christoph Auer
7762391b3e DCO Remediation Commit for Christoph Auer <cau@zurich.ibm.com>
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: fa71cde950
I, Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>, hereby add my Signed-off-by to this commit: d66da87d96

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-19 17:13:36 +02:00
Christoph Auer
ac9f8e0761 Fix: don't starve on docs with > max_queue_size pages
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-19 17:13:11 +02:00
Christoph Auer
009cc24d0d Fix: don't starve on docs with > max_queue_size pages
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-19 17:11:46 +02:00
Christoph Auer
0579d3a3d2 Fix: don't starve on docs with > max_queue_size pages
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-19 17:11:32 +02:00
github-actions[bot]
7561be537a chore: bump version to 2.42.0 [skip ci] 2025-07-18 15:34:59 +00:00
Christoph Auer
b36ad76b2a Stop accumulating docs in test run
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-18 17:22:41 +02:00
Ubuntu
d66da87d96 Merge branch 'cau/async-pipeline-and-converter' of github.com:docling-project/docling into cau/async-pipeline-and-converter 2025-07-18 15:19:57 +00:00
Ubuntu
89acdb5db2 Update threaded test
Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
2025-07-18 15:18:21 +00:00
Christoph Auer
f6015bf8ae Remove redundant method
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-18 17:17:24 +02:00
Christoph Auer
fa71cde950 Revert "Unload doc backend"
This reverts commit 01066f0b6e.
2025-07-18 16:54:27 +02:00
Christoph Auer
01066f0b6e Unload doc backend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-18 16:48:35 +02:00
Christoph Auer
988db91bff Reorder test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-18 15:24:08 +02:00
Christoph Auer
cca05c45ea
fix: Safe pipeline init, use device_map in transformers models (#1917)
* Use device_map for transformer models

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add accelerate

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Relax accelerate min version

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make pipeline cache+init thread-safe

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-18 15:14:36 +02:00
Christoph Auer
33a24848a0 Revise pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-18 14:33:03 +02:00
Christoph Auer
9fd01f3399 Add test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-17 20:49:37 +02:00
Christoph Auer
04085ba86d Remove unused args
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-16 17:50:38 +02:00
Christoph Auer
4397bb2c44 Pin docling-ibm-models
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-16 17:35:40 +02:00
Christoph Auer
8c905f3e70 Better threaded PDF pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-16 17:01:33 +02:00
Cesar Berrospi Ramis
e1e3053695
fix: fix HTML table parser and JATS backend bugs (#1948)
Fix a bug in parsing HTML tables in HTML backend.
Fix a bug in test file that prevented JATS backend tests.
Ensure that the JATS backend creates headings with the right level.
Remove unnecessary data files for testing JATS backend.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-07-16 10:49:24 +02:00
Christoph Auer
f98c7e21dd Cleanups and safety improvements
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-16 10:46:32 +02:00
Christoph Auer
0be9349884 Refactoring into async pipeline primitives and graph
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-16 10:13:11 +02:00
Christoph Auer
ef25d03bc8 UpstreamAwareQueue
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-15 20:09:05 +02:00
Christoph Auer
f56de726f3 Initial async pdf pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-15 19:25:48 +02:00
stephencox-ict
d6d2dbe2f9
docs: Fix typos (#1943)
Fix typos

Signed-off-by: stephencox-ict <scox@ict.co>
2025-07-15 09:51:56 +02:00
Christoph Auer
a436be7367
feat: Add option to control empty clusters in layout postprocessing (#1940)
Add option to control empty clusters in layout postprocessing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-14 18:32:01 +02:00
Copilot
95e70962f1
fix: KeyError: 'fPr' when processing latex fractions in DOCX files (#1926)
* Initial plan

* Initial analysis and fix for KeyError: 'fPr' in OMML fraction processing

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Add comprehensive test for OMML fraction fPr fix

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Use debug logging, remove unnecessary test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-11 09:52:14 +02:00
Copilot
c5fb353f10
fix: Change granite vision model URL from preview to stable version (#1925)
* Initial plan

* Fix granite vision model URL from preview to stable version

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Update to granite vision 3.3

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Update to granite vision 3.3 (2)

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

---------

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>
2025-07-11 08:46:03 +02:00
github-actions[bot]
6c4bf9d087 chore: bump version to 2.41.0 [skip ci] 2025-07-10 14:25:05 +00:00
Christoph Auer
cc6193b3b9
test: Update tests to use default PDF backend (DPv4) (#1923)
* Update tests to use default PDF backend (DPv4)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* OCR tests use DPv1 until rotation bugs are fixed

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-10 15:16:56 +02:00
Christoph Auer
2b8616d6d5
feat: Layout model specification and multiple choices (#1910)
* Establish layout_model spec and example instantiations

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updated naming

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Back to uppercase constants

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix deps issue with openai-whisper>numba>llvmlite

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pull v1 changed test GT from main

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-10 06:37:27 +02:00
Panos Vagenas
ec588df971
feat: enable precision control in float serialization (#1914)
* chore: propagate precision control in float serialization

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* parametrize float serialization, propagate core updates

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update test float precision

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* repin docling-core

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-07-09 16:39:17 +02:00
Clément Doumouro
931eb55b88
fix(ocr-utils): unit test and fix the rotate_bounding_box function (#1897)
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
2025-07-08 18:03:29 +02:00
geoHeil
a07ba863c4
feat: add image-text-to-text models in transformers (#1772)
* feat(dolphin): add dolphin support

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* rename

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* reformat

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* fix mypy

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* add prompt style and examples

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-07-08 05:54:57 +02:00
VIktor Kuropiantnyk
e25873d557
fix: docs are missing osd packages for tesseract on RHEL (#1905)
Fixed missing packages in the docs on tesseract

Signed-off-by: Viktor Kuropiatnyk <vku@zurich.ibm.com>
2025-07-07 17:06:26 +02:00
Shkarupa Alex
b8813eea80
feat(vlm): Dynamic prompts (#1808)
* Unify temperature options for Vlm models

* Dynamic prompt support with example

* DCO Remediation Commit for Shkarupa Alex <shkarupa.alex@gmail.com>

I, Shkarupa Alex <shkarupa.alex@gmail.com>, hereby add my Signed-off-by to this commit: 34d446cb98
I, Shkarupa Alex <shkarupa.alex@gmail.com>, hereby add my Signed-off-by to this commit: 9c595d574f

Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>

* Replace Page with SegmentedPage

* Fix example HF repo link

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Sign-off

Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>

* DCO Remediation Commit for Shkarupa Alex <shkarupa.alex@gmail.com>

I, Shkarupa Alex <shkarupa.alex@gmail.com>, hereby add my Signed-off-by to this commit: 1a162066dd

Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>

Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>

* Use lmstudio-community model

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Swap inference engine to LM Studio

Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>

---------

Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2025-07-07 16:58:42 +02:00
Michele Dolfi
edd4356aac
fix: use only backend for picture classifier (#1904)
use backend for picture classifier

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-07-07 16:23:16 +02:00
Michele Dolfi
dd8fde7f19
fix: typo in asr options (#1902)
fix typo

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-07-07 08:59:14 +02:00