Commit Graph

620 Commits

Author SHA1 Message Date
Christoph Auer
e6d5e4e48f Remove ignores for typing/linting
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-25 12:37:55 +02:00
Christoph Auer
02a7deb882 Merge branch 'main' of github.com:DS4SD/docling into cau/async-pipeline-and-converter
2025-07-25 12:28:31 +02:00
Christoph Auer
1985841a19
ci: Fixes for test GT (#1992)
Fixes for test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-25 12:28:06 +02:00
Cesar Berrospi Ramis
945721a15d
fix(HTML): remove an unnecessary print command (#1988)
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-07-25 08:45:15 +02:00
Christoph Auer
744a013a32 Use released docling-ibm-models
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-24 17:01:04 +02:00
Christoph Auer
df257bf90e Fix settings defaults expectations
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-24 15:08:35 +02:00
Christoph Auer
4040bd6618 Merge branch 'main' of github.com:DS4SD/docling into cau/async-pipeline-and-converter
2025-07-24 15:07:00 +02:00
github-actions[bot]
8227841c1b chore: bump version to 2.42.2 [skip ci]
2025-07-24 10:21:10 +00:00
Cesar Berrospi Ramis
5132f061a8
fix(HTML): concatenation of child strings in table cells and list items (#1981)
fix(HTML): ensure correct concatenation of child strings in table cells and list items

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-07-24 11:19:25 +02:00
Michele Dolfi
7b5f86098d
docs: add chat with dosu (#1984)
add chat with dosu

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-07-24 11:07:36 +02:00
Rafael Teixeira de Lima
0b83609531
fix(docx): Adding plain latex equations to table cells (#1986)
* Adding plain latex equations to table cells

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Adding test files

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-07-24 11:02:24 +02:00
Christoph Auer
425f38a5aa Clean up unused code
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-23 16:03:25 +02:00
Christoph Auer
de0d9b50a2 Option to enable threadpool with doc_batch_concurrency setting
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-23 15:52:12 +02:00
Christoph Auer
7b4db1940d Merge branch 'main' of github.com:DS4SD/docling into cau/async-pipeline-and-converter
2025-07-23 15:07:10 +02:00
Copilot
98e2fcff63
fix: Preserve PARTIAL_SUCCESS status when document timeout hits (#1975)
* Initial plan

* Initial investigation: analyze ReadingOrderModel timeout issue

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Complete timeout fix validation with tests and documentation

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Fix timeout status preservation issue by extending _determine_status method

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Fix the PARTIAL_SUCCESS case in _determine_status properly

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-23 13:50:40 +02:00
Copilot
8d50a59d48
fix: multi-page image support (tiff) (#1928)
* Initial plan

* Fix multi-page TIFF image support

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* add RGB conversion

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove pointless test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add multi-page TIFF test data and verification tests

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Revert "Add multi-page TIFF test data and verification tests"

This reverts commit 130a10e2d9.

* Proper test for 2 page tiff file

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* DCO Remediation Commit for copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 420df478f3
I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: c1d722725f
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 6aa85cc933
I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 130a10e2d9
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: d571f36299
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 2aab66288b

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Proper test for 2 page tiff file (2)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-23 09:55:40 +02:00
github-actions[bot]
ec971bbe68 chore: bump version to 2.42.1 [skip ci]
2025-07-22 16:45:48 +00:00
Christoph Auer
67441ca418
fix: Keep formula clusters also when empty (#1970)
Keep formula clusters also when empty

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-22 17:02:12 +02:00
Michele Dolfi
90a7cc4bdd
docs: enrich existing DoclingDocument (#1969)
add example for enriching an existing doclingdocument

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-07-22 16:20:15 +02:00
Cesar Berrospi Ramis
a069b1175b
refactor(HTML): handle text from styled html (#1960)
* A new HTML backend that handles styled HTML (ignores it) as well as images.

Images are parsed as placeholders with a caption, if it exists.

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: vaaale <2428222+vaaale@users.noreply.github.com>
Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>

* tests(HTML): re-enable test_ordered_lists

Re-enable test_ordered_lists regression test for the HTML backend since
docling-core now supports ordered lists with custom start value.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>
Co-authored-by: Alexander Vaagan <2428222+vaaale@users.noreply.github.com>
2025-07-22 13:16:31 +02:00
Fabiano Franz
5d98bcea1b
docs: add documentation for confidence scores (#1912)
* docs: add documentation for confidence scores

Signed-off-by: Fabiano Franz <contact@fabianofranz.com>

* Increase focus on confidence grades, scores are informational only

Signed-off-by: Fabiano Franz <contact@fabianofranz.com>

* Update confidence_scores.md

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

---------

Signed-off-by: Fabiano Franz <contact@fabianofranz.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2025-07-21 10:16:17 +02:00
Christoph Auer
c33cc217cd Merge from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-19 17:28:13 +02:00
Christoph Auer
558ea957a8 Fix: python3.9 compat
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-19 17:17:01 +02:00
Christoph Auer
7762391b3e DCO Remediation Commit for Christoph Auer <cau@zurich.ibm.com>
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: fa71cde950
I, Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>, hereby add my Signed-off-by to this commit: d66da87d96

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-19 17:13:36 +02:00
Christoph Auer
ac9f8e0761 Fix: don't starve on docs with > max_queue_size pages
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-19 17:13:11 +02:00
Christoph Auer
009cc24d0d Fix: don't starve on docs with > max_queue_size pages
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-19 17:11:46 +02:00
Christoph Auer
0579d3a3d2 Fix: don't starve on docs with > max_queue_size pages
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-19 17:11:32 +02:00
github-actions[bot]
7561be537a chore: bump version to 2.42.0 [skip ci]
2025-07-18 15:34:59 +00:00
Christoph Auer
b36ad76b2a Stop accumulating docs in test run
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-18 17:22:41 +02:00
Ubuntu
d66da87d96 Merge branch 'cau/async-pipeline-and-converter' of github.com:docling-project/docling into cau/async-pipeline-and-converter
2025-07-18 15:19:57 +00:00
Ubuntu
89acdb5db2 Update threaded test
Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
2025-07-18 15:18:21 +00:00
Christoph Auer
f6015bf8ae Remove redundant method
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-18 17:17:24 +02:00
Christoph Auer
fa71cde950 Revert "Unload doc backend"
This reverts commit 01066f0b6e.
2025-07-18 16:54:27 +02:00
Christoph Auer
01066f0b6e Unload doc backend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-18 16:48:35 +02:00
Christoph Auer
988db91bff Reorder test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-18 15:24:08 +02:00
Christoph Auer
cca05c45ea
fix: Safe pipeline init, use device_map in transformers models (#1917)
* Use device_map for transformer models

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add accelerate

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Relax accelerate min version

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make pipeline cache+init thread-safe

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-18 15:14:36 +02:00
Christoph Auer
33a24848a0 Revise pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-18 14:33:03 +02:00
Christoph Auer
9fd01f3399 Add test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-17 20:49:37 +02:00
Christoph Auer
04085ba86d Remove unused args
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-16 17:50:38 +02:00
Christoph Auer
4397bb2c44 Pin docling-ibm-models
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-16 17:35:40 +02:00
Christoph Auer
8c905f3e70 Better threaded PDF pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-16 17:01:33 +02:00
Cesar Berrospi Ramis
e1e3053695
fix: fix HTML table parser and JATS backend bugs (#1948)
Fix a bug in parsing HTML tables in HTML backend.
Fix a bug in test file that prevented JATS backend tests.
Ensure that the JATS backend creates headings with the right level.
Remove unnecessary data files for testing JATS backend.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-07-16 10:49:24 +02:00
Christoph Auer
f98c7e21dd Cleanups and safety improvements
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-16 10:46:32 +02:00
Christoph Auer
0be9349884 Refactoring into async pipeline primitives and graph
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-16 10:13:11 +02:00
Christoph Auer
ef25d03bc8 UpstreamAwareQueue
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-15 20:09:05 +02:00
Christoph Auer
f56de726f3 Initial async pdf pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-15 19:25:48 +02:00
stephencox-ict
d6d2dbe2f9
docs: Fix typos (#1943)
Fix typos

Signed-off-by: stephencox-ict <scox@ict.co>
2025-07-15 09:51:56 +02:00
Christoph Auer
a436be7367
feat: Add option to control empty clusters in layout postprocessing (#1940)
Add option to control empty clusters in layout postprocessing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-14 18:32:01 +02:00
Copilot
95e70962f1
fix: KeyError: 'fPr' when processing latex fractions in DOCX files (#1926)
* Initial plan

* Initial analysis and fix for KeyError: 'fPr' in OMML fraction processing

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Add comprehensive test for OMML fraction fPr fix

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Use debug logging, remove unnecessary test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-11 09:52:14 +02:00
Copilot
c5fb353f10
fix: Change granite vision model URL from preview to stable version (#1925)
* Initial plan

* Fix granite vision model URL from preview to stable version

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Update to granite vision 3.3

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Update to granite vision 3.3 (2)

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

---------

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>
2025-07-11 08:46:03 +02:00