Commit Graph

153 Commits

Author SHA1 Message Date
Cesar Berrospi Ramis
fa3327e1a6 fix(html): preserve code blocks in list items (#2131)
* chore(html): refactor parser to leverage context managers

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(html): parse inline code snippets, also from list items

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(html): remove hidden tags

Remove tags that are not meant to be displayed.
Add regression tests for code blocks, inline code, and hidden tags.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-08-26 06:43:48 +02:00
geoHeil
3f60a0fa78 feat: Upgrade to RapidOCR 3.x (#2088)
* feat: exploring new version

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 5815c8f81b0e5ce400332597b6795e5a97ecf775

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* chore: autoformat

DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 5815c8f81b0e5ce400332597b6795e5a97ecf775

* feat: enable configurable runtime for rapidocr and handle new result better;

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* chore: fix linter

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* chore: use new server model

* chore:  change default engine type to onnx

* chore: tests update for new rapidocr

* fix: rebase from main and fix clashes

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 5815c8f81b0e5ce400332597b6795e5a97ecf775
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 02f9db85f562e5cdfda40c52fee55cfd4030d70a
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a7bcb205faedb881f94a89b3bbd29cb31ccd54f0
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a39482a98cbcff7a825c8321134732af0c65930a
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 63e9d717fa26951566b02761f3fdfc752c31f805
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: ef12a6ec1ea2846a8a8e2e776eeaa59c2a0c4dfe

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 2222d2340387f8d9d66f3ca9d8e21a0945a44e7a
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: bc6a1dc507d7f146ec4797a2d3840414f46ac64d
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 56e0d67da7c57d4b5caf8eaef8dff7056c3efd32
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 871ca21271412006c76acf3c19426140efed3d50
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 7b1b77159da729d483a581a86c7309acba1712a7
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a792a714a43e19a91b2b782f54621c1c5efda632

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: d1fed26323ff829b716bc667fe69532839363e45
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 346ec1cad943765f886e5d17fb0a54221124689c
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 4d0bbe5bd6e9f7261b97362ff8823af244267089
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 34a5ad53892a7064a6bf35f890d344d464c78b2f
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 9151959db3ad53535011d1cfdcf9181fdf936bb1
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 8ef5536f2c098826c6c0a05190f8a80614c3f3cb

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 7e18637a35
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 63fb8ff599
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 0cb9444fb8
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 38940d9978
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: b6d461ac42
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: ee55eb3408

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

---------

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
2025-08-25 12:10:33 +02:00
Michele Dolfi
449bde0a6c test: update docx reference results (#2122)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-08-22 14:26:36 +02:00
Nikhil Verma
3f03709885 fix: Improve numbered list detection for msword docs (#2100)
* Improve numbered list detection for msword docs

This fixes the list detection in MSWord docs by properly tracking and counting
the list entries. It fixes
https://github.com/docling-project/docling/issues/2090

* DCO Remediation Commit for Nikhil Verma <nikhilgotmail@gmail.com>

I, Nikhil Verma <nikhilgotmail@gmail.com>, hereby add my Signed-off-by to this commit: 509da6658e

Signed-off-by: Nikhil Verma <nikhilgotmail@gmail.com>

---------

Signed-off-by: Nikhil Verma <nikhilgotmail@gmail.com>
2025-08-22 10:38:34 +02:00
krrome
94fcc46aa9 feat(html): Support formatting tags in HTML texts (#2111)
* add parsing for formatting tags in HTML backend

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

fix latest tests + wiki_duck result files.

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

* convert _collect_parent_format_tags to staticmethod

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

---------

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
2025-08-22 10:37:34 +02:00
Panos Vagenas
76d2cb76b3 chore: update docling-core lock (#2110)
* chore: pre-check docling-core 2.45.0

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update -core pinning

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-08-20 16:41:48 +02:00
Christoph Auer
5f57ff2a45 perf: Clean up resources with docling-parse v4, no parsed_page output by default (#2105)
* Call PdfDocument.unload_pages from the pipelines where needed, delete parsed_page data unless requested to keep

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* pin docling-parse and update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Reinstate pipeline_options.generate_parsed_page

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-08-20 10:46:31 +02:00
Cesar Berrospi Ramis
c5f2e2fdd6 fix(HTML): parse footer tag as a group in furniture content layer (#2106)
* fix(HTML): parse footer tag as a section in furniture

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(HTML): add test for body vs furniture in HTML parser.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-08-20 08:42:25 +02:00
Michele Dolfi
956f82f115 chore: upgrade dependencies in lock file (#2093)
* chore: upgrade lock file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix(markdown): update binary hash of a markdown backend ground truth file

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-08-19 10:11:44 +02:00
Michele Dolfi
31087f3fcc feat: add backend for METS with Google Books profile (#1989)
* add backend for METS with Google Books profile

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Fixes for cell indexing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* use HTMLParser and add options from CLI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix typing and unloading

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* restore guess format

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename inputformat

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use PdfDocumentBackend

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use test file from test folder (still missing)

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add test file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-18 11:43:20 +02:00
krrome
9687297262 feat(html): Support in-line anchor tags in HTML texts (#1659)
* re-implement links for html backend.

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

* fix inline groups in list items. write specific test for find_parent_annotation of _extract_text_and_hyperlink_recursively.

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

* implement hack for images.

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

---------

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
2025-08-18 09:57:16 +02:00
Peter W. J. Staar
b09033cb73 feat: add convert_string to document-converter (#2069)
* feat: add convert_string to document-converter

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fix unsupported operand type(s) for |: type and NoneType

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added tests for convert_string

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-08-12 11:02:38 +02:00
Cesar Berrospi Ramis
86f70128aa fix(HTML): replace non-standard Unicode characters (#2006)
chore(HTML): replace non-standard Unicode characters for beter downstream tasks

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-07-29 11:05:35 +02:00
Christoph Auer
aed772ab33 feat: Threaded PDF pipeline (#1951)
* Initial async pdf pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* UpstreamAwareQueue

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Refactoring into async pipeline primitives and graph

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanups and safety improvements

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Better threaded PDF pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pin docling-ibm-models

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove unused args

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Revise pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Unload doc backend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Revert "Unload doc backend"

This reverts commit 01066f0b6e.

* Remove redundant method

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update threaded test

Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>

* Stop accumulating docs in test run

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix: don't starve on docs with > max_queue_size pages

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix: don't starve on docs with > max_queue_size pages

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* DCO Remediation Commit for Christoph Auer <cau@zurich.ibm.com>

I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: fa71cde950
I, Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>, hereby add my Signed-off-by to this commit: d66da87d96

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix: python3.9 compat

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Option to enable threadpool with doc_batch_concurrency setting

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Clean up unused code

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix settings defaults expectations

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Use released docling-ibm-models

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove ignores for typing/linting

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
2025-07-26 11:49:37 +02:00
Cesar Berrospi Ramis
aec29a7315 fix(markdown): ensure correct parsing of nested lists (#1995)
* fix(markdown): ensure correct parsing of nested lists

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: update dependencies in uv.lock file

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-07-25 15:17:57 +02:00
Christoph Auer
1985841a19 ci: Fixes for test GT (#1992)
Fixes for test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-25 12:28:06 +02:00
Cesar Berrospi Ramis
5132f061a8 fix(HTML): concatenation of child strings in table cells and list items (#1981)
fix(HTML): ensure correct concatenation of child strings in table cells and list items

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-07-24 11:19:25 +02:00
Rafael Teixeira de Lima
0b83609531 fix(docx): Adding plain latex equations to table cells (#1986)
* Adding plain latex equations to table cells

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Adding test files

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-07-24 11:02:24 +02:00
Copilot
8d50a59d48 fix: multi-page image support (tiff) (#1928)
* Initial plan

* Fix multi-page TIFF image support

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* add RGB conversion

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove pointless test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add multi-page TIFF test data and verification tests

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

* Revert "Add multi-page TIFF test data and verification tests"

This reverts commit 130a10e2d9.

* Proper test for 2 page tiff file

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* DCO Remediation Commit for copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 420df478f3
I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: c1d722725f
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 6aa85cc933
I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 130a10e2d9
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: d571f36299
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 2aab66288b

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Proper test for 2 page tiff file (2)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-23 09:55:40 +02:00
Cesar Berrospi Ramis
a069b1175b refactor(HTML): handle text from styled html (#1960)
* A new HTML backend that handles styled html (ignors it) as well as images.

Images are parsed as placeholders with a caption, if it exists.

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: vaaale <2428222+vaaale@users.noreply.github.com>
Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>

* tests(HTML): re-enable test_ordered_lists

Re-enable test_ordered_lists regression test for the HTML backend since
docling-core now supports ordered lists with custom start value.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>
Co-authored-by: Alexander Vaagan <2428222+vaaale@users.noreply.github.com>
2025-07-22 13:16:31 +02:00
Cesar Berrospi Ramis
e1e3053695 fix: fix HTML table parser and JATS backend bugs (#1948)
Fix a bug in parsing HTML tables in HTML backend.
Fix a bug in test file that prevented JATS backend tests.
Ensure that the JATS backend creates headings with the right level.
Remove unnecessary data files for testing JATS backend.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-07-16 10:49:24 +02:00
Christoph Auer
cc6193b3b9 test: Update tests to use default PDF backend (DPv4) (#1923)
* Update tests to use default PDF backend (DPv4)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* OCR tests use DPv1 until rotation bugs are fixed

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-10 15:16:56 +02:00
Christoph Auer
2b8616d6d5 feat: Layout model specification and multiple choices (#1910)
* Establish layout_model spec and example instantations

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updated naming

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Back to uppercase constants

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix deps issue with openai-whipser>numba>llvmlite

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pull v1 changed test GT from main

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-10 06:37:27 +02:00
Panos Vagenas
ec588df971 feat: enable precision control in float serialization (#1914)
* chore: propagate precision control in float serialization

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* parametrize float serialization, propagate core updates

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update test float precision

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* repin docling-core

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-07-09 16:39:17 +02:00
Clément Doumouro
931eb55b88 fix(ocr-utils): unit test and fix the rotate_bounding_box function (#1897)
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
2025-07-08 18:03:29 +02:00
Michele Dolfi
edd4356aac fix: use only backend for picture classifier (#1904)
use backend for picture classifier

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-07-07 16:23:16 +02:00
Christoph Auer
56a0e104f7 feat: Integrate ListItemMarkerProcessor into document assembly (#1825)
* Integrate ListItemMarkerProcessor into document assembly

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update to final version

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update all test cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Upgrade deps

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-07-01 10:04:58 +02:00
Christoph Auer
bdfee4e2d0 chore: Safer unloading of DPv4 backend (#1867)
fix: Safer unloading of DPv4 backend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-06-30 14:41:21 +02:00
Panos Vagenas
0533da1923 feat: leverage new list modeling, capture default markers (#1856)
* chore: update docling-core & regenerate test data

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update backends to leverage new list modeling

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* repin docling-core

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* ensure availability of latest docling-core API

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-06-27 16:37:15 +02:00
Michael Honaker
e79e4f0ab6 fix(markdown): make parsing of rich table cells valid (#1821)
* fix: update md table classification

Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>

* Fix ground truth header changes

Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>

* Fix merge issues

Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>

* Fix minor ground truth errors

Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>

---------

Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
2025-06-26 19:50:45 +02:00
Panos Vagenas
7c5614a37a fix(markdown): fix single-formatted headings & list items (#1820)
* fix(markdown): fix formatting & inline edge cases (show behavior before change)

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* add change and updated test data

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update lock

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* improve test case

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-06-25 13:05:06 +02:00
Peter W. J. Staar
1557e7ce3e feat: Support audio input (#1763)
* scaffolding in place

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* doing scaffolding for audio pipeline

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* WIP: got first transcription working

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* all working, time to start cleaning up

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* first working ASR pipeline

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added openai-whisper as a first transcription model

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updating with asr_options

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* finalised the first working ASR pipeline with Whisper

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* use whisper from the latest git commit

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Update docling/datamodel/pipeline_options.py

Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com>

* Update docling/datamodel/pipeline_options.py

Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com>

* updated comment

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* AudioBackend -> DummyBackend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* file rename

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Rename to NoOpBackend, add test for ASR pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Support every format in NoOpBackend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add missing audio file and test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Install ffmpeg system dependency for ASR test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-06-23 14:47:26 +02:00
Cesar Berrospi Ramis
d26dac61a8 fix(docx): ensure list items have a list parent (#1827)
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-06-20 14:47:25 +02:00
mkrssg
1350a8d3e5 fix(msword_backend): Identify text in the same line after an image #1425 (#1610)
* fix(msword_backend): Identify text in the same line after an image / image anchor #1425

Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>

* test: add test file and case for fix(msword_backend): Identify text in the same line after an image / image anchor #1425

Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>

* test: added groundtruth test files for fix(msword_backend): Identify text in the same line after an image / image anchor #1425

Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>

* fix: extraneous empty paragraphs for test files

Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>

---------

Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
Co-authored-by: Michael Krissgau <michael.krissgau@ibm.com>
2025-06-20 10:55:30 +02:00
Panos Vagenas
861abcdcb0 feat(markdown): add formatting & improve inline support (#1804)
feat(markdown): support formatting & hyperlinks

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-06-18 15:57:57 +02:00
Mahafuzur Rahman
dbab30e92c fix: formula conversion with page_range param set (#1791)
When page_range param is used for formula conversion,
the system throws list index out of range error.

Included tests to validate that the fix works.

Signed-off-by: Masum <masumsofts@yahoo.com>
2025-06-17 13:58:45 +02:00
Martin Wind
f28d23cf03 fix: pptx line break and space handling (#1664)
Signed-off-by: Martin Wind <martin.wind@im-c.at>
2025-06-16 10:44:30 +02:00
Cesar Berrospi Ramis
b886e4df31 fix(asciidoc): set default size when missing in image directive (#1769)
The AsciiDoc backend should not create an ImageRef with Size equal to None, instead use default size values.
Refactor static methods as such and add the staticmethod decorator.
Extend the regression test for this fix.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-06-16 10:38:46 +02:00
Christoph Auer
7d3302cb48 feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)
* Keep page.parsed_page.textline_cells and page.cells in sync, including OCR

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make page.parsed_page the only source of truth for text cells

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Small fix

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Correctly compute PDF boxes from pymupdf

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Use different OCR engine order

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add type hints and fix mypy

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* One more test fix

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove with pypdfium2_lock from caller sites

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix typing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-06-13 19:01:55 +02:00
Ayraf
df140227c3 feat: support xlsm files (#1520)
* code for xlsm support

* updated support for xlsm

* updated code for xlsm support

* Update docling_parse_v4_backend.py

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>

* Update docling_parse_v4_backend.py

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>

* Update test_backend_msexcel_xlsm.py

 updated the tests/test_backend_msexcel_xlsm.py:

 have a function starting with test
removed all print statements
** To add an explicit assert {test}=={pred}

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>

* Update base_models.py

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>

* Update test_backend_msexcel.py

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>

* Update test_backend_msexcel_xlsm.py

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>

* Update document_converter.py

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>

* Delete tests/test_backend_msexcel_xlsm.py

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>

* xlsm file

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>

* run tests

* ran tests

* Fix tests, upgrade XSLM example to a valid file

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-06-10 16:55:59 +02:00
Peter W. J. Staar
6613b9e98b fix: prov for merged-elems (#1728)
* fix: prov for merged-elems

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Reset pyproject.toml

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-06-10 11:22:42 +02:00
Maras Ioannis
e979750ce9 fix(tesseract): initialize df_osd to avoid uninitialized variable error (#1718)
* fix: initialize df_osd to avoid uninitialized variable error

Signed-off-by: IoannisMaras <maras2002@gmail.com>

* Fix formatting

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Satisfy mypy, regenerate OCR tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: IoannisMaras <maras2002@gmail.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-06-10 10:57:45 +02:00
Panos Vagenas
61d0d6c755 test: mark flaky test (#1698)
* test: cleanse Word test file

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* mark textbox file test as flaky

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* fix path usage

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-06-03 13:13:44 +02:00
Peter W. J. Staar
cfdf4cea25 feat: new vlm-models support (#1570)
* feat: adding new vlm-models support

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the transformers

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* got microsoft/Phi-4-multimodal-instruct to work

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* working on vlm's

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* refactoring the VLM part

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* all working, now serious refacgtoring necessary

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* refactoring the download_model

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the formulate_prompt

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* pixtral 12b runs via MLX and native transformers

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the VlmPredictionToken

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* refactoring minimal_vlm_pipeline

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the MyPy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added pipeline_model_specializations file

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* need to get Phi4 working again ...

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* finalising last points for vlms support

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the pipeline for Phi4

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* streamlining all code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixing the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the html backend to the VLM pipeline

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the static load_from_doctags

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* restore stable imports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use AutoModelForVision2Seq for Pixtral and review example (including rename)

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove unused value

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* refactor instances of VLM models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* skip compare example in CI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use lowercase and uppercase only

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add new minimal_vlm example and refactor pipeline_options_vlm_model for cleaner import

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename pipeline_vlm_model_spec

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* move more argument to options and simplify model init

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add supported_devices

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove not-needed function

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* exclude minimal_vlm

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* missing file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add message for transformers version

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename to specs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use module import and remove MLX from non-darwin

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove hf_vlm_model and add extra_generation_args

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use single HF VLM model class

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove torch type

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add docs for vision models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-02 17:01:06 +02:00
Panos Vagenas
1c8a1283c4 test: ensure utf-8 in test data utils (#1691)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-06-02 12:13:19 +02:00
Cesar Berrospi Ramis
984cb137f6 fix: guess HTML content starting with script tag (#1673)
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-06-02 08:43:24 +02:00
Cesar Berrospi Ramis
3942923125 chore: fix or ignore runtime and deprecation warnings (#1660)
* chore: fix or catch deprecation warnings

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: update poetry lock with latest docling-core

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-05-28 17:55:31 +02:00
Cesar Berrospi Ramis
106951e71e test: add missing ground truth files (#1667)
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-05-28 13:26:49 +02:00
Clément Doumouro
45265bf8b1 feat(ocr): auto-detect rotated pages in Tesseract (#1167)
* fix(ocr): tesseract support mis-oriented documents

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): update missing test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): rotate image to the natural orientation before layout prediction

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): move bounding bow rotation util to orientation.py

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): refactor rotation utilities

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): avoid to swallow tesseract errors causing orientation detection failures

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`

* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`

* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation

* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`

---------

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
2025-05-21 18:12:33 +02:00
Christoph Auer
90875247e5 feat: Establish confidence estimation for document and pages (#1313)
* Establish confidence field, propagate layout confidence through

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add OCR confidence and parse confidence (stub)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add parse quality rules, use 5% percentile for overall and parse scores

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Heuristic updates

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix garbage regex

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Move grade to page

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Introduce mean_score and low_score, consistent aggregate computations

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add confidence test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-05-21 12:32:49 +02:00