Michele Dolfi
1324eb75fc
add modified test results
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-09-10 08:43:29 +02:00
Peter W. J. Staar
b3d7542061
feat: updated the backend for new docling-parse ( #2187 )
...
* updated the backend and pyproject.toml
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the version and test files
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* forgot to add 1 updated test-file
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
2025-09-05 10:42:31 +02:00
Nikos Livathinos
e38aa0f7f2
feat: Heron layout model as new default ( #1971 )
...
* feat: Switch default layout model to DOCLING_LAYOUT_HERON. Update the unit test data.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
* Use default layout model in model_downloader default args
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Use default layout model in model_downloader default args
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update docling-models tag for TableFormer
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update test GT (from linux CPU)
Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal >
* fix: Ensure that the visualisations happen on copies of the page image
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
* chore: Pinpoint docling-ibm-models to the fix branch for the ReadingOrderPredictor
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
* chore: Update uv.lock
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
* chore: Update tests GT to match the Heron layout model and the improved reading order model in Linux
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
* fix: Introduce the verify_doctags optional parameter in conversion tests to control if a doctags
comparison should take place. Skip doctags comparisons for certain tests.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
* chore: Generate tests GT on Mac
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
* chore: Remove the pinning of the docling-ibm-models and use the release 3.9.1
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal >
2025-09-03 12:45:22 +02:00
Qiefan Jiang
a283ccff25
feat(msexcel): set ContentLayer.INVISIBLE for invisible sheet ( #1876 )
...
* feat(msexcel): ignore invisible sheet
* DCO Remediation Commit for Qiefan Jiang <jiangqiefan@bytedance.com >
I, Qiefan Jiang <jiangqiefan@bytedance.com >, hereby add my Signed-off-by to this commit: ca391f4908f44f301de54a97057f0b809f5ce66c
Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com >
* retain invisible sheet with ContentLayer.INVISIBLE
Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com >
* update UT
Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com >
* fix: use Optional for python3.9
Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com >
* DCO Remediation Commit for Qiefan Jiang <jiangqiefan@bytedance.com >
I, Qiefan Jiang <jiangqiefan@bytedance.com >, hereby add my Signed-off-by to this commit: a34371a90e
Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com >
---------
Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com >
2025-09-01 13:53:45 +02:00
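Note on the `Optional` fix in the commit above: on Python 3.9 the `X | None` union syntax raises a `TypeError` whenever the annotation is evaluated at runtime, while `typing.Optional[X]` works. A minimal sketch of the pattern, assuming a hypothetical helper `sheet_layer` and a hypothetical `state` parameter (neither is taken from the actual change):

```python
from typing import Optional


def sheet_layer(state: Optional[str] = None) -> str:
    """Map a worksheet visibility state to a layer label (illustrative only)."""
    # A missing state is treated as visible; any other value counts as hidden.
    if state is None or state == "visible":
        return "body"
    return "invisible"
```

Writing the annotation as `Optional[str]` keeps the signature importable on Python 3.9, which is presumably what the fix was after.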
Panos Vagenas
be26044f14
chore: update docling-core lock ( #2169 )
...
* chore: upgrade docling-core
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* upgrade lock
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-09-01 13:46:10 +02:00
Cesar Berrospi Ramis
fa3327e1a6
fix(html): preserve code blocks in list items ( #2131 )
...
* chore(html): refactor parser to leverage context managers
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com >
* fix(html): parse inline code snippets, also from list items
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com >
* chore(html): remove hidden tags
Remove tags that are not meant to be displayed.
Add regression tests for code blocks, inline code, and hidden tags.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com >
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com >
2025-08-26 06:43:48 +02:00
Michele Dolfi
449bde0a6c
test: update docx reference results ( #2122 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-08-22 14:26:36 +02:00
Nikhil Verma
3f03709885
fix: Improve numbered list detection for msword docs ( #2100 )
...
* Improve numbered list detection for msword docs
This fixes the list detection in MSWord docs by properly tracking and counting
the list entries. It fixes
https://github.com/docling-project/docling/issues/2090
* DCO Remediation Commit for Nikhil Verma <nikhilgotmail@gmail.com >
I, Nikhil Verma <nikhilgotmail@gmail.com >, hereby add my Signed-off-by to this commit: 509da6658e
Signed-off-by: Nikhil Verma <nikhilgotmail@gmail.com >
---------
Signed-off-by: Nikhil Verma <nikhilgotmail@gmail.com >
2025-08-22 10:38:34 +02:00
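The idea of tracking and counting list entries, described in the commit above, can be sketched as follows. This is an illustration under assumptions, not the backend's code: `ListCounter`, `num_id`, and `level` are hypothetical names, and restarting deeper levels when a shallower item appears is one plausible convention.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple


class ListCounter:
    """Keep one running counter per (list id, nesting level)."""

    def __init__(self) -> None:
        self._counts: DefaultDict[Tuple[int, int], int] = defaultdict(int)

    def next_number(self, num_id: int, level: int) -> int:
        # An item at `level` restarts every deeper level of the same list.
        for key in list(self._counts):
            if key[0] == num_id and key[1] > level:
                del self._counts[key]
        self._counts[(num_id, level)] += 1
        return self._counts[(num_id, level)]
```

Calling `next_number(1, 0)` twice yields 1 and then 2. A nested item `(1, 1)` starts at 1, and it starts at 1 again after the next top-level item.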
krrome
94fcc46aa9
feat(html): Support formatting tags in HTML texts ( #2111 )
...
* add parsing for formatting tags in HTML backend
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch >
fix latest tests + wiki_duck result files.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch >
* convert _collect_parent_format_tags to staticmethod
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch >
---------
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch >
2025-08-22 10:37:34 +02:00
Panos Vagenas
76d2cb76b3
chore: update docling-core lock ( #2110 )
...
* chore: pre-check docling-core 2.45.0
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* update -core pinning
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-08-20 16:41:48 +02:00
Cesar Berrospi Ramis
c5f2e2fdd6
fix(HTML): parse footer tag as a group in furniture content layer ( #2106 )
...
* fix(HTML): parse footer tag as a section in furniture
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com >
* fix(HTML): add test for body vs furniture in HTML parser.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com >
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com >
2025-08-20 08:42:25 +02:00
Michele Dolfi
956f82f115
chore: upgrade dependencies in lock file ( #2093 )
...
* chore: upgrade lock file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* fix(markdown): update binary hash of a markdown backend ground truth file
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com >
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com >
2025-08-19 10:11:44 +02:00
Michele Dolfi
31087f3fcc
feat: add backend for METS with Google Books profile ( #1989 )
...
* add backend for METS with Google Books profile
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* Fixes for cell indexing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* use HTMLParser and add options from CLI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* fix typing and unloading
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* restore guess format
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* rename inputformat
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* use PdfDocumentBackend
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* use test file from test folder (still missing)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add test file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-08-18 11:43:20 +02:00
krrome
9687297262
feat(html): Support in-line anchor tags in HTML texts ( #1659 )
...
* re-implement links for html backend.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch >
* fix inline groups in list items. write specific test for find_parent_annotation of _extract_text_and_hyperlink_recursively.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch >
* implement hack for images.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch >
---------
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch >
2025-08-18 09:57:16 +02:00
Cesar Berrospi Ramis
86f70128aa
fix(HTML): replace non-standard Unicode characters ( #2006 )
...
chore(HTML): replace non-standard Unicode characters for better downstream tasks
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-07-29 11:05:35 +02:00
Cesar Berrospi Ramis
aec29a7315
fix(markdown): ensure correct parsing of nested lists ( #1995 )
...
* fix(markdown): ensure correct parsing of nested lists
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* chore: update dependencies in uv.lock file
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-07-25 15:17:57 +02:00
Christoph Auer
1985841a19
ci: Fixes for test GT ( #1992 )
...
Fixes for test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-07-25 12:28:06 +02:00
Cesar Berrospi Ramis
5132f061a8
fix(HTML): concatenation of child strings in table cells and list items ( #1981 )
...
fix(HTML): ensure correct concatenation of child strings in table cells and list items
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-07-24 11:19:25 +02:00
Rafael Teixeira de Lima
0b83609531
fix(docx): Adding plain latex equations to table cells ( #1986 )
...
* Adding plain latex equations to table cells
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Adding test files
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
2025-07-24 11:02:24 +02:00
Copilot
8d50a59d48
fix: multi-page image support (tiff) ( #1928 )
...
* Initial plan
* Fix multi-page TIFF image support
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com >
* add RGB conversion
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Remove pointless test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add multi-page TIFF test data and verification tests
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com >
* Revert "Add multi-page TIFF test data and verification tests"
This reverts commit 130a10e2d9.
* Proper test for 2 page tiff file
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* DCO Remediation Commit for copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >, hereby add my Signed-off-by to this commit: 420df478f3
I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >, hereby add my Signed-off-by to this commit: c1d722725f
I, Christoph Auer <cau@zurich.ibm.com >, hereby add my Signed-off-by to this commit: 6aa85cc933
I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >, hereby add my Signed-off-by to this commit: 130a10e2d9
I, Christoph Auer <cau@zurich.ibm.com >, hereby add my Signed-off-by to this commit: d571f36299
I, Christoph Auer <cau@zurich.ibm.com >, hereby add my Signed-off-by to this commit: 2aab66288b
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Proper test for 2 page tiff file (2)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-07-23 09:55:40 +02:00
Cesar Berrospi Ramis
a069b1175b
refactor(HTML): handle text from styled html ( #1960 )
...
* A new HTML backend that handles styled HTML (ignores it) as well as images.
Images are parsed as placeholders with a caption, if it exists.
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Co-authored-by: vaaale <2428222+vaaale@users.noreply.github.com >
Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com >
* tests(HTML): re-enable test_ordered_lists
Re-enable test_ordered_lists regression test for the HTML backend since
docling-core now supports ordered lists with custom start value.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com >
Co-authored-by: Alexander Vaagan <2428222+vaaale@users.noreply.github.com >
2025-07-22 13:16:31 +02:00
Cesar Berrospi Ramis
e1e3053695
fix: fix HTML table parser and JATS backend bugs ( #1948 )
...
Fix a bug in parsing HTML tables in HTML backend.
Fix a bug in test file that prevented JATS backend tests.
Ensure that the JATS backend creates headings with the right level.
Remove unnecessary data files for testing JATS backend.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-07-16 10:49:24 +02:00
Christoph Auer
cc6193b3b9
test: Update tests to use default PDF backend (DPv4) ( #1923 )
...
* Update tests to use default PDF backend (DPv4)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* OCR tests use DPv1 until rotation bugs are fixed
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-07-10 15:16:56 +02:00
Christoph Auer
2b8616d6d5
feat: Layout model specification and multiple choices ( #1910 )
...
* Establish layout_model spec and example instantiations
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Updated naming
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Back to uppercase constants
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* fix deps issue with openai-whisper>numba>llvmlite
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Pull v1 changed test GT from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-07-10 06:37:27 +02:00
Panos Vagenas
ec588df971
feat: enable precision control in float serialization ( #1914 )
...
* chore: propagate precision control in float serialization
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* parametrize float serialization, propagate core updates
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* update test float precision
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* repin docling-core
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-07-09 16:39:17 +02:00
Christoph Auer
56a0e104f7
feat: Integrate ListItemMarkerProcessor into document assembly ( #1825 )
...
* Integrate ListItemMarkerProcessor into document assembly
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update to final version
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update all test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Upgrade deps
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-07-01 10:04:58 +02:00
Panos Vagenas
0533da1923
feat: leverage new list modeling, capture default markers ( #1856 )
...
* chore: update docling-core & regenerate test data
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* update backends to leverage new list modeling
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* repin docling-core
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* ensure availability of latest docling-core API
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-06-27 16:37:15 +02:00
Michael Honaker
e79e4f0ab6
fix(markdown): make parsing of rich table cells valid ( #1821 )
...
* fix: update md table classification
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com >
* Fix ground truth header changes
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com >
* Fix merge issues
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com >
* Fix minor ground truth errors
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com >
---------
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com >
2025-06-26 19:50:45 +02:00
Panos Vagenas
7c5614a37a
fix(markdown): fix single-formatted headings & list items ( #1820 )
...
* fix(markdown): fix formatting & inline edge cases (show behavior before change)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* add change and updated test data
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* update lock
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* improve test case
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-06-25 13:05:06 +02:00
Peter W. J. Staar
1557e7ce3e
feat: Support audio input ( #1763 )
...
* scaffolding in place
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* doing scaffolding for audio pipeline
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* WIP: got first transcription working
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* all working, time to start cleaning up
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* first working ASR pipeline
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added openai-whisper as a first transcription model
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updating with asr_options
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* finalised the first working ASR pipeline with Whisper
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* use whisper from the latest git commit
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* Update docling/datamodel/pipeline_options.py
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com >
* Update docling/datamodel/pipeline_options.py
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com >
* updated comment
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* AudioBackend -> DummyBackend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* file rename
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Rename to NoOpBackend, add test for ASR pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Support every format in NoOpBackend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add missing audio file and test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Install ffmpeg system dependency for ASR test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-23 14:47:26 +02:00
Cesar Berrospi Ramis
d26dac61a8
fix(docx): ensure list items have a list parent ( #1827 )
...
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-06-20 14:47:25 +02:00
mkrssg
1350a8d3e5
fix(msword_backend): Identify text in the same line after an image #1425 ( #1610 )
...
* fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com >
* test: add test file and case for fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com >
* test: added groundtruth test files for fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com >
* fix: extraneous empty paragraphs for test files
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com >
---------
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com >
Co-authored-by: Michael Krissgau <michael.krissgau@ibm.com >
2025-06-20 10:55:30 +02:00
Panos Vagenas
861abcdcb0
feat(markdown): add formatting & improve inline support ( #1804 )
...
feat(markdown): support formatting & hyperlinks
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-06-18 15:57:57 +02:00
Martin Wind
f28d23cf03
fix: pptx line break and space handling ( #1664 )
...
Signed-off-by: Martin Wind <martin.wind@im-c.at >
2025-06-16 10:44:30 +02:00
Cesar Berrospi Ramis
b886e4df31
fix(asciidoc): set default size when missing in image directive ( #1769 )
...
The AsciiDoc backend should not create an ImageRef with Size equal to None, instead use default size values.
Refactor static methods as such and add the staticmethod decorator.
Extend the regression test for this fix.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-06-16 10:38:46 +02:00
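The default-size behavior described in the commit above can be sketched like this. It is a simplified illustration under assumptions: `Size`, `DEFAULT_SIZE`, and `parse_image_size` are hypothetical names, the fallback values are placeholders (the commit message does not state the real defaults), and only the positional `image::target[alt,width,height]` form is handled.

```python
import re
from typing import NamedTuple


class Size(NamedTuple):
    width: float
    height: float


# Hypothetical fallback; the actual default values are not given in the commit message.
DEFAULT_SIZE = Size(width=0.0, height=0.0)

_IMAGE_RE = re.compile(r"^image::(?P<target>[^\[]+)\[(?P<attrs>[^\]]*)\]$")


def parse_image_size(line: str) -> Size:
    """Return the size from an image directive, or DEFAULT_SIZE when it is missing."""
    match = _IMAGE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an image directive: {line!r}")
    parts = [p.strip() for p in match.group("attrs").split(",")]
    # Positional attributes: alt text, then width, then height.
    if len(parts) >= 3 and parts[1].isdigit() and parts[2].isdigit():
        return Size(float(parts[1]), float(parts[2]))
    return DEFAULT_SIZE
```

Returning a concrete `Size` in every case means callers never have to handle a `None` size.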
Christoph Auer
7d3302cb48
feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it ( #1745 )
...
* Keep page.parsed_page.textline_cells and page.cells in sync, including OCR
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Make page.parsed_page the only source of truth for text cells
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Small fix
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Correctly compute PDF boxes from pymupdf
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Use different OCR engine order
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add type hints and fix mypy
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* One more test fix
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Remove with pypdfium2_lock from caller sites
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fix typing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-13 19:01:55 +02:00
Ayraf
df140227c3
feat: support xlsm files ( #1520 )
...
* code for xlsm support
* updated support for xlsm
* updated code for xlsm support
* Update docling_parse_v4_backend.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update docling_parse_v4_backend.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update test_backend_msexcel_xlsm.py
updated the tests/test_backend_msexcel_xlsm.py:
have a function starting with test
removed all print statements
** To add an explicit assert {test}=={pred}
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update base_models.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update test_backend_msexcel.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update test_backend_msexcel_xlsm.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update document_converter.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Delete tests/test_backend_msexcel_xlsm.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* xlsm file
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* run tests
* ran tests
* Fix tests, upgrade XLSM example to a valid file
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-10 16:55:59 +02:00
Peter W. J. Staar
6613b9e98b
fix: prov for merged-elems ( #1728 )
...
* fix: prov for merged-elems
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* Reset pyproject.toml
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fix tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-10 11:22:42 +02:00
Panos Vagenas
61d0d6c755
test: mark flaky test ( #1698 )
...
* test: cleanse Word test file
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* mark textbox file test as flaky
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* fix path usage
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-06-03 13:13:44 +02:00
Cesar Berrospi Ramis
106951e71e
test: add missing ground truth files ( #1667 )
...
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-05-28 13:26:49 +02:00
Clément Doumouro
45265bf8b1
feat(ocr): auto-detect rotated pages in Tesseract ( #1167 )
...
* fix(ocr): tesseract support mis-oriented documents
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): update missing test data
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): rotate image to the natural orientation before layout prediction
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): move bounding box rotation util to orientation.py
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): refactor rotation utilities
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* chore(ocr): revert layout updates
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* chore(ocr): update e2e OCR test data
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): avoid swallowing tesseract errors causing orientation detection failures
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* chore(ocr): revert layout updates
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* chore(ocr): update e2e OCR test data
* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`
* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`
* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`
* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation
* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`
---------
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
2025-05-21 18:12:33 +02:00
Pedro Ribeiro
98b5eeb844
fix(pypdfium): resolve overlapping text when merging bounding boxes ( #1549 )
...
get merged_text from boundingbox instead of merging it to prevent overlaps
Signed-off-by: Pedro Ribeiro <pedro_ribeiro_93@hotmail.com >
2025-05-19 15:26:00 +02:00
AndrewTsai0406
12a0e64892
feat: add textbox content extraction in msword_backend ( #1538 )
...
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com >
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com >
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com >
---------
Signed-off-by: Andrew <tsai247365@gmail.com >
2025-05-19 15:01:36 +02:00
Elwin
12dab0a1e8
feat: support image/webp file type ( #1415 )
...
* support image/webp file type
Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com >
Signed-off-by: Elwin <hzywong@gmail.com >
* docs: add webp image format in supported_formats.md
Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com >
Signed-off-by: Elwin <hzywong@gmail.com >
* test: add a test case for `image/webp` file
Signed-off-by: Elwin <hzywong@gmail.com >
* style: apply styling
Signed-off-by: Elwin <hzywong@gmail.com >
* test: update test case of converting `image/webp` file with more ocr engines
Signed-off-by: Elwin <hzywong@gmail.com >
* style: apply styling
Signed-off-by: Elwin <hzywong@gmail.com >
* rename test file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com >
Signed-off-by: Elwin <hzywong@gmail.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-05-14 09:47:28 +02:00
Michele Dolfi
8baa85a49d
fix: restrict click version and update lock file ( #1582 )
...
* fix click dependency and update lock file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-05-13 10:40:08 +02:00
Cesar Berrospi Ramis
776e7ecf9a
fix(HTML): handle row spans in header rows ( #1536 )
...
* chore(HTML): log the stacktrace of errors
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* fix(HTML): handle row headers like in pivot tables
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-05-09 15:14:32 +02:00
Panos Vagenas
de56523974
chore: format JSON test files to enable comparison ( #1511 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-05-02 10:52:18 +02:00
Cesar Berrospi Ramis
ed20124544
fix(html): handle address, details, and summary tags ( #1436 )
...
* fix(html): handle 'address' tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* fix(html): handle 'details' tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-23 09:30:59 +02:00
Panos Vagenas
550b1ca2f8
chore: propagate docling-core fix ( #1389 )
...
* chore: propagate docling-core fix
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* update lock to latest docling-core release
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-04-15 10:51:47 +02:00
Peter W. J. Staar
c0ba88edf1
feat(cli): add option for html with split-page mode ( #1355 )
...
* updated the cli to output html in split-page mode
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* add pin for new docling-core with html split argument
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* relock with fixed html export in docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update test results
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update more tests
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update lock with docling-core fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update test results
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add again chunking extras
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-14 08:41:50 +02:00