Michele Dolfi
0cb7520648
restore stable imports
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 09:06:41 +02:00
Peter Staar
a4e6777bb3
fixed the merge conflicts
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-23 16:30:18 +02:00
github-actions[bot]
2579d89510
chore: bump version to 2.34.0 [skip ci]
2025-05-22 18:44:45 +00:00
Said Gürbüz
c2f595d283
fix: fix ZeroDivisionError for cell_bbox.area() ( #1636 )
...
fix ZeroDivisionError for cell_bbox.area()
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
2025-05-22 13:43:33 +02:00
Clément Doumouro
45265bf8b1
feat(ocr): auto-detect rotated pages in Tesseract ( #1167 )
...
* fix(ocr): tesseract support mis-oriented documents
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): update missing test data
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): rotate image to the natural orientation before layout prediction
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): move bounding box rotation util to orientation.py
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): refactor rotation utilities
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* chore(ocr): revert layout updates
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* chore(ocr): update e2e OCR test data
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): avoid swallowing tesseract errors causing orientation detection failures
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* chore(ocr): revert layout updates
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* chore(ocr): update e2e OCR test data
* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`
* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`
* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`
* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation
* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`
---------
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
2025-05-21 18:12:33 +02:00
Christoph Auer
90875247e5
feat: Establish confidence estimation for document and pages ( #1313 )
...
* Establish confidence field, propagate layout confidence through
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add OCR confidence and parse confidence (stub)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add parse quality rules, use 5% percentile for overall and parse scores
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Heuristic updates
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix garbage regex
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Move grade to page
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Introduce mean_score and low_score, consistent aggregate computations
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add confidence test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-05-21 12:32:49 +02:00
Václav Vančura
14d4f5b109
fix(integration): update the Apify Actor integration ( #1619 )
...
* fix(actor): remove references to missing docling_processor.py
Signed-off-by: Václav Vančura <commit@vancura.dev>
* chore(actor): update Actor README.md with recent repo URL changes
Signed-off-by: Václav Vančura <commit@vancura.dev>
* chore(actor): improve the Actor README.md local header link
Signed-off-by: Václav Vančura <commit@vancura.dev>
* chore(actor): bump the Actor version number
Signed-off-by: Václav Vančura <commit@vancura.dev>
* Update .actor/actor.json
Co-authored-by: Marek Trunkát <marek@trunkat.eu>
Signed-off-by: Jan Čurn <jan.curn@gmail.com>
---------
Signed-off-by: Václav Vančura <commit@vancura.dev>
Signed-off-by: Jan Čurn <jan.curn@gmail.com>
Co-authored-by: Jan Čurn <jan.curn@gmail.com>
Co-authored-by: Marek Trunkát <marek@trunkat.eu>
2025-05-21 02:47:55 +02:00
github-actions[bot]
84d0889829
chore: bump version to 2.33.0 [skip ci]
2025-05-20 19:54:51 +00:00
MoheyElDin Badr
f4d9d4111b
fix: Fix issue with detecting docx files, and files with upper case extensions ( #1609 )
...
fix detecting files with uppercase extensions
Signed-off-by: MoheyElDin Badr <moheyeldin.badr@gmail.com>
2025-05-20 19:42:37 +02:00
Said Gürbüz
0e00a263fa
fix: load_from_doctags static usage ( #1617 )
...
* fix load_from_doctags usage
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* update dependencies
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* fix lock file
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* revert lock file
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* update lock file
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
---------
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
2025-05-20 15:06:12 +02:00
Krishnan
f2e9c0784c
fix: incorrect force_backend_text behaviour for VLM DocTag pipelines ( #1371 )
...
* Fix force_backend_text
Signed-off-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local>
* empty commit to retrigger CI
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-20 09:59:38 +02:00
Pedro Ribeiro
98b5eeb844
fix(pypdfium): resolve overlapping text when merging bounding boxes ( #1549 )
...
get merged_text from boundingbox instead of merging it to prevent overlaps
Signed-off-by: Pedro Ribeiro <pedro_ribeiro_93@hotmail.com>
2025-05-19 15:26:00 +02:00
AndrewTsai0406
12a0e64892
feat: add textbox content extraction in msword_backend ( #1538 )
...
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com>
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com>
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com>
---------
Signed-off-by: Andrew <tsai247365@gmail.com>
2025-05-19 15:01:36 +02:00
Peter Staar
374ecd4890
fixed the static load_from_doctags
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-18 10:58:39 +02:00
Peter Staar
1ada7bfee7
added the html backend to the VLM pipeline
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-18 10:55:27 +02:00
Peter Staar
e93cc3ce09
fixing the tests
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-18 07:38:06 +02:00
Peter Staar
0c7c7c11c2
reformatted the code
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-16 16:31:11 +02:00
Peter Staar
d5b6c871cf
streamlining all code
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-16 16:27:27 +02:00
Peter Staar
661f7c9780
fixed the pipeline for Phi4
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-16 15:55:49 +02:00
Peter Staar
d41b856961
finalising last points for vlms support
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-16 12:39:26 +02:00
Panos Vagenas
7c4c356e76
chore: fix chunking example data link ( #1596 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-16 08:44:47 +02:00
Peter Staar
fc61258273
merged with main
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-15 07:46:06 +02:00
Peter Staar
e2c95d09bc
need to get Phi4 working again ...
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-15 07:32:55 +02:00
Peter Staar
15a8f328c2
added pipeline_model_specializations file
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-15 05:27:16 +02:00
Peter Staar
7c67d2b2fe
fixed the MyPy
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-14 17:51:43 +02:00
github-actions[bot]
aeb0716bbb
chore: bump version to 2.32.0 [skip ci]
2025-05-14 14:28:21 +00:00
Vinay R Damodaran
3a04f2a367
feat: Improve parallelization for remote services API calls ( #1548 )
...
* Provide the option to make remote services call concurrent
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
* Use yield from correctly?
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
* not do amateur hour stuff
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
---------
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
2025-05-14 15:47:55 +02:00
jimkarag02
9f8b479f17
fix(ocr): orig field in TesseractOcrCliModel as str ( #1553 )
...
fix: ensure orig and text are both strings in TesseractOcrCliModel
Signed-off-by: Dimitris Karagatslis <dimo9.dk@gmail.com>
2025-05-14 15:05:52 +02:00
Panos Vagenas
9f28abf061
docs: add advanced chunking & serialization example ( #1589 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-14 14:35:07 +02:00
Peter Staar
a3716b1961
refactoring minimal_vlm_pipeline
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-14 13:57:32 +02:00
Alex Sokolov
2efb7a7c06
fix(settings): fix nested settings load via environment variables ( #1551 )
...
Signed-off-by: Alexander Sokolov <alsokoloff@gmail.com>
2025-05-14 13:42:10 +02:00
Peter Staar
7c97b494ec
added the VlmPredictionToken
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-14 12:23:46 +02:00
Elwin
12dab0a1e8
feat: support image/webp file type ( #1415 )
...
* support image/webp file type
Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com>
Signed-off-by: Elwin <hzywong@gmail.com>
* docs: add webp image format in supported_formats.md
Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com>
Signed-off-by: Elwin <hzywong@gmail.com>
* test: add a test case for `image/webp` file
Signed-off-by: Elwin <hzywong@gmail.com>
* style: apply styling
Signed-off-by: Elwin <hzywong@gmail.com>
* test: update test case of converting `image/webp` file with more ocr engines
Signed-off-by: Elwin <hzywong@gmail.com>
* style: apply styling
Signed-off-by: Elwin <hzywong@gmail.com>
* rename test file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com>
Signed-off-by: Elwin <hzywong@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-05-14 09:47:28 +02:00
Peter Staar
f159075b67
pixtral 12b runs via MLX and native transformers
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-14 07:39:20 +02:00
Peter Staar
054e01d8b3
added the formulate_prompt
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-14 06:26:16 +02:00
Peter Staar
4c0bc61e54
refactoring the download_model
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-14 05:31:54 +02:00
Peter Staar
3407955a47
all working, now serious refactoring necessary
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-13 18:23:55 +02:00
github-actions[bot]
23238c241f
chore: bump version to 2.31.2 [skip ci]
2025-05-13 10:09:19 +00:00
Marco Fargetta
4046d0b2f3
fix: AsciiDoc header identification ( #1562 ) ( #1563 )
...
Fix the regular expression that identifies header lines in AsciiDoc to avoid
matching defined blocks.
Signed-off-by: Marco Fargetta <mfargett@redhat.com>
2025-05-13 11:17:26 +02:00
Michele Dolfi
8baa85a49d
fix: restrict click version and update lock file ( #1582 )
...
* fix click dependency and update lock file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-05-13 10:40:08 +02:00
Peter Staar
96862bd326
refactoring the VLM part
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-13 10:01:37 +02:00
Peter Staar
ee01e3cff0
Merge branch 'main' into dev/add-other-vlm-models
2025-05-13 06:08:26 +02:00
Peter Staar
7fbe021359
working on VLMs
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-13 06:07:11 +02:00
Peter Staar
77eb21b235
got microsoft/Phi-4-multimodal-instruct to work
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-12 13:37:03 +02:00
Peter Staar
68747e3cad
fixed the transformers
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-12 13:08:33 +02:00
github-actions[bot]
0d0fa6cbe3
chore: bump version to 2.31.1 [skip ci]
2025-05-12 09:44:26 +00:00
Michele Dolfi
127e38646f
fix: add smoldocling in download utils ( #1577 )
...
add smoldocling in download utils
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-05-12 10:48:07 +02:00
Peter Staar
bd2d01f0ac
Merge branch 'main' into dev/add-other-vlm-models
2025-05-12 08:52:52 +02:00
Oleg Lavrovsky
844babb390
docs: update links in data_prep_kit ( #1559 )
...
Update data_prep_kit.md
The links were broken, since the repository was renamed. I also noticed that PDF2Parquet is now referred to as Docling2Parquet.
Signed-off-by: Oleg Lavrovsky <31819+loleg@users.noreply.github.com>
2025-05-11 20:38:25 +02:00
Peter Staar
18e1ec4df2
feat: adding new vlm-models support
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-11 09:30:10 +02:00