Michele Dolfi
738385004a
Merge remote-tracking branch 'origin/main' into dev/add-other-vlm-models
2025-06-02 14:08:23 +02:00
Michele Dolfi
ea5719c39d
use single HF VLM model class
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-02 13:25:51 +02:00
Michele Dolfi
8006683007
remove hf_vlm_model and add extra_generation_args
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-02 12:58:32 +02:00
Edgar Hipp
11ca4f7a7b
docs: fix typo in index.md ( #1676 )
...
Signed-off-by: Edgar Hipp <hipp.edg@gmail.com>
2025-06-02 12:35:59 +02:00
Panos Vagenas
1c8a1283c4
test: ensure utf-8 in test data utils ( #1691 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-06-02 12:13:19 +02:00
Michele Dolfi
c0847c97a7
use module import and remove MLX from non-darwin
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-02 10:45:46 +02:00
Michele Dolfi
b9c1698263
rename to specs
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-02 10:40:06 +02:00
Michele Dolfi
76718cb1f9
add message for transformers version
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-02 09:55:15 +02:00
Michele Dolfi
3ba698984d
Merge remote-tracking branch 'origin/main' into dev/add-other-vlm-models
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-02 08:46:54 +02:00
Cesar Berrospi Ramis
984cb137f6
fix: guess HTML content starting with script tag ( #1673 )
...
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-06-02 08:43:24 +02:00
Michele Dolfi
55e0703945
missing file
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-02 08:40:04 +02:00
Michele Dolfi
910743a81a
exclude minimal_vlm
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 21:15:17 +02:00
Michele Dolfi
ffb7f071c3
remove not-needed function
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 21:13:54 +02:00
Michele Dolfi
7f6df727e3
add supported_devices
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 21:12:43 +02:00
Michele Dolfi
5d21153948
move more argument to options and simplify model init
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 18:49:00 +02:00
Michele Dolfi
3ff1712787
rename pipeline_vlm_model_spec
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 18:29:20 +02:00
Michele Dolfi
2bd15cc809
add new minimal_vlm example and refactor pipeline_options_vlm_model for cleaner import
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 18:24:04 +02:00
Michele Dolfi
f63312add6
use lowercase and uppercase only
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 17:55:16 +02:00
Michele Dolfi
8686842478
skip compare example in CI
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 16:57:48 +02:00
Michele Dolfi
0b2c1d5eda
refactor instances of VLM models
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 16:55:56 +02:00
Michele Dolfi
fb0d979419
remove unused value
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 16:34:02 +02:00
Michele Dolfi
9dbf08a084
use AutoModelForVision2Seq for Pixtral and review example (including rename)
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 16:30:58 +02:00
Michele Dolfi
0cb7520648
restore stable imports
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 09:06:41 +02:00
Cesar Berrospi Ramis
3942923125
chore: fix or ignore runtime and deprecation warnings ( #1660 )
...
* chore: fix or catch deprecation warnings
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore: update poetry lock with latest docling-core
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-05-28 17:55:31 +02:00
Panos Vagenas
b3e0042813
chore: exclude data from GH Linguist ( #1671 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-28 15:42:34 +02:00
Cesar Berrospi Ramis
106951e71e
test: add missing ground truth files ( #1667 )
...
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-05-28 13:26:49 +02:00
Peter W. J. Staar
b356b33059
feat: Add visualization of bbox on page with html export. ( #1663 )
...
* feat: Add visualization of bbox on page with html export.
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the cli
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the cli argument to show_layout
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-28 13:10:38 +02:00
DavidLee
51d3450915
fix: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte ( #1665 )
...
Update document.py
fix: when mime not "application/xml" or "text/plain" raise
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte
Signed-off-by: DavidLee <yongsheng_li@foxmail.com>
2025-05-27 14:06:05 +02:00
Peter Staar
a4e6777bb3
fixed the merge conflicts
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-23 16:30:18 +02:00
github-actions[bot]
2579d89510
chore: bump version to 2.34.0 [skip ci]
2025-05-22 18:44:45 +00:00
Said Gürbüz
c2f595d283
fix: fix ZeroDivisionError for cell_bbox.area() ( #1636 )
...
fix ZeroDivisionError for cell_bbox.area()
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
2025-05-22 13:43:33 +02:00
Clément Doumouro
45265bf8b1
feat(ocr): auto-detect rotated pages in Tesseract ( #1167 )
...
* fix(ocr): tesseract support mis-oriented documents
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): update missing test data
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): rotate image to the natural orientation before layout prediction
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): move bounding box rotation util to orientation.py
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): refactor rotation utilities
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* chore(ocr): revert layout updates
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* chore(ocr): update e2e OCR test data
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): avoid swallowing tesseract errors causing orientation detection failures
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* chore(ocr): revert layout updates
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* chore(ocr): update e2e OCR test data
* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`
* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`
* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`
* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation
* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`
---------
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
2025-05-21 18:12:33 +02:00
Christoph Auer
90875247e5
feat: Establish confidence estimation for document and pages ( #1313 )
...
* Establish confidence field, propagate layout confidence through
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add OCR confidence and parse confidence (stub)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add parse quality rules, use 5th percentile for overall and parse scores
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Heuristic updates
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix garbage regex
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Move grade to page
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Introduce mean_score and low_score, consistent aggregate computations
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add confidence test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-05-21 12:32:49 +02:00
Václav Vančura
14d4f5b109
fix(integration): update the Apify Actor integration ( #1619 )
...
* fix(actor): remove references to missing docling_processor.py
Signed-off-by: Václav Vančura <commit@vancura.dev>
* chore(actor): update Actor README.md with recent repo URL changes
Signed-off-by: Václav Vančura <commit@vancura.dev>
* chore(actor): improve the Actor README.md local header link
Signed-off-by: Václav Vančura <commit@vancura.dev>
* chore(actor): bump the Actor version number
Signed-off-by: Václav Vančura <commit@vancura.dev>
* Update .actor/actor.json
Co-authored-by: Marek Trunkát <marek@trunkat.eu>
Signed-off-by: Jan Čurn <jan.curn@gmail.com>
---------
Signed-off-by: Václav Vančura <commit@vancura.dev>
Signed-off-by: Jan Čurn <jan.curn@gmail.com>
Co-authored-by: Jan Čurn <jan.curn@gmail.com>
Co-authored-by: Marek Trunkát <marek@trunkat.eu>
2025-05-21 02:47:55 +02:00
github-actions[bot]
84d0889829
chore: bump version to 2.33.0 [skip ci]
2025-05-20 19:54:51 +00:00
MoheyElDin Badr
f4d9d4111b
fix: Fix issue with detecting docx files, and files with upper case extensions ( #1609 )
...
fix detecting files with uppercase extensions
Signed-off-by: MoheyElDin Badr <moheyeldin.badr@gmail.com>
2025-05-20 19:42:37 +02:00
Said Gürbüz
0e00a263fa
fix: load_from_doctags static usage ( #1617 )
...
* fix load_from_doctags usage
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* update dependencies
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* fix lock file
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* revert lock file
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* update lock file
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
---------
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
2025-05-20 15:06:12 +02:00
Krishnan
f2e9c0784c
fix: incorrect force_backend_text behaviour for VLM DocTag pipelines ( #1371 )
...
* Fix force_backend_text
Signed-off-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local>
* empty commit to retrigger CI
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-20 09:59:38 +02:00
Pedro Ribeiro
98b5eeb844
fix(pypdfium): resolve overlapping text when merging bounding boxes ( #1549 )
...
get merged_text from boundingbox instead of merging it to prevent overlaps
Signed-off-by: Pedro Ribeiro <pedro_ribeiro_93@hotmail.com>
2025-05-19 15:26:00 +02:00
AndrewTsai0406
12a0e64892
feat: add textbox content extraction in msword_backend ( #1538 )
...
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com>
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com>
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com>
---------
Signed-off-by: Andrew <tsai247365@gmail.com>
2025-05-19 15:01:36 +02:00
Peter Staar
374ecd4890
fixed the static load_from_doctags
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-18 10:58:39 +02:00
Peter Staar
1ada7bfee7
added the html backend to the VLM pipeline
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-18 10:55:27 +02:00
Peter Staar
e93cc3ce09
fixing the tests
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-18 07:38:06 +02:00
Peter Staar
0c7c7c11c2
reformatted the code
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-16 16:31:11 +02:00
Peter Staar
d5b6c871cf
streamlining all code
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-16 16:27:27 +02:00
Peter Staar
661f7c9780
fixed the pipeline for Phi4
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-16 15:55:49 +02:00
Peter Staar
d41b856961
finalising last points for vlms support
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-16 12:39:26 +02:00
Panos Vagenas
7c4c356e76
chore: fix chunking example data link ( #1596 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-16 08:44:47 +02:00
Peter Staar
fc61258273
merged with main
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-15 07:46:06 +02:00
Peter Staar
e2c95d09bc
need to get Phi4 working again ...
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-15 07:32:55 +02:00