Commit Graph

559 Commits

Author SHA1 Message Date
Michele Dolfi
6d279f1c41 add docs for vision models
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-02 15:16:23 +02:00
Michele Dolfi
07045386c6 remove torch type
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-02 14:29:03 +02:00
Michele Dolfi
738385004a Merge remote-tracking branch 'origin/main' into dev/add-other-vlm-models
2025-06-02 14:08:23 +02:00
Michele Dolfi
ea5719c39d use single HF VLM model class
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-02 13:25:51 +02:00
Michele Dolfi
8006683007 remove hf_vlm_model and add extra_generation_args
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-02 12:58:32 +02:00
Edgar Hipp
11ca4f7a7b docs: fix typo in index.md (#1676)
Signed-off-by: Edgar Hipp <hipp.edg@gmail.com>
2025-06-02 12:35:59 +02:00
Panos Vagenas
1c8a1283c4 test: ensure utf-8 in test data utils (#1691)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-06-02 12:13:19 +02:00
Michele Dolfi
c0847c97a7 use module import and remove MLX from non-darwin
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-02 10:45:46 +02:00
Michele Dolfi
b9c1698263 rename to specs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-02 10:40:06 +02:00
Michele Dolfi
76718cb1f9 add message for transformers version
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-02 09:55:15 +02:00
Michele Dolfi
3ba698984d Merge remote-tracking branch 'origin/main' into dev/add-other-vlm-models
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-02 08:46:54 +02:00
Cesar Berrospi Ramis
984cb137f6 fix: guess HTML content starting with script tag (#1673)
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-06-02 08:43:24 +02:00
Michele Dolfi
55e0703945 missing file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-02 08:40:04 +02:00
Michele Dolfi
910743a81a exclude minimal_vlm
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 21:15:17 +02:00
Michele Dolfi
ffb7f071c3 remove not-needed function
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 21:13:54 +02:00
Michele Dolfi
7f6df727e3 add supported_devices
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 21:12:43 +02:00
Michele Dolfi
5d21153948 move more argument to options and simplify model init
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 18:49:00 +02:00
Michele Dolfi
3ff1712787 rename pipeline_vlm_model_spec
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 18:29:20 +02:00
Michele Dolfi
2bd15cc809 add new minimal_vlm example and refactor pipeline_options_vlm_model for cleaner import
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 18:24:04 +02:00
Michele Dolfi
f63312add6 use lowercase and uppercase only
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 17:55:16 +02:00
Michele Dolfi
8686842478 skip compare example in CI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 16:57:48 +02:00
Michele Dolfi
0b2c1d5eda refactor instances of VLM models
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 16:55:56 +02:00
Michele Dolfi
fb0d979419 remove unused value
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 16:34:02 +02:00
Michele Dolfi
9dbf08a084 use AutoModelForVision2Seq for Pixtral and review example (including rename)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 16:30:58 +02:00
Michele Dolfi
0cb7520648 restore stable imports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-01 09:06:41 +02:00
Cesar Berrospi Ramis
3942923125 chore: fix or ignore runtime and deprecation warnings (#1660)
* chore: fix or catch deprecation warnings

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: update poetry lock with latest docling-core

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-05-28 17:55:31 +02:00
Panos Vagenas
b3e0042813 chore: exclude data from GH Linguist (#1671)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-28 15:42:34 +02:00
Cesar Berrospi Ramis
106951e71e test: add missing ground truth files (#1667)
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-05-28 13:26:49 +02:00
Peter W. J. Staar
b356b33059 feat: Add visualization of bbox on page with html export. (#1663)
* feat: Add visualization of bbox on page with html export.

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the cli

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the cli argument to show_layout

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-28 13:10:38 +02:00
DavidLee
51d3450915 fix: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte (#1665)
Update document.py

fix: when mime not "application/xml" or "text/plain" raise
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte

Signed-off-by: DavidLee <yongsheng_li@foxmail.com>
2025-05-27 14:06:05 +02:00
Peter Staar
a4e6777bb3 fixed the merge conflicts
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-23 16:30:18 +02:00
github-actions[bot]
2579d89510 chore: bump version to 2.34.0 [skip ci]
2025-05-22 18:44:45 +00:00
Said Gürbüz
c2f595d283 fix: fix ZeroDivisionError for cell_bbox.area() (#1636)
fix ZeroDivisionError for cell_bbox.area()

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
2025-05-22 13:43:33 +02:00
Clément Doumouro
45265bf8b1 feat(ocr): auto-detect rotated pages in Tesseract (#1167)
* fix(ocr): tesseract support mis-oriented documents

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): update missing test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): rotate image to the natural orientation before layout prediction

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): move bounding box rotation util to orientation.py

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): refactor rotation utilities

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): avoid to swallow tesseract errors causing orientation detection failures

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`

* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`

* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation

* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`

---------

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
2025-05-21 18:12:33 +02:00
Christoph Auer
90875247e5 feat: Establish confidence estimation for document and pages (#1313)
* Establish confidence field, propagate layout confidence through

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add OCR confidence and parse confidence (stub)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add parse quality rules, use 5% percentile for overall and parse scores

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Heuristic updates

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix garbage regex

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Move grade to page

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Introduce mean_score and low_score, consistent aggregate computations

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add confidence test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-05-21 12:32:49 +02:00
Václav Vančura
14d4f5b109 fix(integration): update the Apify Actor integration (#1619)
* fix(actor): remove references to missing docling_processor.py

Signed-off-by: Václav Vančura <commit@vancura.dev>

* chore(actor): update Actor README.md with recent repo URL changes

Signed-off-by: Václav Vančura <commit@vancura.dev>

* chore(actor): improve the Actor README.md local header link

Signed-off-by: Václav Vančura <commit@vancura.dev>

* chore(actor): bump the Actor version number

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Update .actor/actor.json

Co-authored-by: Marek Trunkát <marek@trunkat.eu>
Signed-off-by: Jan Čurn <jan.curn@gmail.com>

---------

Signed-off-by: Václav Vančura <commit@vancura.dev>
Signed-off-by: Jan Čurn <jan.curn@gmail.com>
Co-authored-by: Jan Čurn <jan.curn@gmail.com>
Co-authored-by: Marek Trunkát <marek@trunkat.eu>
2025-05-21 02:47:55 +02:00
github-actions[bot]
84d0889829 chore: bump version to 2.33.0 [skip ci]
2025-05-20 19:54:51 +00:00
MoheyElDin Badr
f4d9d4111b fix: Fix issue with detecting docx files, and files with upper case extensions (#1609)
fix detecting files with uppercase extensions

Signed-off-by: MoheyElDin Badr <moheyeldin.badr@gmail.com>
2025-05-20 19:42:37 +02:00
Said Gürbüz
0e00a263fa fix: load_from_doctags static usage (#1617)
* fix load_from_doctags usage

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* update dependencies

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* fix lock file

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* revert lock file

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* update lock file

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

---------

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
2025-05-20 15:06:12 +02:00
Krishnan
f2e9c0784c fix: incorrect force_backend_text behaviour for VLM DocTag pipelines (#1371)
* Fix force_backend_text

Signed-off-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local>

* empty commit to retrigger CI

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-20 09:59:38 +02:00
Pedro Ribeiro
98b5eeb844 fix(pypdfium): resolve overlapping text when merging bounding boxes (#1549)
get merged_text from boundingbox instead of merging it to prevent overlaps

Signed-off-by: Pedro Ribeiro <pedro_ribeiro_93@hotmail.com>
2025-05-19 15:26:00 +02:00
AndrewTsai0406
12a0e64892 feat: add textbox content extraction in msword_backend (#1538)
* feat: add textbox content extraction in msword_backend

Signed-off-by: Andrew <tsai247365@gmail.com>

* feat: add textbox content extraction in msword_backend

Signed-off-by: Andrew <tsai247365@gmail.com>

* feat: add textbox content extraction in msword_backend

Signed-off-by: Andrew <tsai247365@gmail.com>

---------

Signed-off-by: Andrew <tsai247365@gmail.com>
2025-05-19 15:01:36 +02:00
Peter Staar
374ecd4890 fixed the static load_from_doctags
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-18 10:58:39 +02:00
Peter Staar
1ada7bfee7 added the html backend to the VLM pipeline
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-18 10:55:27 +02:00
Peter Staar
e93cc3ce09 fixing the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-18 07:38:06 +02:00
Peter Staar
0c7c7c11c2 reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-16 16:31:11 +02:00
Peter Staar
d5b6c871cf streamlining all code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-16 16:27:27 +02:00
Peter Staar
661f7c9780 fixed the pipeline for Phi4
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-16 15:55:49 +02:00
Peter Staar
d41b856961 finalising last points for vlms support
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-16 12:39:26 +02:00
Panos Vagenas
7c4c356e76 chore: fix chunking example data link (#1596)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-16 08:44:47 +02:00