Commit Graph

690 Commits

Author SHA1 Message Date
Cesar Berrospi Ramis
cce18b2ff7 fix: deal with chartsheets in workbooks (#2433)
* fix(xlsx): deal with chartsheets in workbooks

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(xlsx): align test file names

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-10-10 15:06:38 +02:00
Bruno Pio
f11f8c0a81 feat: Add Tesseract PSM options support (#2411)
* feat: Add Tesseract PSM options support

Signed-off-by: Bruno Pio <913963+blap@users.noreply.github.com>

* apply formatting

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add tesseract_cli in checks

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Bruno Pio <913963+blap@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-10 14:44:30 +02:00
Victor Moreli
ee5501320e fix: skip temporary docx files (#2413)
fix: CLI detects docx temporary files and breaks

Signed-off-by: Victor Moreli <victormoreli64@gmail.com>
2025-10-10 09:39:26 +02:00
pixiake
b5f7fef29b fix: AsrPipeline to handle absolute paths and BytesIO streams correctly (#2407)
Fix AsrPipeline to handle absolute paths and BytesIO streams correctly

Signed-off-by: pixiake <guofeng@spader-ai.com>
Co-authored-by: pixiake <guofeng@spader-ai.com>
2025-10-10 09:37:15 +02:00
Utsav Talwar
f2854b2e1d docs: Add MongoDB + VoyageAI (#2382)
Signed-off-by: Utsav Talwar <114057324+utsavMongoDB@users.noreply.github.com>
Co-authored-by: Utsav Talwar <114057324+utsavMongoDB@users.noreply.github.com>
2025-10-07 14:36:19 -04:00
Michele Dolfi
0610d01afa fix: enrichment of documents without pages metadata (pptx and xlsx) (#2401)
fix logic for pptx and xlsx

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-07 18:28:51 +02:00
Maxim Lysak
9705f4020c fix: Proper heading support in rich tables for HTML backend (#2394)
* Fix for the proper headers support in rich tables in HTML

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* cleaning up

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Compatibility with older Python versions

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixing Furniture before the first heading rule

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added minimalistic test case

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* added html for the test

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-10-07 15:57:32 +02:00
Utsav Talwar
8a4b946a1a docs: add RAG example with MongoDB Atlas Vector Search and VoyageAI embeddings (#2341)
* Add MongoDB RAG example

* Update MongoDB RAG Example

* Update MongoDB RAG Example

* Update MongoDB RAG Example

* DCO Remediation Commit for utsavMongoDB <utsav.talwar@mongodb.com>

I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: fbdbf53aa8
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: 9b3065ba2b
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: 1983f9db35
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: 0522aa105d
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: f5a67e8012

Signed-off-by: utsavMongoDB <utsav.talwar@mongodb.com>

* docs: Add example with MongoDB

* DCO Remediation Commit for utsavMongoDB <utsav.talwar@mongodb.com>

I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: bb245a31ed
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: 25436e543c

Signed-off-by: utsavMongoDB <utsav.talwar@mongodb.com>

---------

Signed-off-by: utsavMongoDB <utsav.talwar@mongodb.com>
Signed-off-by: Utsav Talwar <114057324+utsavMongoDB@users.noreply.github.com>
2025-10-03 13:29:43 +02:00
github-actions[bot]
22515b546a chore: bump version to 2.55.1 [skip ci] v2.55.1
2025-10-03 10:26:26 +00:00
Rui Dias Gomes
68230fe7e5 ci: split workflow to speedup CI runtime (#2313)
* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* enable test_e2e_pdfs_conversions

Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Rui Dias Gomes <66125272+rmdg88@users.noreply.github.com>

* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* fix conflict files

Signed-off-by: rmdg88 <rmdg88@gmail.com>

---------

Signed-off-by: rmdg88 <rmdg88@gmail.com>
Signed-off-by: Rui Dias Gomes <66125272+rmdg88@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-03 11:16:38 +02:00
Matvei Smirnov
ee73ffae15 fix(markdown): Setext heading support (#2359)
Signed-off-by: Matvei Smirnov <vdalekesmirnov@gmail.com>
Co-authored-by: Matvei Smirnov <matvei.smirnov@vkteam.ru>
2025-10-03 10:32:53 +02:00
Hakeem Abbas
246de77d8c fix(docs): fixed the color scheme (#2371)
* fix(docs): fixed the color scheme

Signed-off-by: Hakeem Abbas <hakeemsyd@gmail.com>

* fix(docs): colors background

Signed-off-by: Hakeem Abbas <hakeemsyd@gmail.com>

---------

Signed-off-by: Hakeem Abbas <hakeemsyd@gmail.com>
2025-10-03 10:20:44 +02:00
Michele Dolfi
a975a790c9 docs: example using Hashicorp Vault PII transform (#2373)
docs: add example using Hashicorp Vault PII transform

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-03 09:53:29 +02:00
Michele Dolfi
9505202e38 ci: update docling-parse and remove pages.json (#2372)
* update docling-parse and remove pages.json

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* ocr gt

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-03 09:53:13 +02:00
Christoph Auer
ca2be7ff3a fix: Empty table handling (#2365)
* add table raw cells when no table structure model was used

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Add RichTableCell instance for tables with missing structure.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* update test results

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-02 19:35:16 +02:00
Lucas Morin
e6c3b05e63 docs: Jobkit and connectors (#2357)
* feat: create documentation for docling-jobkit

Signed-off-by: Lucas Morin <lucas.morin222@gmail.com>

* small text fixes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Lucas Morin <lucas.morin222@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-02 13:46:56 +02:00
Michele Dolfi
4f295ed051 fix: add table raw content when no table structure model is used (#1815)
* add table raw cells when no table structure model was used

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Add RichTableCell instance for tables with missing structure.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-10-02 13:46:42 +02:00
github-actions[bot]
f0b630e24e chore: bump version to 2.55.0 [skip ci] v2.55.0
2025-09-30 14:50:42 +00:00
Christoph Auer
1e9dc43b72 feat: Repetition-based StoppingCriteria for GraniteDocling (#2323)
* Experimental code for repetition detection, VLLM Streaming

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update VLLM Streaming

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update VLLM inference code, CLI and VLM specs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix generation and decoder args for HF model

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix vllm device args

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Bugfixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove streaming VLLM for the moment

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add repetition StoppingCriteria for GraniteDocling/SmolDocling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make GenerationStopper base class and port for MLX

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add streaming support and custom GenerationStopper support for ApiVlmModel

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for ApiVlmModel

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for ApiVlmModel

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix api_image_request_streaming when GenerationStopper triggers.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Move DocTagsRepetitionStopper to utility unit, update examples

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-09-30 15:26:09 +02:00
Michele Dolfi
68ae7ccf3c fix: pin wider range of typer (#2309)
* pin larger range of typer

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* lock docling-parse 4.5.0

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update results with docling-parse=4.4.0

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-09-30 08:42:23 +02:00
Christoph Auer
654c70f990 fix: Update Transformers & VLLM inference code, CLI and VLM specs (#2322)
* Update VLLM inference code, CLI and VLM specs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix generation and decoder args for HF model

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix vllm device args

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Bugfixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-09-29 21:06:54 +02:00
Maxim Lysak
c803abed9a feat: Rich tables support for HTML backend (#2324)
* Rich tables support for HTML backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Decoupling JATS backend from HTML backend, ways of creating tables changed significantly

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* updated and added tests

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Refactored parse_table_data in html_backend into few smaller functions

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Changing scope of few functions in html_backend.py, making them static, when possible

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fix for HTML tables that have tbody and/or thead, now these tables are also properly supported

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-09-29 18:12:16 +02:00
Hakeem Abbas
325877aee9 docs(styling): update color scheme (#2154)
* update the colors scheme

* update mkdocs.yml

* DCO Remediation Commit for Hakeem Abbas <hakeemsyd@gmail.com>

I, Hakeem Abbas <hakeemsyd@gmail.com>, hereby add my Signed-off-by to this commit: 861cb8ce6e
I, Hakeem Abbas <hakeemsyd@gmail.com>, hereby add my Signed-off-by to this commit: 72539fe5c0

Signed-off-by: Hakeem Abbas <hakeemsyd@gmail.com>

* update image

* DCO Remediation Commit for Hakeem Abbas <hakeemsyd@gmail.com>

I, Hakeem Abbas <hakeemsyd@gmail.com>, hereby add my Signed-off-by to this commit: 861cb8ce6e
I, Hakeem Abbas <hakeemsyd@gmail.com>, hereby add my Signed-off-by to this commit: 72539fe5c0
I, Hakeem Abbas <hakeemsyd@gmail.com>, hereby add my Signed-off-by to this commit: 1be2646643

Signed-off-by: Hakeem Abbas <hakeemsyd@gmail.com>

* undo image change

Signed-off-by: Hakeem Abbas <hakeemsyd@gmail.com>

---------

Signed-off-by: Hakeem Abbas <hakeemsyd@gmail.com>
2025-09-29 11:44:40 +02:00
Luis
a873200c9d docs(vlm): Update SmolDocling to GraniteDocling references (#2315)
Update minimal_vlm_pipeline.py

Signed-off-by: Luis <luis.rojas@ibm.com>
2025-09-25 11:07:39 +02:00
Lucas Morin
9d67bb9ed6 fix: support escaped characters in markdown backend (#2304)
fix: improve markdown backend to support input documents with escaped characters

Signed-off-by: Lucas Morin <lucas.morin222@gmail.com>
2025-09-23 18:00:16 +02:00
github-actions[bot]
d599177547 chore: bump version to 2.54.0 [skip ci] v2.54.0
2025-09-22 15:28:30 +00:00
Maxim Lysak
e2482a2ada feat: Rich tables for MSWord backend (#2291)
* Adding support of rich table cells to MSWord backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixes for properly accounting lists, pictures and headers in rich table cells

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Cleaned up msword backend, re-generated docx tests

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added detection of simple table cells in word backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Cleaned up

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-09-22 16:41:59 +02:00
Cesar Berrospi Ramis
46efaaefee feat: add a backend parser for WebVTT files (#2288)
* feat: add a backend parser for WebVTT files

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: update README with VTT support

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: add description to supported formats

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: upgrade docling-core to unescape WebVTT in markdown

Pin the new release of docling-core 2.48.2.
Do not escape HTML reserved characters when exporting WebVTT documents to markdown.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test: add missing copyright notice

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-09-22 15:24:34 +02:00
manuflexor
b5628f1227 fix: correct y-axis scaling in draw_table_cells (#2287)
* Fix y axis

* DCO Remediation Commit for manuflexor <imanuel@flexor.ai>

I, manuflexor <imanuel@flexor.ai>, hereby add my Signed-off-by to this commit: cd56622d4f

Signed-off-by: manuflexor <imanuel@flexor.ai>

---------

Signed-off-by: manuflexor <imanuel@flexor.ai>
2025-09-19 13:42:29 +02:00
Christoph Auer
8b7e83a8c7 docs: Update API VLM example with granite-docling (#2294)
chore: Update API VLM example with granite-docling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-09-19 12:23:53 +02:00
Panos Vagenas
8322c2ea9b docs: fix examples rendering (#2281)
fix examples rendering

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-09-17 20:50:50 -04:00
github-actions[bot]
f1687fb09b chore: bump version to 2.53.0 [skip ci] v2.53.0
2025-09-17 13:59:33 +00:00
Christoph Auer
17afb664d0 feat: Add granite-docling model (#2272)
* adding granite-docling preview

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the model specs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* typo

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use granite-docling and add to the model downloader

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update docs and README

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Update final repo_ids for GraniteDocling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update final repo_ids for GraniteDocling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix model name in CLI usage example

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Fix VLM model name in README.md

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-09-17 15:15:49 +02:00
Mingxuan Zhao
ff351fd40c docs: Describe examples (#2262)
* Update .py examples with clearer guidance, update out-of-date imports and calls

Signed-off-by: Mingxuan Zhao <43148277+mingxzhao@users.noreply.github.com>

* Fix minimal.py string error, fix ruff format error

Signed-off-by: Mingxuan Zhao <43148277+mingxzhao@users.noreply.github.com>

* fix more CI issues

Signed-off-by: Mingxuan Zhao <43148277+mingxzhao@users.noreply.github.com>

---------

Signed-off-by: Mingxuan Zhao <43148277+mingxzhao@users.noreply.github.com>
2025-09-16 16:00:38 +02:00
dmorady1
0e95171dd6 feat(RapidOcr): Support generic extra arguments for RapidOcr (#2266)
* feat: add support for additional parameters in RapidOcrOptions and fix RapidOcr font_path

* DCO Remediation Commit for David Morady <29502285+dmorady1@users.noreply.github.com>

I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 133d989060

Signed-off-by: David Morady <29502285+dmorady1@users.noreply.github.com>

* fix: RapidOcr ensure backwards compatibility and add deprecation note

* add warning log for rec_font_path

* DCO Remediation Commit for David Morady <29502285+dmorady1@users.noreply.github.com>

I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 133d989060
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 0a65eed28a
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: ac96f1483f

Signed-off-by: David Morady <29502285+dmorady1@users.noreply.github.com>

* add tests for code coverage for rapidocr

* DCO Remediation Commit for David Morady <29502285+dmorady1@users.noreply.github.com>

I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 133d989060
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 0a65eed28a
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: ac96f1483f
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: af5df4bb30

Signed-off-by: David Morady <29502285+dmorady1@users.noreply.github.com>

* add small comment for test

* DCO Remediation Commit for David Morady <29502285+dmorady1@users.noreply.github.com>

I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 133d989060
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 0a65eed28a
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: ac96f1483f
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: af5df4bb30
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: ab893b637f

Signed-off-by: David Morady <29502285+dmorady1@users.noreply.github.com>

* fix test comment

* DCO Remediation Commit for David Morady <29502285+dmorady1@users.noreply.github.com>

I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 133d989060
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 0a65eed28a
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: ac96f1483f
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: af5df4bb30
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: ab893b637f
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 028e332aa9

Signed-off-by: David Morady <29502285+dmorady1@users.noreply.github.com>

---------

Signed-off-by: David Morady <29502285+dmorady1@users.noreply.github.com>
2025-09-16 07:26:10 +02:00
Michele Dolfi
ad2f738231 chore: update lock (#2265)
* update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update changes from docling-core update

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-09-15 11:19:15 +02:00
Yuie.
609d902eef fix: handle empty result from RapidOCR to avoid crash (#2264)
Signed-off-by: Junehyuk Park <yuie@evonit.net>
2025-09-15 10:04:33 +02:00
github-actions[bot]
10bb0aee2d chore: bump version to 2.52.0 [skip ci] v2.52.0
2025-09-11 16:11:20 +00:00
Christoph Auer
0700af212c fix: Add missing features in ThreadedStandardPdfPipeline (#2252)
Add missing features in ThreadedStandardPdfPipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-09-11 16:26:02 +02:00
Michele Dolfi
2c9123419f feat: enrichment steps on all convert pipelines (incl docx, html, etc) (#2251)
* allow enrichment on all convert pipelines

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* set options in CLI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-09-11 15:09:00 +02:00
Michele Dolfi
c6965495a2 fix: address deprecation warnings of dependencies (#2237)
* switch to dtype instead of torch_dtype

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* set __check_model__ to avoid deprecation warnings

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove dataloaders warnings in easyocr

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* suppress with option

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-09-10 14:38:34 +02:00
Cesar Berrospi Ramis
f8cc545bab docs: add an example of RAG with OpenSearch (#2238)
* docs: add an example of RAG with OpenSearch

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: pin latest docling-core and update uv.lock

Pin latest version release of docling-core in pyproject.toml
Update the dependencies in uv.lock file
Run the notebook rag_opensearch.ipynb to pick up changes from docling-core

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-09-10 14:37:22 +02:00
Roy Derks
e5cd7020bd docs: Add instructions for using Docling with MCP to README (#2219)
* docs: Add instructions for using Docling with MCP to README

* DCO Remediation Commit for Roy Derks <10717410+royderks@users.noreply.github.com>

Signed-off-by: Roy Derks <roy.derks@ibm.com>

* DCO Remediation Commit for Roy Derks <10717410+royderks@users.noreply.github.com>

I, Roy Derks <10717410+royderks@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 4b9ba1d0ef

Signed-off-by: Roy Derks <roy.derks@ibm.com>

* docs: reorganize documentation on MCP server

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: align README with documentation index page

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Roy Derks <roy.derks@ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Roy Derks <roy.derks@ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-09-10 10:02:28 +02:00
Tamás Bitai
55f5f3752f docs: Document VLM support requirement in extraction example (#2231)
* docs: Document VLM support requirement in extraction example

* DCO Remediation Commit for Tamás Bitai <bitai.tamas@gmail.com>

I, Tamás Bitai <bitai.tamas@gmail.com>, hereby add my Signed-off-by to this commit: b90defdb77

Signed-off-by: Tamás Bitai <bitai.tamas@gmail.com>

---------

Signed-off-by: Tamás Bitai <bitai.tamas@gmail.com>
2025-09-09 13:45:55 +02:00
github-actions[bot]
df60673992 chore: bump version to 2.51.0 [skip ci] v2.51.0
2025-09-05 13:01:33 +00:00
Peter W. J. Staar
b49d1ad4f1 feat: updating default parameters to get better performance with docling-parse (#2208)
* updated the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the parameters

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-09-05 14:06:21 +02:00
Panos Vagenas
a9f41b088e docs: add information extraction example (#2199)
* docs: add information extraction example

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update README

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* minor typo

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update README

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-09-05 11:27:09 +02:00
Peter W. J. Staar
b3d7542061 feat: updated the backend for new docling-parse (#2187)
* updated the backend and pyproject.toml

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the version and test files

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the lock

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* forgot to add 1 updated test-file

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the lock

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-09-05 10:42:31 +02:00
Alina Ryan
2c3f6faf3d chore: update deprecation note for OcrEngine (#2200)
This commit updates the deprecation note to correctly point to
get_ocr_factory().registered_kind.

Signed-off-by: Alina Ryan <aliryan@redhat.com>
2025-09-05 08:24:14 +02:00
github-actions[bot]
3419c42f10 chore: bump version to 2.50.0 [skip ci] v2.50.0
2025-09-03 11:39:08 +00:00