755 Commits

Author SHA1 Message Date
Michele Dolfi
a5af082d82 chore: fix parsing of release body message (#2498)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-20 13:41:35 +02:00
Michele Dolfi
5be856fbc0 chore: add action posting to discord (#2486)
* add action posting to discord

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* test on push

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* with icon

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove testing

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-17 16:31:57 +02:00
Michele Dolfi
dd03b53117 docs: discord badge with join link (#2473)
* add discord link

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Add Discord link to social section in mkdocs.yml

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* Add Discord link to getting started documentation

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-10-16 10:13:50 +02:00
Michele Dolfi
1762bb8762 chore: update lock (#2468)
update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-15 20:35:49 +02:00
github-actions[bot]
ae61d640c1 chore: bump version to 2.57.0 [skip ci] v2.57.0
2025-10-15 09:20:31 +00:00
Rafael Teixeira de Lima
16829939cf feat(docx): Process drawingml objects in docx (#2453)
* Export of DrawingML figures into docling document

* Adding libreoffice env var and libreoffice to checks image

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* DCO Remediation Commit for Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

I, Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>, hereby add my Signed-off-by to this commit: 9518fffcad

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Enforcing apt get update

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Only display drawingml warning once per document

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* add util to test libreoffice and exclude files from test when not found

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* check libreoffice only once

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Only initialise converter if needed

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-15 10:58:08 +02:00
Peter W. J. Staar
3e6da2c62d docs: Example on PII obfuscation (#2459)
* added example on PII obfuscation

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatting code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* add in index and fix heading formatting

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add GLINER to PII

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* final commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-14 15:39:16 +02:00
Christoph Auer
cd7f7ba145 fix: Use proper page concatenation in VLM pipeline MD/HTML conversion (#2458)
* Use proper page concatenation in VLM pipeline MD/HTML conversion

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-10-14 14:12:26 +02:00
github-actions[bot]
3687d865f8 chore: bump version to 2.56.1 [skip ci] v2.56.1
2025-10-13 16:30:04 +00:00
Michele Dolfi
688a7dfd38 fix: avoid downloading easyocr models by default (#2454)
avoid downloading easyocr models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-13 17:58:06 +02:00
github-actions[bot]
10165dda8a chore: bump version to 2.56.0 [skip ci] v2.56.0
2025-10-13 09:19:06 +00:00
Animesh
db985bb159 fix(asr): Implement robust status check in AsrPipeline (#2442)
* test: Add failing test case for silent audio file

* fix: Implement robust status check in AsrPipeline

* DCO Remediation Commit for mastermaxx03 <srivastavaanimesh22@gmail.com>

I, mastermaxx03 <srivastavaanimesh22@gmail.com>, hereby add my Signed-off-by to this commit: 5fc4d512b330bb0cd347da4cbcca0fbe9687898a
I, mastermaxx03 <srivastavaanimesh22@gmail.com>, hereby add my Signed-off-by to this commit: 31a4e9a5f1

Signed-off-by: mastermaxx03 <srivastavaanimesh22@gmail.com>

* DCO Remediation Commit for mastermaxx03 <srivastavaanimesh22@gmail.com>

I, mastermaxx03 <srivastavaanimesh22@gmail.com>, hereby add my Signed-off-by to this commit: 5fc4d512b3
I, mastermaxx03 <srivastavaanimesh22@gmail.com>, hereby add my Signed-off-by to this commit: 31a4e9a5f1

Signed-off-by: mastermaxx03 <srivastavaanimesh22@gmail.com>

---------

Signed-off-by: mastermaxx03 <srivastavaanimesh22@gmail.com>
2025-10-13 09:51:31 +02:00
Jeremy Chen
90200443bc docs: Remove deprecated call in custom_convert.py (#2447)
Update custom_convert.py

export_to_document_tokens is deprecated so change it to export_to_doctags

Signed-off-by: Jeremy Chen <github@jeremychen.email>
2025-10-13 09:30:02 +02:00
Imad Saddik
2a0f56390a docs: fixed a few typos (#2441)
Signed-off-by: Imad Saddik <79410781+ImadSaddik@users.noreply.github.com>
2025-10-13 09:04:50 +02:00
Michele Dolfi
f7244a4333 feat: AutoOCR model selecting the best OCR model available and deprecating the usage of EasyOCR (#2391)
* add auto ocr model

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Apply suggestions from code review

Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

* add final log warning

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* propagate default options

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* allow rapidocr models download

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove modelscope

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2025-10-10 16:11:39 +02:00
Cesar Berrospi Ramis
cce18b2ff7 fix: deal with chartsheets in workbooks (#2433)
* fix(xlsx): deal with chartsheets in workbooks

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* tests(xlsx): align test file names

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-10-10 15:06:38 +02:00
Bruno Pio
f11f8c0a81 feat: Add Tesseract PSM options support (#2411)
* feat: Add Tesseract PSM options support

Signed-off-by: Bruno Pio <913963+blap@users.noreply.github.com>

* apply formatting

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add tesseract_cli in checks

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Bruno Pio <913963+blap@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-10 14:44:30 +02:00
Victor Moreli
ee5501320e fix: skip temporary docx files (#2413)
fix: CLI detects docx temporary files and breaks

Signed-off-by: Victor Moreli <victormoreli64@gmail.com>
2025-10-10 09:39:26 +02:00
pixiake
b5f7fef29b fix: AsrPipeline to handle absolute paths and BytesIO streams correctly (#2407)
Fix AsrPipeline to handle absolute paths and BytesIO streams correctly

Signed-off-by: pixiake <guofeng@spader-ai.com>
Co-authored-by: pixiake <guofeng@spader-ai.com>
2025-10-10 09:37:15 +02:00
Utsav Talwar
f2854b2e1d docs: Add MongoDB + VoyageAI (#2382)
Signed-off-by: Utsav Talwar <114057324+utsavMongoDB@users.noreply.github.com>
Co-authored-by: Utsav Talwar <114057324+utsavMongoDB@users.noreply.github.com>
2025-10-07 14:36:19 -04:00
Michele Dolfi
0610d01afa fix: enrichment of documents without pages metadata (pptx and xlsx) (#2401)
fix logic for pptx and xlsx

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-07 18:28:51 +02:00
Maxim Lysak
9705f4020c fix: Proper heading support in rich tables for HTML backend (#2394)
* Fix for the proper headers support in rich tables in HTML

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* cleaning up

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Compatibility with older Python versions

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixing Furniture before the first heading rule

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added minimalistic test case

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* added html for the test

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-10-07 15:57:32 +02:00
Utsav Talwar
8a4b946a1a docs: add RAG example with MongoDB Atlas Vector Search and VoyageAI embeddings (#2341)
* Add MongoDB RAG example

* Update MongoDB RAG Example

* Update MongoDB RAG Example

* Update MongoDB RAG Example

* DCO Remediation Commit for utsavMongoDB <utsav.talwar@mongodb.com>

I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: fbdbf53aa8
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: 9b3065ba2b
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: 1983f9db35
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: 0522aa105d
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: f5a67e8012

Signed-off-by: utsavMongoDB <utsav.talwar@mongodb.com>

* docs: Add example with MongoDB

* DCO Remediation Commit for utsavMongoDB <utsav.talwar@mongodb.com>

I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: bb245a31ed
I, utsavMongoDB <utsav.talwar@mongodb.com>, hereby add my Signed-off-by to this commit: 25436e543c

Signed-off-by: utsavMongoDB <utsav.talwar@mongodb.com>

---------

Signed-off-by: utsavMongoDB <utsav.talwar@mongodb.com>
Signed-off-by: Utsav Talwar <114057324+utsavMongoDB@users.noreply.github.com>
2025-10-03 13:29:43 +02:00
github-actions[bot]
22515b546a chore: bump version to 2.55.1 [skip ci] v2.55.1
2025-10-03 10:26:26 +00:00
Rui Dias Gomes
68230fe7e5 ci: split workflow to speedup CI runtime (#2313)
* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* enable test_e2e_pdfs_conversions

Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Rui Dias Gomes <66125272+rmdg88@users.noreply.github.com>

* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* split workflow

Signed-off-by: rmdg88 <rmdg88@gmail.com>

* fix conflict files

Signed-off-by: rmdg88 <rmdg88@gmail.com>

---------

Signed-off-by: rmdg88 <rmdg88@gmail.com>
Signed-off-by: Rui Dias Gomes <66125272+rmdg88@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-03 11:16:38 +02:00
Matvei Smirnov
ee73ffae15 fix(markdown): Setext heading support (#2359)
Signed-off-by: Matvei Smirnov <vdalekesmirnov@gmail.com>
Co-authored-by: Matvei Smirnov <matvei.smirnov@vkteam.ru>
2025-10-03 10:32:53 +02:00
Hakeem Abbas
246de77d8c fix(docs): fixed the color scheme (#2371)
* fix(docs): fixed the color scheme

Signed-off-by: Hakeem Abbas <hakeemsyd@gmail.com>

* fix(docs): colors background

Signed-off-by: Hakeem Abbas <hakeemsyd@gmail.com>

---------

Signed-off-by: Hakeem Abbas <hakeemsyd@gmail.com>
2025-10-03 10:20:44 +02:00
Michele Dolfi
a975a790c9 docs: example using Hashicorp Vault PII transform (#2373)
docs: add example using Hashicorp Vault PII transform

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-03 09:53:29 +02:00
Michele Dolfi
9505202e38 ci: update docling-parse and remove pages.json (#2372)
* update docling-parse and remove pages.json

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* ocr gt

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-03 09:53:13 +02:00
Christoph Auer
ca2be7ff3a fix: Empty table handling (#2365)
* add table raw cells when no table structure model was used

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Add RichTableCell instance for tables with missing structure.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* update test results

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-02 19:35:16 +02:00
Lucas Morin
e6c3b05e63 docs: Jobkit and connectors (#2357)
* feat: create documentation for docling-jobkit

Signed-off-by: Lucas Morin <lucas.morin222@gmail.com>

* small text fixes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Lucas Morin <lucas.morin222@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-02 13:46:56 +02:00
Michele Dolfi
4f295ed051 fix: add table raw content when no table structure model is used (#1815)
* add table raw cells when no table structure model was used

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Add RichTableCell instance for tables with missing structure.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-10-02 13:46:42 +02:00
github-actions[bot]
f0b630e24e chore: bump version to 2.55.0 [skip ci] v2.55.0
2025-09-30 14:50:42 +00:00
Christoph Auer
1e9dc43b72 feat: Repetition-based StoppingCriteria for GraniteDocling (#2323)
* Experimental code for repetition detection, VLLM Streaming

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update VLLM Streaming

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update VLLM inference code, CLI and VLM specs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix generation and decoder args for HF model

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix vllm device args

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Bugfixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove streaming VLLM for the moment

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add repetition StoppingCriteria for GraniteDocling/SmolDocling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make GenerationStopper base class and port for MLX

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add streaming support and custom GenerationStopper support for ApiVlmModel

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for ApiVlmModel

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for ApiVlmModel

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix api_image_request_streaming when GenerationStopper triggers.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Move DocTagsRepetitionStopper to utility unit, update examples

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-09-30 15:26:09 +02:00
Michele Dolfi
68ae7ccf3c fix: pin wider range of typer (#2309)
* pin larger range of typer

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* lock docling-parse 4.5.0

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update results with docling-parse=4.4.0

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-09-30 08:42:23 +02:00
Christoph Auer
654c70f990 fix: Update Transformers & VLLM inference code, CLI and VLM specs (#2322)
* Update VLLM inference code, CLI and VLM specs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix generation and decoder args for HF model

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix vllm device args

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Bugfixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-09-29 21:06:54 +02:00
Maxim Lysak
c803abed9a feat: Rich tables support for HTML backend (#2324)
* Rich tables support for HTML backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Decoupling JATS backend from HTML backend, ways of creating tables changed significantly

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* updated and added tests

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Refactored parse_table_data in html_backend into few smaller functions

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Changing scope of few functions in html_backend.py, making them static, when possible

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fix for HTML tables that have tbody and/or thead, now these tables are also properly supported

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-09-29 18:12:16 +02:00
Hakeem Abbas
325877aee9 docs(styling): update color scheme (#2154)
* update the colors scheme

* update mkdocs.yml

* DCO Remediation Commit for Hakeem Abbas <hakeemsyd@gmail.com>

I, Hakeem Abbas <hakeemsyd@gmail.com>, hereby add my Signed-off-by to this commit: 861cb8ce6e
I, Hakeem Abbas <hakeemsyd@gmail.com>, hereby add my Signed-off-by to this commit: 72539fe5c0

Signed-off-by: Hakeem Abbas <hakeemsyd@gmail.com>

* update image

* DCO Remediation Commit for Hakeem Abbas <hakeemsyd@gmail.com>

I, Hakeem Abbas <hakeemsyd@gmail.com>, hereby add my Signed-off-by to this commit: 861cb8ce6e
I, Hakeem Abbas <hakeemsyd@gmail.com>, hereby add my Signed-off-by to this commit: 72539fe5c0
I, Hakeem Abbas <hakeemsyd@gmail.com>, hereby add my Signed-off-by to this commit: 1be2646643

Signed-off-by: Hakeem Abbas <hakeemsyd@gmail.com>

* undo image change

Signed-off-by: Hakeem Abbas <hakeemsyd@gmail.com>

---------

Signed-off-by: Hakeem Abbas <hakeemsyd@gmail.com>
2025-09-29 11:44:40 +02:00
Luis
a873200c9d docs(vlm): Update SmolDocling to GraniteDocling references (#2315)
Update minimal_vlm_pipeline.py

Signed-off-by: Luis <luis.rojas@ibm.com>
2025-09-25 11:07:39 +02:00
Lucas Morin
9d67bb9ed6 fix: support escaped characters in markdown backend (#2304)
fix: improve markdown backend to support input documents with escaped characters

Signed-off-by: Lucas Morin <lucas.morin222@gmail.com>
2025-09-23 18:00:16 +02:00
github-actions[bot]
d599177547 chore: bump version to 2.54.0 [skip ci] v2.54.0
2025-09-22 15:28:30 +00:00
Maxim Lysak
e2482a2ada feat: Rich tables for MSWord backend (#2291)
* Adding support of rich table cells to MSWord backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixes for properly accounting lists, pictures and headers in rich table cells

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Cleaned up msword backend, re-generated docx tests

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added detection of simple table cells in word backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Cleaned up

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-09-22 16:41:59 +02:00
Cesar Berrospi Ramis
46efaaefee feat: add a backend parser for WebVTT files (#2288)
* feat: add a backend parser for WebVTT files

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: update README with VTT support

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: add description to supported formats

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: upgrade docling-core to unescape WebVTT in markdown

Pin the new release of docling-core 2.48.2.
Do not escape HTML reserved characters when exporting WebVTT documents to markdown.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test: add missing copyright notice

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-09-22 15:24:34 +02:00
manuflexor
b5628f1227 fix: correct y-axis scaling in draw_table_cells (#2287)
* Fix y axis

* DCO Remediation Commit for manuflexor <imanuel@flexor.ai>

I, manuflexor <imanuel@flexor.ai>, hereby add my Signed-off-by to this commit: cd56622d4f

Signed-off-by: manuflexor <imanuel@flexor.ai>

---------

Signed-off-by: manuflexor <imanuel@flexor.ai>
2025-09-19 13:42:29 +02:00
Christoph Auer
8b7e83a8c7 docs: Update API VLM example with granite-docling (#2294)
chore: Update API VLM example with granite-docling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-09-19 12:23:53 +02:00
Panos Vagenas
8322c2ea9b docs: fix examples rendering (#2281)
fix examples rendering

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-09-17 20:50:50 -04:00
github-actions[bot]
f1687fb09b chore: bump version to 2.53.0 [skip ci] v2.53.0
2025-09-17 13:59:33 +00:00
Christoph Auer
17afb664d0 feat: Add granite-docling model (#2272)
* adding granite-docling preview

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the model specs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* typo

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use granite-docling and add to the model downloader

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update docs and README

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Update final repo_ids for GraniteDocling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update final repo_ids for GraniteDocling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix model name in CLI usage example

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Fix VLM model name in README.md

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-09-17 15:15:49 +02:00
Mingxuan Zhao
ff351fd40c docs: Describe examples (#2262)
* Update .py examples with clearer guidance, update out-of-date imports and calls

Signed-off-by: Mingxuan Zhao <43148277+mingxzhao@users.noreply.github.com>

* Fix minimal.py string error, fix ruff format error

Signed-off-by: Mingxuan Zhao <43148277+mingxzhao@users.noreply.github.com>

* fix more CI issues

Signed-off-by: Mingxuan Zhao <43148277+mingxzhao@users.noreply.github.com>

---------

Signed-off-by: Mingxuan Zhao <43148277+mingxzhao@users.noreply.github.com>
2025-09-16 16:00:38 +02:00
dmorady1
0e95171dd6 feat(RapidOcr): Support generic extra arguments for RapidOcr (#2266)
* feat: add support for additional parameters in RapidOcrOptions and fix RapidOcr font_path

* DCO Remediation Commit for David Morady <29502285+dmorady1@users.noreply.github.com>

I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 133d989060

Signed-off-by: David Morady <29502285+dmorady1@users.noreply.github.com>

* fix: RapidOcr ensure backwards compatibility and add deprecation note

* add warning log for rec_font_path

* DCO Remediation Commit for David Morady <29502285+dmorady1@users.noreply.github.com>

I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 133d989060
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 0a65eed28a
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: ac96f1483f

Signed-off-by: David Morady <29502285+dmorady1@users.noreply.github.com>

* add tests for code coverage for rapidocr

* DCO Remediation Commit for David Morady <29502285+dmorady1@users.noreply.github.com>

I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 133d989060
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 0a65eed28a
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: ac96f1483f
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: af5df4bb30

Signed-off-by: David Morady <29502285+dmorady1@users.noreply.github.com>

* add small comment for test

* DCO Remediation Commit for David Morady <29502285+dmorady1@users.noreply.github.com>

I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 133d989060
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 0a65eed28a
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: ac96f1483f
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: af5df4bb30
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: ab893b637f

Signed-off-by: David Morady <29502285+dmorady1@users.noreply.github.com>

* fix test comment

* DCO Remediation Commit for David Morady <29502285+dmorady1@users.noreply.github.com>

I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 133d989060
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 0a65eed28a
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: ac96f1483f
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: af5df4bb30
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: ab893b637f
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 028e332aa9

Signed-off-by: David Morady <29502285+dmorady1@users.noreply.github.com>

---------

Signed-off-by: David Morady <29502285+dmorady1@users.noreply.github.com>
2025-09-16 07:26:10 +02:00