Commit Graph

382 Commits

Author SHA1 Message Date
pixiake
b5f7fef29b fix: AsrPipeline to handle absolute paths and BytesIO streams correctly (#2407)
Fix AsrPipeline to handle absolute paths and BytesIO streams correctly

Signed-off-by: pixiake <guofeng@spader-ai.com>
Co-authored-by: pixiake <guofeng@spader-ai.com>
2025-10-10 09:37:15 +02:00
Michele Dolfi
0610d01afa fix: enrichment of documents without pages metadata (pptx and xlsx) (#2401)
fix logic for pptx and xlsx

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-07 18:28:51 +02:00
Maxim Lysak
9705f4020c fix: Proper heading support in rich tables for HTML backend (#2394)
* Fix for the proper headers support in rich tables in HTML

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* cleaning up

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Compatibility with older Python versions

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixing Furniture before the first heading rule

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added minimalistic test case

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* added html for the test

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-10-07 15:57:32 +02:00
Matvei Smirnov
ee73ffae15 fix(markdown): Setext heading support (#2359)
Signed-off-by: Matvei Smirnov <vdalekesmirnov@gmail.com>
Co-authored-by: Matvei Smirnov <matvei.smirnov@vkteam.ru>
2025-10-03 10:32:53 +02:00
Christoph Auer
ca2be7ff3a fix: Empty table handling (#2365)
* add table raw cells when no table structure model was used

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Add RichTableCell instance for tables with missing structure.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* update test results

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-10-02 19:35:16 +02:00
Michele Dolfi
4f295ed051 fix: add table raw content when no table structure model is used (#1815)
* add table raw cells when no table structure model was used

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Add RichTableCell instance for tables with missing structure.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-10-02 13:46:42 +02:00
Christoph Auer
1e9dc43b72 feat: Repetition-based StoppingCriteria for GraniteDocling (#2323)
* Experimental code for repetition detection, VLLM Streaming

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update VLLM Streaming

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update VLLM inference code, CLI and VLM specs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix generation and decoder args for HF model

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix vllm device args

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Bugfixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove streaming VLLM for the moment

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add repetition StoppingCriteria for GraniteDocling/SmolDocling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make GenerationStopper base class and port for MLX

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add streaming support and custom GenerationStopper support for ApiVlmModel

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for ApiVlmModel

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for ApiVlmModel

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix api_image_request_streaming when GenerationStopper triggers.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Move DocTagsRepetitionStopper to utility unit, update examples

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-09-30 15:26:09 +02:00
Christoph Auer
654c70f990 fix: Update Transformers & VLLM inference code, CLI and VLM specs (#2322)
* Update VLLM inference code, CLI and VLM specs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix generation and decoder args for HF model

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix vllm device args

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Bugfixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-09-29 21:06:54 +02:00
Maxim Lysak
c803abed9a feat: Rich tables support for HTML backend (#2324)
* Rich tables support for HTML backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Decoupling JATS backend from HTML backend, ways of creating tables changed significantly

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* updated and added tests

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Refactored parse_table_data in html_backend into few smaller functions

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Changing scope of few functions in html_backend.py, making them static, when possible

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fix for HTML tables that have tbody and/or thead, now these tables are also properly supported

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-09-29 18:12:16 +02:00
Lucas Morin
9d67bb9ed6 fix: support escaped characters in markdown backend (#2304)
fix: improve markdown backend to support input documents with escaped characters

Signed-off-by: Lucas Morin <lucas.morin222@gmail.com>
2025-09-23 18:00:16 +02:00
Maxim Lysak
e2482a2ada feat: Rich tables for MSWord backend (#2291)
* Adding support of rich table cells to MSWord backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixes for properly accounting lists, pictures and headers in rich table cells

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Cleaned up msword backend, re-generated docx tests

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added detection of simple table cells in word backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Cleaned up

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-09-22 16:41:59 +02:00
Cesar Berrospi Ramis
46efaaefee feat: add a backend parser for WebVTT files (#2288)
* feat: add a backend parser for WebVTT files

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: update README with VTT support

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: add description to supported formats

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: upgrade docling-core to unescape WebVTT in markdown

Pin the new release of docling-core 2.48.2.
Do not escape HTML reserved characters when exporting WebVTT documents to markdown.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test: add missing copyright notice

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-09-22 15:24:34 +02:00
manuflexor
b5628f1227 fix: correct y-axis scaling in draw_table_cells (#2287)
* Fix y axis

* DCO Remediation Commit for manuflexor <imanuel@flexor.ai>

I, manuflexor <imanuel@flexor.ai>, hereby add my Signed-off-by to this commit: cd56622d4f

Signed-off-by: manuflexor <imanuel@flexor.ai>

---------

Signed-off-by: manuflexor <imanuel@flexor.ai>
2025-09-19 13:42:29 +02:00
Christoph Auer
17afb664d0 feat: Add granite-docling model (#2272)
* adding granite-docling preview

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the model specs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* typo

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use granite-docling and add to the model downloader

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update docs and README

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Update final repo_ids for GraniteDocling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update final repo_ids for GraniteDocling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix model name in CLI usage example

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Fix VLM model name in README.md

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-09-17 15:15:49 +02:00
dmorady1
0e95171dd6 feat(RapidOcr): Support generic extra arguments for RapidOcr (#2266)
* feat: add support for additional parameters in RapidOcrOptions and fix RapidOcr font_path

* DCO Remediation Commit for David Morady <29502285+dmorady1@users.noreply.github.com>

I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 133d989060

Signed-off-by: David Morady <29502285+dmorady1@users.noreply.github.com>

* fix: RapidOcr ensure backwards compatibility and add deprecation note

* add warning log for rec_font_path

* DCO Remediation Commit for David Morady <29502285+dmorady1@users.noreply.github.com>

I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 133d989060
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 0a65eed28a
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: ac96f1483f

Signed-off-by: David Morady <29502285+dmorady1@users.noreply.github.com>

* add tests for code coverage for rapidocr

* DCO Remediation Commit for David Morady <29502285+dmorady1@users.noreply.github.com>

I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 133d989060
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 0a65eed28a
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: ac96f1483f
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: af5df4bb30

Signed-off-by: David Morady <29502285+dmorady1@users.noreply.github.com>

* add small comment for test

* DCO Remediation Commit for David Morady <29502285+dmorady1@users.noreply.github.com>

I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 133d989060
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 0a65eed28a
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: ac96f1483f
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: af5df4bb30
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: ab893b637f

Signed-off-by: David Morady <29502285+dmorady1@users.noreply.github.com>

* fix test  comment

* DCO Remediation Commit for David Morady <29502285+dmorady1@users.noreply.github.com>

I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 133d989060
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 0a65eed28a
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: ac96f1483f
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: af5df4bb30
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: ab893b637f
I, David Morady <29502285+dmorady1@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 028e332aa9

Signed-off-by: David Morady <29502285+dmorady1@users.noreply.github.com>

---------

Signed-off-by: David Morady <29502285+dmorady1@users.noreply.github.com>
2025-09-16 07:26:10 +02:00
Yuie.
609d902eef fix: handle empty result from RapidOCR to avoid crash (#2264)
Signed-off-by: Junehyuk Park <yuie@evonit.net>
2025-09-15 10:04:33 +02:00
Christoph Auer
0700af212c fix: Add missing features in ThreadedStandardPdfPipeline (#2252)
Add missing features in ThreadedStandardPdfPipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-09-11 16:26:02 +02:00
Michele Dolfi
2c9123419f feat: enrichment steps on all convert pipelines (incl docx, html, etc) (#2251)
* allow enrichment on all convert pipelines

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* set options in CLI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-09-11 15:09:00 +02:00
Michele Dolfi
c6965495a2 fix: address deprecation warnings of dependencies (#2237)
* switch to dtype instead of torch_dtype

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* set __check_model__ to avoid deprecation warnings

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove dataloaders warnings in easyocr

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* suppress with option

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-09-10 14:38:34 +02:00
Peter W. J. Staar
b49d1ad4f1 feat: updating default parameters to get better performance with docling-parse (#2208)
* updated the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the parameters

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-09-05 14:06:21 +02:00
Peter W. J. Staar
b3d7542061 feat: updated the backend for new docling-parse (#2187)
* updated the backend and pyproject.toml

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the version and test files

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the lock

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* forgot to add 1 updated test-file

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the lock

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-09-05 10:42:31 +02:00
Alina Ryan
2c3f6faf3d chore: update deprecation note for OcrEngine (#2200)
This commit updates the deprecated note to correctly point to
get_ocr_factory().registered_kind.

Signed-off-by: Alina Ryan <aliryan@redhat.com>
2025-09-05 08:24:14 +02:00
Nikos Livathinos
e38aa0f7f2 feat: Heron layout model as new default (#1971)
* feat: Switch default layout model to DOCLING_LAYOUT_HERON. Update the unit test data.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Use default layout model in model_downloader default args

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Use default layout model in model_downloader default args

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update docling-models tag for TableFormer

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test GT (from linux CPU)

Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>

* fix: Ensure that the visualisations happen on copies of the page image

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Pinpoint docling-ibm-models to the fix branch for the ReadingOrderPredictor

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update uv.lock

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update tests GT to match the Heron layout model and the improved reading order model in Linux

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Introduce the verify_doctags optional parameter in conversion tests to control if a doctags
comparison should take place. Skip doctags comparisons for certain tests.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Generate tests GT on Mac

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Remove the pinning of the docling-ibm-models and use the release 3.9.1

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
2025-09-03 12:45:22 +02:00
Cesar Berrospi Ramis
293e81bf9d fix(html): access to variable not yet declared (#2171)
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-09-02 07:59:55 +02:00
AndrewTsai0406
4d94e38223 fix(pypdfium2): Fix OCR bounding box misalignment caused by mismatched rotation metadata (#2039)
* Fix OCR bounding box misalignment caused by rotation metadata

Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>

* Add rotation-mismatch scanned pdf test case

Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>

* add ground truth for ocr_test_rotation_mismatch.pdf

Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>

* add ground truth for ocr_test_rotation_mismatch.pdf

Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>

* Updated test GT and merged from main

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix OCR test by excluding mismatched rotation example

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-09-01 17:22:43 +02:00
Christoph Auer
9f4bc5b2f1 feat: [Beta] Extraction with schema (#2138)
* Add DocumentConverter.extract and full extraction pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add DocumentConverter.extract template arg

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add NuExtract model

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add Extraction pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add proper test, support pydantic class types

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add qr bill example

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add base_extraction_pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add types

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update typing of ExtractionResult and inner fields

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Factor out extract to DocumentExtractor

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Address mypy issues

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add DocumentExtractor

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Resolve circular import issue

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Clean up imports, remove Optional for template arg

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Move new type definitions into datamodel

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update comments

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Respect page-range, disable test_extraction for CI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-09-01 16:09:48 +02:00
Qiefan Jiang
a283ccff25 feat(msexcel): set ContentLayer.INVISIBLE for invisible sheet (#1876)
* feat(msexcel): ignore invisible sheet

* DCO Remediation Commit for Qiefan Jiang <jiangqiefan@bytedance.com>

I, Qiefan Jiang <jiangqiefan@bytedance.com>, hereby add my Signed-off-by to this commit: ca391f4908f44f301de54a97057f0b809f5ce66c

Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>

* retain invisible sheet with ContentLayer.INVISIBLE

Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>

* update UT

Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>

* fix: use Optional for python3.9

Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>

* DCO Remediation Commit for Qiefan Jiang <jiangqiefan@bytedance.com>

I, Qiefan Jiang <jiangqiefan@bytedance.com>, hereby add my Signed-off-by to this commit: a34371a90e

Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>

---------

Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>
2025-09-01 13:53:45 +02:00
geoHeil
9904d14e6a fix: extend offline mode for rapidocr fonts (#2155)
feat: enable offline mode for docling models

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
2025-09-01 09:15:47 +02:00
Cesar Berrospi Ramis
fa3327e1a6 fix(html): preserve code blocks in list items (#2131)
* chore(html): refactor parser to leverage context managers

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(html): parse inline code snippets, also from list items

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(html): remove hidden tags

Remove tags that are not meant to be displayed.
Add regression tests for code blocks, inline code, and hidden tags.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-08-26 06:43:48 +02:00
geoHeil
3f60a0fa78 feat: Upgrade to RapidOCR 3.x (#2088)
* feat: exploring new version

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 5815c8f81b0e5ce400332597b6795e5a97ecf775

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* chore: autoformat

DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 5815c8f81b0e5ce400332597b6795e5a97ecf775

* feat: enable configurable runtime for rapidocr and handle new result better;

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* chore: fix linter

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* chore: use new server model

* chore:  change default engine type to onnx

* chore: tests update for new rapidocr

* fix: rebase from main and fix clashes

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 5815c8f81b0e5ce400332597b6795e5a97ecf775
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 02f9db85f562e5cdfda40c52fee55cfd4030d70a
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a7bcb205faedb881f94a89b3bbd29cb31ccd54f0
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a39482a98cbcff7a825c8321134732af0c65930a
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 63e9d717fa26951566b02761f3fdfc752c31f805
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: ef12a6ec1ea2846a8a8e2e776eeaa59c2a0c4dfe

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 2222d2340387f8d9d66f3ca9d8e21a0945a44e7a
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: bc6a1dc507d7f146ec4797a2d3840414f46ac64d
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 56e0d67da7c57d4b5caf8eaef8dff7056c3efd32
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 871ca21271412006c76acf3c19426140efed3d50
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 7b1b77159da729d483a581a86c7309acba1712a7
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a792a714a43e19a91b2b782f54621c1c5efda632

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: d1fed26323ff829b716bc667fe69532839363e45
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 346ec1cad943765f886e5d17fb0a54221124689c
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 4d0bbe5bd6e9f7261b97362ff8823af244267089
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 34a5ad53892a7064a6bf35f890d344d464c78b2f
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 9151959db3ad53535011d1cfdcf9181fdf936bb1
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 8ef5536f2c098826c6c0a05190f8a80614c3f3cb

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 7e18637a35
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 63fb8ff599
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 0cb9444fb8
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 38940d9978
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: b6d461ac42
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: ee55eb3408

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

---------

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
2025-08-25 12:10:33 +02:00
Raphael Norman-Tenazas
6736e66bb4 style: show converted page count in PaginatedPipeline debug statement (#2124)
* Show converted page count in PaginatedPipeline debug statement

* DCO Remediation Commit for Raphael Norman-Tenazas <tenazasr@gmail.com>

I, Raphael Norman-Tenazas <tenazasr@gmail.com>, hereby add my Signed-off-by to this commit: b7930bf56d

Signed-off-by: Raphael Norman-Tenazas <tenazasr@gmail.com>

* Show total progress instead of batch size

Signed-off-by: Raphael Norman-Tenazas <tenazasr@gmail.com>

---------

Signed-off-by: Raphael Norman-Tenazas <tenazasr@gmail.com>
2025-08-23 12:13:20 +02:00
VIktor Kuropiantnyk
cdf079dd06 feat(CLI): Option to download arbitrary HuggingFace model (#2123)
* Added option to docling-tools to download arbitrary HuggingFace model

Signed-off-by: Viktor Kuropiatnyk <vku@zurich.ibm.com>

* Added note in documentation

Signed-off-by: Viktor Kuropiatnyk <vku@zurich.ibm.com>

* Removed note on custom artifact path usage from HF download option

Signed-off-by: Viktor Kuropiatnyk <vku@zurich.ibm.com>

* Fixed typo

Signed-off-by: Viktor Kuropiatnyk <vku@zurich.ibm.com>

---------

Signed-off-by: Viktor Kuropiatnyk <vku@zurich.ibm.com>
2025-08-22 15:23:29 +02:00
Christoph Auer
3c660c0511 feat: batching support for VLMs in transformers backend, add initial VLLM backend (#2094)
* Prepare existing codes for use with new multi-stage VLM pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add multithreaded VLM pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add VLM task interpreters

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add VLM task interpreters

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove prints

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix KeyboardInterrupt behaviour

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add VLLM backend support, optimize process_images

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Tweak defaults

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Implement proper batch inference for HuggingFaceTransformersVlmModel

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Small fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup hf_transformers_model batching impl

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Adjust example instatiation of multi-stage VLM pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add GoT OCR 2.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Factor out changes without multi-stage pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reset defaults for generation

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add torch.compile, fix temperature setting in gen_kwargs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Expose page_batch_size on CLI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add torch_dtype bfloat16 to SMOLDOCLING and SMOLVLM model spec

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Clip off pad_token

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-22 13:17:33 +02:00
Nikhil Verma
3f03709885 fix: Improve numbered list detection for msword docs (#2100)
* Improve numbered list detection for msword docs

This fixes the list detection in MSWord docs by properly tracking and counting
the list entries. It fixes
https://github.com/docling-project/docling/issues/2090

* DCO Remediation Commit for Nikhil Verma <nikhilgotmail@gmail.com>

I, Nikhil Verma <nikhilgotmail@gmail.com>, hereby add my Signed-off-by to this commit: 509da6658e

Signed-off-by: Nikhil Verma <nikhilgotmail@gmail.com>

---------

Signed-off-by: Nikhil Verma <nikhilgotmail@gmail.com>
2025-08-22 10:38:34 +02:00
krrome
94fcc46aa9 feat(html): Support formatting tags in HTML texts (#2111)
* add parsing for formatting tags in HTML backend

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

fix latest tests + wiki_duck result files.

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

* convert _collect_parent_format_tags to staticmethod

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

---------

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
2025-08-22 10:37:34 +02:00
Christoph Auer
5f57ff2a45 perf: Clean up resources with docling-parse v4, no parsed_page output by default (#2105)
* Call PdfDocument.unload_pages from the pipelines where needed, delete parsed_page data unless requested to keep

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* pin docling-parse and update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Reinstate pipeline_options.generate_parsed_page

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-08-20 10:46:31 +02:00
Cesar Berrospi Ramis
c5f2e2fdd6 fix(HTML): parse footer tag as a group in furniture content layer (#2106)
* fix(HTML): parse footer tag as a section in furniture

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(HTML): add test for body vs furniture in HTML parser.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-08-20 08:42:25 +02:00
mohammed ahmed
8820b5558b perf: speed up function _parse_orientation (#1934)
* ️ Speed up function `_parse_orientation` by 242%
Here’s how you should rewrite the code for **maximum speed** based on your profiler.

- The _bottleneck_ is the line  
  ```python
  orientations = df_osd.loc[df_osd["key"] == "Orientation in degrees"].value.tolist()
  ```
  This does a dataframe filtering (`loc`) and then materializes a list for every call, which is slow.

- We can **vectorize** this search (avoid repeated boolean masking and conversion).
    - Instead of `.loc[df_osd["key"] == ...].value.tolist()`, use `.at[idx, 'value']` where `idx` is the first index where key matches, or better, `.values[0]` after a fast boolean mask.  
    - Since you only use the *first* matching value, you don’t need the full filtered column.

- You can optimize `parse_tesseract_orientation` by.
    - Storing `CLIPPED_ORIENTATIONS` as a set for O(1) lookup if it isn't already (can't change the global so just memoize locally).
    - Remove unnecessary steps.

**Here is your optimized code:**



**Why is this faster?**

- `_fast_get_orientation_value`:  
  - Avoids all index alignment overhead of `df.loc`.
  - Uses numpy arrays under the hood (thanks to `.values`) for direct boolean masking and fast nonzero lookup.
  - Fetches just the first match directly, skipping conversion to lists.
- Only fetches and processes the single cell you actually want.

**If you’re sure there’s always exactly one match:**  
You can simplify `_fast_get_orientation_value` to.



Or, if always sorted and single.


---

- **No semantics changed.**
- **Comments unchanged unless part modified.**

This approach should reduce the time spent in `_parse_orientation()` by almost two orders of magnitude, especially as the DataFrame grows.  
Let me know if you want further micro-optimizations (e.g., Cython, pre-fetched numpy conversions, etc.)!

* fix: pandas vet error

* DCO Remediation Commit for mohammed <mohammed18200118@gmail.com>

I, mohammed <mohammed18200118@gmail.com>, hereby add my Signed-off-by to this commit: d9824749bb

Signed-off-by: mohammed <mohammed18200118@gmail.com>

* Dummy commit to trigger CI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: mohammed <mohammed18200118@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-19 10:55:18 +02:00
Matteo
d2494da8b8 feat: new code formula model (#2042)
* new code formula model

Signed-off-by: mao <mao@lenny.zuvela.ibm.com>

* new model on hf

Signed-off-by: mao <mao@tabby.zuvela.ibm.com>

* pre-commits

Signed-off-by: mao <mao@login-c.zuvela.ibm.com>

* remove MPS

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: mao <mao@lenny.zuvela.ibm.com>
Signed-off-by: mao <mao@tabby.zuvela.ibm.com>
Signed-off-by: mao <mao@login-c.zuvela.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: mao <mao@lenny.zuvela.ibm.com>
Co-authored-by: mao <mao@tabby.zuvela.ibm.com>
Co-authored-by: mao <mao@login-c.zuvela.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-08-18 16:01:46 +02:00
Michele Dolfi
31087f3fcc feat: add backend for METS with Google Books profile (#1989)
* add backend for METS with Google Books profile

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Fixes for cell indexing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* use HTMLParser and add options from CLI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix typing and unloading

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* restore guess format

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename inputformat

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use PdfDocumentBackend

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use test file from test folder (still missing)

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add test file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-18 11:43:20 +02:00
krrome
9687297262 feat(html): Support in-line anchor tags in HTML texts (#1659)
* re-implement links for html backend.

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

* fix inline groups in list items. write specific test for find_parent_annotation of _extract_text_and_hyperlink_recursively.

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

* implement hack for images.

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

---------

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
2025-08-18 09:57:16 +02:00
Shkarupa Alex
5f050f94e1 feat(vlm): Ability to preprocess VLM response (#1907)
* Add ability to preprocess VLM response

Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>

* Move response decoding to vlm options (requires inheritance to override). Per-page prompt formulation also moved to vlm options to keep api consistent.

Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>

---------

Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>
2025-08-12 15:20:24 +02:00
Peter W. J. Staar
b09033cb73 feat: add convert_string to document-converter (#2069)
* feat: add convert_string to document-converter

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fix unsupported operand type(s) for |: type and NoneType

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added tests for convert_string

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-08-12 11:02:38 +02:00
Maroun Touma
ed56f2de5d fix(html): Parse rawspan and colspan when they include non numerical values (#2048)
* use re to stop at first non-digit

Signed-off-by: Maroun Touma <touma@us.ibm.com>

* Allow digit in first place followed by non numerical values

Signed-off-by: Maroun Touma <touma@us.ibm.com>

* refactor to match type checker

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Maroun Touma <touma@us.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-08-11 13:53:29 +02:00
TwoLeaves
0130e3ae96 fix: support new mlx-vlm module (#2001)
* fix stream_generate import statement

Signed-off-by: TwoLeaves <ohneherren@gmail.com>

* pin new mlx-vlm

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: TwoLeaves <ohneherren@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-07-31 14:13:17 +02:00
Michele Dolfi
2eb760d060 fix: extend error reporting when verbose logging is enabled (#2017)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-07-30 11:23:26 +02:00
Cesar Berrospi Ramis
86f70128aa fix(HTML): replace non-standard Unicode characters (#2006)
chore(HTML): replace non-standard Unicode characters for beter downstream tasks

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-07-29 11:05:35 +02:00
Christoph Auer
aed772ab33 feat: Threaded PDF pipeline (#1951)
* Initial async pdf pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* UpstreamAwareQueue

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Refactoring into async pipeline primitives and graph

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanups and safety improvements

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Better threaded PDF pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pin docling-ibm-models

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove unused args

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Revise pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Unload doc backend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Revert "Unload doc backend"

This reverts commit 01066f0b6e.

* Remove redundant method

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update threaded test

Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>

* Stop accumulating docs in test run

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix: don't starve on docs with > max_queue_size pages

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix: don't starve on docs with > max_queue_size pages

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* DCO Remediation Commit for Christoph Auer <cau@zurich.ibm.com>

I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: fa71cde950
I, Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>, hereby add my Signed-off-by to this commit: d66da87d96

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix: python3.9 compat

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Option to enable threadpool with doc_batch_concurrency setting

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Clean up unused code

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix settings defaults expectations

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Use released docling-ibm-models

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove ignores for typing/linting

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
2025-07-26 11:49:37 +02:00
Cesar Berrospi Ramis
aec29a7315 fix(markdown): ensure correct parsing of nested lists (#1995)
* fix(markdown): ensure correct parsing of nested lists

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: update dependencies in uv.lock file

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-07-25 15:17:57 +02:00
Cesar Berrospi Ramis
945721a15d fix(HTML): remove an unnecessary print command (#1988)
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-07-25 08:45:15 +02:00