Commit Graph

740 Commits

Author SHA1 Message Date
Nikos Livathinos
e38aa0f7f2 feat: Heron layout model as new default (#1971)
* feat: Switch default layout model to DOCLING_LAYOUT_HERON. Update the unit test data.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Use default layout model in model_downloader default args

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Use default layout model in model_downloader default args

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update docling-models tag for TableFormer

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test GT (from linux CPU)

Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>

* fix: Ensure that the visualisations happen on copies of the page image

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Pinpoint docling-ibm-models to the fix branch for the ReadingOrderPredictor

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update uv.lock

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update tests GT to match the Heron layout model and the improved reading order model in Linux

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Introduce the verify_doctags optional parameter in conversion tests to control if a doctags
comparison should take place. Skip doctags comparisons for certain tests.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Generate tests GT on Mac

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Remove the pinning of the docling-ibm-models and use the release 3.9.1

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
2025-09-03 12:45:22 +02:00
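Note on e38aa0f7f2: the `verify_doctags` switch mentioned in the message can be pictured as an optional flag on a comparison helper. The following is only a minimal sketch: the helper name `verify_conversion` and the dict-based exports are hypothetical, not docling's actual test code.

```python
def verify_conversion(pred: dict, truth: dict, verify_doctags: bool = True) -> None:
    """Compare predicted exports against ground truth (hypothetical helper)."""
    # The markdown comparison always runs.
    assert pred["markdown"] == truth["markdown"], "markdown export differs"
    # The doctags comparison runs only when the caller asks for it.
    if verify_doctags:
        assert pred["doctags"] == truth["doctags"], "doctags export differs"
```

A test whose doctags output is known to differ across platforms could then call `verify_conversion(pred, truth, verify_doctags=False)` and still check the remaining exports.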
Cesar Berrospi Ramis
293e81bf9d fix(html): access to variable not yet declared (#2171)
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-09-02 07:59:55 +02:00
github-actions[bot]
d68d8b678e chore: bump version to 2.49.0 [skip ci] v2.49.0 2025-09-01 16:39:43 +00:00
AndrewTsai0406
4d94e38223 fix(pypdfium2): Fix OCR bounding box misalignment caused by mismatched rotation metadata (#2039)
* Fix OCR bounding box misalignment caused by rotation metadata

Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>

* Add rotation-mismatch scanned pdf test case

Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>

* add ground truth for ocr_test_rotation_mismatch.pdf

Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>

* add ground truth for ocr_test_rotation_mismatch.pdf

Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>

* Updated test GT and merged from main

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix OCR test by excluding mismatched rotation example

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-09-01 17:22:43 +02:00
Christoph Auer
9f4bc5b2f1 feat: [Beta] Extraction with schema (#2138)
* Add DocumentConverter.extract and full extraction pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add DocumentConverter.extract template arg

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add NuExtract model

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add Extraction pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add proper test, support pydantic class types

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add qr bill example

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add base_extraction_pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add types

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update typing of ExtractionResult and inner fields

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Factor out extract to DocumentExtractor

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Address mypy issues

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add DocumentExtractor

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Resolve circular import issue

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Clean up imports, remove Optional for template arg

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Move new type definitions into datamodel

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update comments

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Respect page-range, disable test_extraction for CI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-09-01 16:09:48 +02:00
Qiefan Jiang
a283ccff25 feat(msexcel): set ContentLayer.INVISIBLE for invisible sheet (#1876)
* feat(msexcel): ignore invisible sheet

* DCO Remediation Commit for Qiefan Jiang <jiangqiefan@bytedance.com>

I, Qiefan Jiang <jiangqiefan@bytedance.com>, hereby add my Signed-off-by to this commit: ca391f4908f44f301de54a97057f0b809f5ce66c

Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>

* retain invisible sheet with ContentLayer.INVISIBLE

Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>

* update UT

Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>

* fix: use Optional for python3.9

Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>

* DCO Remediation Commit for Qiefan Jiang <jiangqiefan@bytedance.com>

I, Qiefan Jiang <jiangqiefan@bytedance.com>, hereby add my Signed-off-by to this commit: a34371a90e

Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>

---------

Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>
2025-09-01 13:53:45 +02:00
Panos Vagenas
be26044f14 chore: update docling-core lock (#2169)
* chore: upgrade docling-core

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* upgrade lock

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-09-01 13:46:10 +02:00
Shikhar Bhardwaj
9f0286bcac fix: translation example (#2166)
* fix: translation example

Signed-off-by: shikharbhardwaj <8502456+shikharbhardwaj@users.noreply.github.com>

* Fix translation example formatting

Signed-off-by: shikharbhardwaj <8502456+shikharbhardwaj@users.noreply.github.com>

---------

Signed-off-by: shikharbhardwaj <8502456+shikharbhardwaj@users.noreply.github.com>
2025-09-01 11:04:46 +02:00
geoHeil
9904d14e6a fix: extend offline mode for rapidocr fonts (#2155)
feat: enable offline mode for docling models

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
2025-09-01 09:15:47 +02:00
Panos Vagenas
96cab6b536 docs: enrich landing pages (#2165)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-08-29 17:19:05 +02:00
github-actions[bot]
fb3b7b93ae chore: bump version to 2.48.0 [skip ci] v2.48.0 2025-08-26 05:29:31 +00:00
Cesar Berrospi Ramis
fa3327e1a6 fix(html): preserve code blocks in list items (#2131)
* chore(html): refactor parser to leverage context managers

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(html): parse inline code snippets, also from list items

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(html): remove hidden tags

Remove tags that are not meant to be displayed.
Add regression tests for code blocks, inline code, and hidden tags.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-08-26 06:43:48 +02:00
Michele Dolfi
c0268416cf chore: add analytics (#2133)
add analytics

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-08-25 18:25:38 +02:00
Michele Dolfi
d32d2c97e1 chore: PR approval reminder (#2132)
PR approval reminder

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-08-25 15:08:37 +02:00
geoHeil
3f60a0fa78 feat: Upgrade to RapidOCR 3.x (#2088)
* feat: exploring new version

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 5815c8f81b0e5ce400332597b6795e5a97ecf775

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* chore: autoformat

DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 5815c8f81b0e5ce400332597b6795e5a97ecf775

* feat: enable configurable runtime for rapidocr and handle new result better;

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* chore: fix linter

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* chore: use new server model

* chore: change default engine type to onnx

* chore: tests update for new rapidocr

* fix: rebase from main and fix clashes

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 5815c8f81b0e5ce400332597b6795e5a97ecf775
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 02f9db85f562e5cdfda40c52fee55cfd4030d70a
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a7bcb205faedb881f94a89b3bbd29cb31ccd54f0
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a39482a98cbcff7a825c8321134732af0c65930a
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 63e9d717fa26951566b02761f3fdfc752c31f805
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: ef12a6ec1ea2846a8a8e2e776eeaa59c2a0c4dfe

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 2222d2340387f8d9d66f3ca9d8e21a0945a44e7a
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: bc6a1dc507d7f146ec4797a2d3840414f46ac64d
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 56e0d67da7c57d4b5caf8eaef8dff7056c3efd32
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 871ca21271412006c76acf3c19426140efed3d50
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 7b1b77159da729d483a581a86c7309acba1712a7
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a792a714a43e19a91b2b782f54621c1c5efda632

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: d1fed26323ff829b716bc667fe69532839363e45
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 346ec1cad943765f886e5d17fb0a54221124689c
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 4d0bbe5bd6e9f7261b97362ff8823af244267089
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 34a5ad53892a7064a6bf35f890d344d464c78b2f
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 9151959db3ad53535011d1cfdcf9181fdf936bb1
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 8ef5536f2c098826c6c0a05190f8a80614c3f3cb

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>

I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 7e18637a35
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 63fb8ff599
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 0cb9444fb8
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 38940d9978
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: b6d461ac42
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: ee55eb3408

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

---------

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
2025-08-25 12:10:33 +02:00
github-actions[bot]
2aef5cf328 chore: bump version to 2.47.1 [skip ci] v2.47.1 2025-08-23 14:11:33 +00:00
Michele Dolfi
488f6cdd2d fix: vllm extra only for linux x86_64 (#2126)
vllm extra only for linux x86_64

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-08-23 13:33:15 +02:00
Raphael Norman-Tenazas
6736e66bb4 style: show converted page count in PaginatedPipeline debug statement (#2124)
* Show converted page count in PaginatedPipeline debug statement

* DCO Remediation Commit for Raphael Norman-Tenazas <tenazasr@gmail.com>

I, Raphael Norman-Tenazas <tenazasr@gmail.com>, hereby add my Signed-off-by to this commit: b7930bf56d

Signed-off-by: Raphael Norman-Tenazas <tenazasr@gmail.com>

* Show total progress instead of batch size

Signed-off-by: Raphael Norman-Tenazas <tenazasr@gmail.com>

---------

Signed-off-by: Raphael Norman-Tenazas <tenazasr@gmail.com>
2025-08-23 12:13:20 +02:00
github-actions[bot]
b04e205d1e chore: bump version to 2.47.0 [skip ci] v2.47.0 2025-08-22 14:15:39 +00:00
VIktor Kuropiantnyk
cdf079dd06 feat(CLI): Option to download arbitrary HuggingFace model (#2123)
* Added option to docling-tools to download arbitrary HuggingFace model

Signed-off-by: Viktor Kuropiatnyk <vku@zurich.ibm.com>

* Added note in documentation

Signed-off-by: Viktor Kuropiatnyk <vku@zurich.ibm.com>

* Removed note on custom artifact path usage from HF download option

Signed-off-by: Viktor Kuropiatnyk <vku@zurich.ibm.com>

* Fixed typo

Signed-off-by: Viktor Kuropiatnyk <vku@zurich.ibm.com>

---------

Signed-off-by: Viktor Kuropiatnyk <vku@zurich.ibm.com>
2025-08-22 15:23:29 +02:00
Michele Dolfi
449bde0a6c test: update docx reference results (#2122)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-08-22 14:26:36 +02:00
Christoph Auer
3c660c0511 feat: batching support for VLMs in transformers backend, add initial VLLM backend (#2094)
* Prepare existing codes for use with new multi-stage VLM pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add multithreaded VLM pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add VLM task interpreters

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add VLM task interpreters

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove prints

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix KeyboardInterrupt behaviour

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add VLLM backend support, optimize process_images

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Tweak defaults

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Implement proper batch inference for HuggingFaceTransformersVlmModel

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Small fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup hf_transformers_model batching impl

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Adjust example instantiation of multi-stage VLM pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add GoT OCR 2.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Factor out changes without multi-stage pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reset defaults for generation

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add torch.compile, fix temperature setting in gen_kwargs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Expose page_batch_size on CLI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add torch_dtype bfloat16 to SMOLDOCLING and SMOLVLM model spec

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Clip off pad_token

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-22 13:17:33 +02:00
Nikhil Verma
3f03709885 fix: Improve numbered list detection for msword docs (#2100)
* Improve numbered list detection for msword docs

This fixes the list detection in MSWord docs by properly tracking and counting
the list entries. It fixes
https://github.com/docling-project/docling/issues/2090

* DCO Remediation Commit for Nikhil Verma <nikhilgotmail@gmail.com>

I, Nikhil Verma <nikhilgotmail@gmail.com>, hereby add my Signed-off-by to this commit: 509da6658e

Signed-off-by: Nikhil Verma <nikhilgotmail@gmail.com>

---------

Signed-off-by: Nikhil Verma <nikhilgotmail@gmail.com>
2025-08-22 10:38:34 +02:00
krrome
94fcc46aa9 feat(html): Support formatting tags in HTML texts (#2111)
* add parsing for formatting tags in HTML backend

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

fix latest tests + wiki_duck result files.

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

* convert _collect_parent_format_tags to staticmethod

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

---------

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
2025-08-22 10:37:34 +02:00
Maroun Touma
e76298c40d docs: DPK pipeline example using docling library (#2112)
* Notebook showing example on how to use docling transforms in DPK

Signed-off-by: Maroun Touma <touma@us.ibm.com>

* fix HF Token name

Signed-off-by: Maroun Touma <touma@us.ibm.com>

* use %pip instead of pip install jupyter lab

Signed-off-by: Maroun Touma <touma@us.ibm.com>

* run formatter

Signed-off-by: Maroun Touma <touma@us.ibm.com>

* add example to mkdocs and fix typo

Signed-off-by: Maroun Touma <touma@us.ibm.com>

---------

Signed-off-by: Maroun Touma <touma@us.ibm.com>
2025-08-21 10:14:36 +02:00
Panos Vagenas
8996d612aa docs: add Getting Started page (#2113)
* docs: add Getting Started page

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* refactor usage

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* minor renaming

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-08-21 08:44:53 +02:00
github-actions[bot]
555506d8e6 chore: bump version to 2.46.0 [skip ci] v2.46.0 2025-08-20 15:25:07 +00:00
Panos Vagenas
76d2cb76b3 chore: update docling-core lock (#2110)
* chore: pre-check docling-core 2.45.0

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update -core pinning

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-08-20 16:41:48 +02:00
Christoph Auer
5f57ff2a45 perf: Clean up resources with docling-parse v4, no parsed_page output by default (#2105)
* Call PdfDocument.unload_pages from the pipelines where needed, delete parsed_page data unless requested to keep

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* pin docling-parse and update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Reinstate pipeline_options.generate_parsed_page

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-08-20 10:46:31 +02:00
Cesar Berrospi Ramis
c5f2e2fdd6 fix(HTML): parse footer tag as a group in furniture content layer (#2106)
* fix(HTML): parse footer tag as a section in furniture

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(HTML): add test for body vs furniture in HTML parser.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-08-20 08:42:25 +02:00
mohammed ahmed
8820b5558b perf: speed up function _parse_orientation (#1934)
* Speed up function `_parse_orientation` by 242%
Here’s how you should rewrite the code for **maximum speed** based on your profiler.

- The _bottleneck_ is the line  
  ```python
  orientations = df_osd.loc[df_osd["key"] == "Orientation in degrees"].value.tolist()
  ```
  This does a dataframe filtering (`loc`) and then materializes a list for every call, which is slow.

- We can **vectorize** this search (avoid repeated boolean masking and conversion).
    - Instead of `.loc[df_osd["key"] == ...].value.tolist()`, use `.at[idx, 'value']` where `idx` is the first index where key matches, or better, `.values[0]` after a fast boolean mask.  
    - Since you only use the *first* matching value, you don’t need the full filtered column.

- You can optimize `parse_tesseract_orientation` by:
    - Storing `CLIPPED_ORIENTATIONS` as a set for O(1) lookup if it isn't already (can't change the global so just memoize locally).
    - Remove unnecessary steps.

**Here is your optimized code:**



**Why is this faster?**

- `_fast_get_orientation_value`:  
  - Avoids all index alignment overhead of `df.loc`.
  - Uses numpy arrays under the hood (thanks to `.values`) for direct boolean masking and fast nonzero lookup.
  - Fetches just the first match directly, skipping conversion to lists.
- Only fetches and processes the single cell you actually want.

**If you’re sure there’s always exactly one match:**  
You can simplify `_fast_get_orientation_value` to.



Or, if always sorted and single.


---

- **No semantics changed.**
- **Comments unchanged unless part modified.**

This approach should reduce the time spent in `_parse_orientation()` by almost two orders of magnitude, especially as the DataFrame grows.  
Let me know if you want further micro-optimizations (e.g., Cython, pre-fetched numpy conversions, etc.)!

* fix: pandas vet error

* DCO Remediation Commit for mohammed <mohammed18200118@gmail.com>

I, mohammed <mohammed18200118@gmail.com>, hereby add my Signed-off-by to this commit: d9824749bb

Signed-off-by: mohammed <mohammed18200118@gmail.com>

* Dummy commit to trigger CI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: mohammed <mohammed18200118@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-19 10:55:18 +02:00
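Note on 8820b5558b: the first-match lookup described in the message can be sketched as follows. This is a minimal sketch, assuming a DataFrame with `key` and `value` columns as in the quoted line. The helper name `first_orientation_value` and the sample rows are hypothetical, not the committed code.

```python
import pandas as pd


def first_orientation_value(df_osd: pd.DataFrame, key: str = "Orientation in degrees"):
    """Return the first `value` whose `key` matches, or None when no row matches."""
    # Compare on the underlying NumPy array to skip pandas index alignment.
    mask = df_osd["key"].to_numpy() == key
    if not mask.any():
        return None
    # argmax on a boolean array gives the position of the first True.
    return df_osd["value"].to_numpy()[mask.argmax()]


# Hypothetical sample resembling key/value OSD output.
df_osd = pd.DataFrame(
    {
        "key": ["Page number", "Orientation in degrees", "Rotate"],
        "value": ["0", "90", "90"],
    }
)
orientation = first_orientation_value(df_osd)
```

Only the single matching cell is read, so no intermediate filtered frame or list is built.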
Michele Dolfi
956f82f115 chore: upgrade dependencies in lock file (#2093)
* chore: upgrade lock file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix(markdown): update binary hash of a markdown backend ground truth file

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-08-19 10:11:44 +02:00
Matteo
d2494da8b8 feat: new code formula model (#2042)
* new code formula model

Signed-off-by: mao <mao@lenny.zuvela.ibm.com>

* new model on hf

Signed-off-by: mao <mao@tabby.zuvela.ibm.com>

* pre-commits

Signed-off-by: mao <mao@login-c.zuvela.ibm.com>

* remove MPS

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: mao <mao@lenny.zuvela.ibm.com>
Signed-off-by: mao <mao@tabby.zuvela.ibm.com>
Signed-off-by: mao <mao@login-c.zuvela.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: mao <mao@lenny.zuvela.ibm.com>
Co-authored-by: mao <mao@tabby.zuvela.ibm.com>
Co-authored-by: mao <mao@login-c.zuvela.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-08-18 16:01:46 +02:00
github-actions[bot]
c3a7d1d999 chore: bump version to 2.45.0 [skip ci] v2.45.0 2025-08-18 10:25:51 +00:00
Michele Dolfi
31087f3fcc feat: add backend for METS with Google Books profile (#1989)
* add backend for METS with Google Books profile

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Fixes for cell indexing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* use HTMLParser and add options from CLI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix typing and unloading

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* restore guess format

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename inputformat

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use PdfDocumentBackend

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use test file from test folder (still missing)

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add test file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-18 11:43:20 +02:00
krrome
9687297262 feat(html): Support in-line anchor tags in HTML texts (#1659)
* re-implement links for html backend.

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

* fix inline groups in list items. write specific test for find_parent_annotation of _extract_text_and_hyperlink_recursively.

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

* implement hack for images.

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

---------

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
2025-08-18 09:57:16 +02:00
Eric Deandrea
76c1fbd6e8 docs: Add docling Quarkus integration (#2083)
* Add docling Quarkus integration

* DCO Remediation Commit for Eric Deandrea <eric.deandrea@ibm.com>

I, Eric Deandrea <eric.deandrea@ibm.com>, hereby add my Signed-off-by to this commit: 86aa0b80f4

Signed-off-by: Eric Deandrea <eric.deandrea@ibm.com>

---------

Signed-off-by: Eric Deandrea <eric.deandrea@ibm.com>
2025-08-18 06:55:51 +02:00
Shkarupa Alex
5f050f94e1 feat(vlm): Ability to preprocess VLM response (#1907)
* Add ability to preprocess VLM response

Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>

* Move response decoding to vlm options (requires inheritance to override). Per-page prompt formulation also moved to vlm options to keep the API consistent.

Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>

---------

Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>
2025-08-12 15:20:24 +02:00
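Note on 5f050f94e1: the message says response decoding and per-page prompt formulation live on the VLM options object and are overridden through inheritance. A minimal sketch of that pattern, assuming hypothetical class and method names (`SketchVlmOptions`, `build_prompt`, `decode_response`) that are not confirmed to match docling's API:

```python
from dataclasses import dataclass


@dataclass
class SketchVlmOptions:
    """Hypothetical stand-in for a VLM options class."""

    prompt: str = "Convert this page to markdown."

    def build_prompt(self, page_no: int) -> str:
        # Default: the same prompt for every page.
        return self.prompt

    def decode_response(self, text: str) -> str:
        # Default: pass the raw model output through unchanged.
        return text


@dataclass
class PrefixStrippingOptions(SketchVlmOptions):
    """Subclass that customizes both hooks."""

    def build_prompt(self, page_no: int) -> str:
        return f"{self.prompt} (page {page_no})"

    def decode_response(self, text: str) -> str:
        # Drop a chat-style prefix some models emit, then trim whitespace.
        return text.removeprefix("Assistant:").strip()
```

A pipeline that calls `options.decode_response(raw)` after each generation would pick up the subclass behaviour without any change to the pipeline itself.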
github-actions[bot]
ccfee05847 chore: bump version to 2.44.0 [skip ci] v2.44.0 2025-08-12 09:51:35 +00:00
Peter W. J. Staar
b09033cb73 feat: add convert_string to document-converter (#2069)
* feat: add convert_string to document-converter

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fix unsupported operand type(s) for |: type and NoneType

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added tests for convert_string

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-08-12 11:02:38 +02:00
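Note on b09033cb73: the error quoted in the message, `unsupported operand type(s) for |: type and NoneType`, is what Python 3.9 raises when an annotation such as `str | None` is evaluated, because union syntax on classes arrived in Python 3.10. A minimal sketch of the portable spelling follows. The function `label_document` is hypothetical and only illustrates the annotation style.

```python
from typing import Optional


def label_document(content: str, name: Optional[str] = None) -> str:
    # Optional[str] evaluates fine on Python 3.9; `str | None` here would
    # raise TypeError at definition time unless annotations are deferred.
    title = name if name is not None else "untitled"
    return f"{title}: {len(content)} characters"
```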
Panos Vagenas
e2cca931be docs: add Langflow integration (#2068)
* docs: add langflow integration

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* fix link

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-08-11 16:03:29 +02:00
Maroun Touma
ed56f2de5d fix(html): Parse rowspan and colspan when they include non-numerical values (#2048)
* use re to stop at first non-digit

Signed-off-by: Maroun Touma <touma@us.ibm.com>

* Allow digit in first place followed by non numerical values

Signed-off-by: Maroun Touma <touma@us.ibm.com>

* refactor to match type checker

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Maroun Touma <touma@us.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-08-11 13:53:29 +02:00
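Note on ed56f2de5d: the approach named in the message, using `re` to stop at the first non-digit, can be sketched as below. The helper name `parse_span` and the fallback value of 1 are assumptions for illustration, not the committed implementation.

```python
import re
from typing import Optional

# Leading digits, optionally preceded by whitespace.
_LEADING_INT = re.compile(r"\s*(\d+)")


def parse_span(raw: Optional[str], default: int = 1) -> int:
    """Read the leading integer of a rowspan/colspan attribute value."""
    if raw is None:
        return default
    match = _LEADING_INT.match(str(raw))
    if match is None:
        # No leading digit at all, e.g. "abc".
        return default
    return int(match.group(1))
```

With this sketch, values such as `"2"`, `"3;"` and `"2px"` all yield their leading number, while a missing or purely non-numeric attribute falls back to the default.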
Thomas Vitale
bfda6d34d8 docs: Add Arconia integration (#2061)
Signed-off-by: Thomas Vitale <ThomasVitale@users.noreply.github.com>
2025-08-08 09:35:47 +02:00
Michele Dolfi
c5f49dc2db chore: upgrade locked dependencies (#2024)
lock new deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-07-31 16:05:27 +02:00
TwoLeaves
0130e3ae96 fix: support new mlx-vlm module (#2001)
* fix stream_generate import statement

Signed-off-by: TwoLeaves <ohneherren@gmail.com>

* pin new mlx-vlm

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: TwoLeaves <ohneherren@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-07-31 14:13:17 +02:00
Michele Dolfi
2eb760d060 fix: extend error reporting when verbose logging is enabled (#2017)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-07-30 11:23:26 +02:00
Cesar Berrospi Ramis
86f70128aa fix(HTML): replace non-standard Unicode characters (#2006)
chore(HTML): replace non-standard Unicode characters for better downstream tasks

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-07-29 11:05:35 +02:00
github-actions[bot]
aae42b37a8 chore: bump version to 2.43.0 [skip ci] v2.43.0 2025-07-28 09:45:53 +00:00
Christoph Auer
aed772ab33 feat: Threaded PDF pipeline (#1951)
* Initial async pdf pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* UpstreamAwareQueue

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Refactoring into async pipeline primitives and graph

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanups and safety improvements

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Better threaded PDF pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pin docling-ibm-models

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove unused args

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Revise pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Unload doc backend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Revert "Unload doc backend"

This reverts commit 01066f0b6e.

* Remove redundant method

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update threaded test

Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>

* Stop accumulating docs in test run

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix: don't starve on docs with > max_queue_size pages

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix: don't starve on docs with > max_queue_size pages

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* DCO Remediation Commit for Christoph Auer <cau@zurich.ibm.com>

I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: fa71cde950
I, Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>, hereby add my Signed-off-by to this commit: d66da87d96

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix: python3.9 compat

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Option to enable threadpool with doc_batch_concurrency setting

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Clean up unused code

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix settings defaults expectations

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Use released docling-ibm-models

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove ignores for typing/linting

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
2025-07-26 11:49:37 +02:00
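Note on aed772ab33: the message mentions documents with more pages than `max_queue_size`. One general way a bounded queue copes with that is to start the consumer before feeding, so that a blocked `put` resumes as items are drained. The sketch below illustrates only that general idea with the standard library. The function `run_pipeline` and the doubling "stage" are hypothetical and do not reproduce the commit's pipeline.

```python
import queue
import threading


def run_pipeline(pages, max_queue_size: int = 2):
    """Push pages through one worker stage over a bounded queue."""
    q = queue.Queue(maxsize=max_queue_size)
    results = []
    done = object()  # sentinel marking the end of the document

    def worker():
        while True:
            item = q.get()
            if item is done:
                break
            results.append(item * 2)  # placeholder for real page processing

    consumer = threading.Thread(target=worker)
    consumer.start()  # consumer runs before the first put
    for page in pages:
        q.put(page)  # blocks while the queue is full, resumes once drained
    q.put(done)
    consumer.join()
    return results
```

Calling `run_pipeline(range(10), max_queue_size=2)` completes even though the input is five times larger than the queue, because the producer and consumer make progress concurrently.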
Cesar Berrospi Ramis
aec29a7315 fix(markdown): ensure correct parsing of nested lists (#1995)
* fix(markdown): ensure correct parsing of nested lists

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: update dependencies in uv.lock file

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-07-25 15:17:57 +02:00