docling

mirror of https://github.com/DS4SD/docling.git synced 2025-12-08 20:58:11 +00:00

Author	SHA1	Message	Date
Yuie.	609d902eef	fix: handle empty result from RapidOCR to avoid crash (#2264 ) Signed-off-by: Junehyuk Park <yuie@evonit.net>	2025-09-15 10:04:33 +02:00
github-actions[bot]	10bb0aee2d	chore: bump version to 2.52.0 [skip ci] v2.52.0	2025-09-11 16:11:20 +00:00
Christoph Auer	0700af212c	fix: Add missing features in ThreadedStandardPdfPipeline (#2252 ) Add missing features in ThreadedStandardPdfPipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-09-11 16:26:02 +02:00
Michele Dolfi	2c9123419f	feat: enrichment steps on all convert pipelines (incl docx, html, etc) (#2251 ) * allow enrichment on all convert pipelines Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * set options in CLI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-09-11 15:09:00 +02:00
Michele Dolfi	c6965495a2	fix: address deprecation warnings of dependencies (#2237 ) * switch to dtype instead of torch_dtype Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * set __check_model__ to avoid deprecation warnings Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * remove dataloaders warnings in easyocr Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * suppress with option Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-09-10 14:38:34 +02:00
Cesar Berrospi Ramis	f8cc545bab	docs: add an example of RAG with OpenSearch (#2238 ) * docs: add an example of RAG with OpeanSearch Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore: pin latest docling-core and update uv.lock Pin latest version release of docling-core in pyproject.toml Update the dependencies in uv.lock file Run the notebook rag_opensearch.ipynb to pick up changes from docling-core Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-09-10 14:37:22 +02:00
Roy Derks	e5cd7020bd	docs: Add instructions for using Docling with MCP to README (#2219 ) * docs: Add instructions for using Docling with MCP to README * DCO Remediation Commit for Roy Derks <10717410+royderks@users.noreply.github.com> Signed-off-by: Roy Derks <roy.derks@ibm.com> * DCO Remediation Commit for Roy Derks <10717410+royderks@users.noreply.github.com> I, Roy Derks <10717410+royderks@users.noreply.github.com>, hereby add my Signed-off-by to this commit: `4b9ba1d0ef` Signed-off-by: Roy Derks <roy.derks@ibm.com> * docs: reorganize documentation on MCP server Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * docs: align README with documentation index page Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Roy Derks <roy.derks@ibm.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Co-authored-by: Roy Derks <roy.derks@ibm.com> Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-09-10 10:02:28 +02:00
Tamás Bitai	55f5f3752f	docs: Document VLM support requirement in extraction example (#2231 ) * docs: Document VLM support requirement in extraction example * DCO Remediation Commit for Tamás Bitai <bitai.tamas@gmail.com> I, Tamás Bitai <bitai.tamas@gmail.com>, hereby add my Signed-off-by to this commit: `b90defdb77` Signed-off-by: Tamás Bitai <bitai.tamas@gmail.com> --------- Signed-off-by: Tamás Bitai <bitai.tamas@gmail.com>	2025-09-09 13:45:55 +02:00
github-actions[bot]	df60673992	chore: bump version to 2.51.0 [skip ci] v2.51.0	2025-09-05 13:01:33 +00:00
Peter W. J. Staar	b49d1ad4f1	feat: updating default parameters to get better performance with docling-parse (#2208 ) * updated the code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the parameters Signed-off-by: Peter Staar <taa@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com>	2025-09-05 14:06:21 +02:00
Panos Vagenas	a9f41b088e	docs: add information extraction example (#2199 ) * docs: add information exctraction example Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * update README Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * minor typo Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * update README Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-09-05 11:27:09 +02:00
Peter W. J. Staar	b3d7542061	feat: updated the backend for new docling-parse (#2187 ) * updated the backend and pyproject.toml Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the version and test files Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the lock Signed-off-by: Peter Staar <taa@zurich.ibm.com> * forgot to add 1 updated test-file Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the lock Signed-off-by: Peter Staar <taa@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com>	2025-09-05 10:42:31 +02:00
Alina Ryan	2c3f6faf3d	chore: update deprecation note for OcrEngine (#2200 ) This commit updates the deprecated note to correctly point to get_ocr_factory().registered_kind. Signed-off-by: Alina Ryan <aliryan@redhat.com>	2025-09-05 08:24:14 +02:00
github-actions[bot]	3419c42f10	chore: bump version to 2.50.0 [skip ci] v2.50.0	2025-09-03 11:39:08 +00:00
Nikos Livathinos	e38aa0f7f2	feat: Heron layout model as new default (#1971 ) * feat: Switch default layout model to DOCLING_LAYOUT_HERON. Update the unit test data. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * Use default layout model in model_downloader default args Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Use default layout model in model_downloader default args Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update docling-models tag for TableFormer Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update test GT Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update test GT (from linux CPU) Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal> * fix: Ensure that the visualisations happen on copies of the page image Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * chore: Pinpoint docling-ibm-models to the fix branch for the ReadingOrderPredictor Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * chore: Update uv.lock Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * chore: Update tests GT to match the Heron layout model and the improved reading order model in Linux Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * fix: Introduce the verify_doctags optional parameter in conversion tests to control if a doctags comparison should take place. Skip doctags comparisons for certain tests. 
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * chore: Generate tests GT on Mac Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * chore: Remove the pinning of the docling-ibm-models and use the release 3.9.1 Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> --------- Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal> Co-authored-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>	2025-09-03 12:45:22 +02:00
Cesar Berrospi Ramis	293e81bf9d	fix(html): access to variable not yet declared (#2171 ) Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-09-02 07:59:55 +02:00
github-actions[bot]	d68d8b678e	chore: bump version to 2.49.0 [skip ci] v2.49.0	2025-09-01 16:39:43 +00:00
AndrewTsai0406	4d94e38223	fix(pypdfium2): Fix OCR bounding box misalignment caused by mismatched rotation metadata (#2039 ) * Fix OCR bounding box misalignment caused by rotation metadata Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com> * Add rotation-mismatch scanned pdf test case Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com> * add ground truth for ocr_test_rotation_mismatch.pdf Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com> * add ground truth for ocr_test_rotation_mismatch.pdf Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com> * Updated test GT and merged from main Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix OCR test by excluding mismatched rotation example Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-09-01 17:22:43 +02:00
Christoph Auer	9f4bc5b2f1	feat: [Beta] Extraction with schema (#2138 ) * Add DocumentConverter.extract and full extraction pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add DocumentConverter.extract template arg Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add NuExtract model Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add Extraction pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add proper test, support pydantic class types Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add qr bill example Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add base_extraction_pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add types Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update typing of ExtractionResult and inner fields Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Factor out extract to DocumentExtractor Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Address mypy issues Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add DocumentExtractor Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Resolve circular import issue Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Clean up imports, remove Optional for template arg Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Move new type definitions into datamodel Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update comments Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Respect page-range, disable test_extraction for CI Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-09-01 16:09:48 +02:00
Qiefan Jiang	a283ccff25	feat(msexcel): set ContentLayer.INVISIBLE for invisible sheet (#1876 ) * feat(msexcel): ignore invisible sheet * DCO Remediation Commit for Qiefan Jiang <jiangqiefan@bytedance.com> I, Qiefan Jiang <jiangqiefan@bytedance.com>, hereby add my Signed-off-by to this commit: ca391f4908f44f301de54a97057f0b809f5ce66c Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com> * retain invisible sheet with ContentLayer.INVISIBLE Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com> * update UT Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com> * fix: use Optional for python3.9 Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com> * DCO Remediation Commit for Qiefan Jiang <jiangqiefan@bytedance.com> I, Qiefan Jiang <jiangqiefan@bytedance.com>, hereby add my Signed-off-by to this commit: `a34371a90e` Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com> --------- Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>	2025-09-01 13:53:45 +02:00
Panos Vagenas	be26044f14	chore: update docling-core lock (#2169 ) * chore: upgrade docling-core Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * upgrade lock Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-09-01 13:46:10 +02:00
Shikhar Bhardwaj	9f0286bcac	fix: translation example (#2166 ) * fix: translation example Signed-off-by: shikharbhardwaj <8502456+shikharbhardwaj@users.noreply.github.com> * Fix translation example formatting Signed-off-by: shikharbhardwaj <8502456+shikharbhardwaj@users.noreply.github.com> --------- Signed-off-by: shikharbhardwaj <8502456+shikharbhardwaj@users.noreply.github.com>	2025-09-01 11:04:46 +02:00
geoHeil	9904d14e6a	fix: extend offline mode for rapidocr fonts (#2155 ) feat: enable offline mode for docling models Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>	2025-09-01 09:15:47 +02:00
Panos Vagenas	96cab6b536	docs: enrich landing pages (#2165 ) Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-08-29 17:19:05 +02:00
github-actions[bot]	fb3b7b93ae	chore: bump version to 2.48.0 [skip ci] v2.48.0	2025-08-26 05:29:31 +00:00
Cesar Berrospi Ramis	fa3327e1a6	fix(html): preserve code blocks in list items (#2131 ) * chore(html): refactor parser to leverage context managers Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(html): parse inline code snippets, also from list items Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(html): remove hidden tags Remove tags that are not meant to be displayed. Add regression tests for code blocks, inline code, and hidden tags. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-08-26 06:43:48 +02:00
Michele Dolfi	c0268416cf	chore: add analytics (#2133 ) add analytics Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-08-25 18:25:38 +02:00
Michele Dolfi	d32d2c97e1	chore: PR approval reminder (#2132 ) PR approval reminder Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-08-25 15:08:37 +02:00
geoHeil	3f60a0fa78	feat: Upgrade to RapidOCR 3.x (#2088 ) * feat: exploring new version * DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com> I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 5815c8f81b0e5ce400332597b6795e5a97ecf775 Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com> * chore: autoformat DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com> I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 5815c8f81b0e5ce400332597b6795e5a97ecf775 * feat: enable configurable runtime for rapidocr and handle new result better; Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com> * chore: fix linter Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com> * chore: use new server model * chore: change default engine type to onnx * chore: tests update for new rapidocr * fix: rebase from main and fix clashes * DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com> I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 5815c8f81b0e5ce400332597b6795e5a97ecf775 I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 02f9db85f562e5cdfda40c52fee55cfd4030d70a I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a7bcb205faedb881f94a89b3bbd29cb31ccd54f0 I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a39482a98cbcff7a825c8321134732af0c65930a I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 63e9d717fa26951566b02761f3fdfc752c31f805 I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: ef12a6ec1ea2846a8a8e2e776eeaa59c2a0c4dfe Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com> * DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com> I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 
2222d2340387f8d9d66f3ca9d8e21a0945a44e7a I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: bc6a1dc507d7f146ec4797a2d3840414f46ac64d I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 56e0d67da7c57d4b5caf8eaef8dff7056c3efd32 I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 871ca21271412006c76acf3c19426140efed3d50 I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 7b1b77159da729d483a581a86c7309acba1712a7 I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a792a714a43e19a91b2b782f54621c1c5efda632 Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com> * DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com> I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: d1fed26323ff829b716bc667fe69532839363e45 I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 346ec1cad943765f886e5d17fb0a54221124689c I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 4d0bbe5bd6e9f7261b97362ff8823af244267089 I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 34a5ad53892a7064a6bf35f890d344d464c78b2f I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 9151959db3ad53535011d1cfdcf9181fdf936bb1 I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 8ef5536f2c098826c6c0a05190f8a80614c3f3cb Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com> * DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com> I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: `7e18637a35` I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: `63fb8ff599` I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by 
to this commit: `0cb9444fb8` I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: `38940d9978` I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: `b6d461ac42` I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: `ee55eb3408` Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com> --------- Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>	2025-08-25 12:10:33 +02:00
github-actions[bot]	2aef5cf328	chore: bump version to 2.47.1 [skip ci] v2.47.1	2025-08-23 14:11:33 +00:00
Michele Dolfi	488f6cdd2d	fix: vllm extra only for linux x86_64 (#2126 ) vllm extra only for linux x86_64 Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-08-23 13:33:15 +02:00
Raphael Norman-Tenazas	6736e66bb4	style: show converted page count in PaginatedPipeline debug statement (#2124 ) * Show converted page count in PaginatedPipeline debug statement * DCO Remediation Commit for Raphael Norman-Tenazas <tenazasr@gmail.com> I, Raphael Norman-Tenazas <tenazasr@gmail.com>, hereby add my Signed-off-by to this commit: `b7930bf56d` Signed-off-by: Raphael Norman-Tenazas <tenazasr@gmail.com> * Show total progress instead of batch size Signed-off-by: Raphael Norman-Tenazas <tenazasr@gmail.com> --------- Signed-off-by: Raphael Norman-Tenazas <tenazasr@gmail.com>	2025-08-23 12:13:20 +02:00
github-actions[bot]	b04e205d1e	chore: bump version to 2.47.0 [skip ci] v2.47.0	2025-08-22 14:15:39 +00:00
VIktor Kuropiantnyk	cdf079dd06	feat(CLI): Option to download arbitrary HuggingFace model (#2123 ) * Added option to docling-tools to download arbitrary HuggingFace model Signed-off-by: Viktor Kuropiatnyk <vku@zurich.ibm.com> * Added note in documentation Signed-off-by: Viktor Kuropiatnyk <vku@zurich.ibm.com> * Removed note on custom artifact path usage from HF download option Signed-off-by: Viktor Kuropiatnyk <vku@zurich.ibm.com> * Fixed typo Signed-off-by: Viktor Kuropiatnyk <vku@zurich.ibm.com> --------- Signed-off-by: Viktor Kuropiatnyk <vku@zurich.ibm.com>	2025-08-22 15:23:29 +02:00
Michele Dolfi	449bde0a6c	test: update docx reference results (#2122 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-08-22 14:26:36 +02:00
Christoph Auer	3c660c0511	feat: batching support for VLMs in transformers backend, add initial VLLM backend (#2094 ) * Prepare existing codes for use with new multi-stage VLM pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add multithreaded VLM pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add VLM task interpreters Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add VLM task interpreters Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Remove prints Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix KeyboardInterrupt behaviour Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add VLLM backend support, optimize process_images Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Tweak defaults Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Implement proper batch inference for HuggingFaceTransformersVlmModel Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Small fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Cleanup hf_transformers_model batching impl Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Adjust example instatiation of multi-stage VLM pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add GoT OCR 2.0 Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Factor out changes without multi-stage pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Reset defaults for generation Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Cleanup Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add torch.compile, fix temperature setting in gen_kwargs Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Expose page_batch_size on CLI Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add torch_dtype bfloat16 to SMOLDOCLING and SMOLVLM model spec Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Clip off pad_token Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-08-22 13:17:33 +02:00
Nikhil Verma	3f03709885	fix: Improve numbered list detection for msword docs (#2100 ) * Improve numbered list detection for msword docs This fixes the list detection in MSWord docs by properly tracking and counting the list entries. It fixes https://github.com/docling-project/docling/issues/2090 * DCO Remediation Commit for Nikhil Verma <nikhilgotmail@gmail.com> I, Nikhil Verma <nikhilgotmail@gmail.com>, hereby add my Signed-off-by to this commit: `509da6658e` Signed-off-by: Nikhil Verma <nikhilgotmail@gmail.com> --------- Signed-off-by: Nikhil Verma <nikhilgotmail@gmail.com>	2025-08-22 10:38:34 +02:00
krrome	94fcc46aa9	feat(html): Support formatting tags in HTML texts (#2111 ) * add parsing for formatting tags in HTML backend Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch> fix latest tests + wiki_duck result files. Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch> * convert _collect_parent_format_tags to staticmethod Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch> --------- Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>	2025-08-22 10:37:34 +02:00
Maroun Touma	e76298c40d	docs: DPK pipeline example using docling library (#2112 ) * Notebook showing example on how to use docling transforms in DPK Signed-off-by: Maroun Touma <touma@us.ibm.com> * fix HF Token name Signed-off-by: Maroun Touma <touma@us.ibm.com> * use %pip instead of pip install jupyter lab Signed-off-by: Maroun Touma <touma@us.ibm.com> * run formatter Signed-off-by: Maroun Touma <touma@us.ibm.com> * add example to mkdocs and fix typo Signed-off-by: Maroun Touma <touma@us.ibm.com> --------- Signed-off-by: Maroun Touma <touma@us.ibm.com>	2025-08-21 10:14:36 +02:00
Panos Vagenas	8996d612aa	docs: add Getting Started page (#2113 ) * docs: add Getting Started page Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * refactor usage Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * minor renaming Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-08-21 08:44:53 +02:00
github-actions[bot]	555506d8e6	chore: bump version to 2.46.0 [skip ci] v2.46.0	2025-08-20 15:25:07 +00:00
Panos Vagenas	76d2cb76b3	chore: update docling-core lock (#2110 ) * chore: pre-check docling-core 2.45.0 Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * update -core pinning Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-08-20 16:41:48 +02:00
Christoph Auer	5f57ff2a45	perf: Clean up resources with docling-parse v4, no parsed_page output by default (#2105 ) * Call PdfDocument.unload_pages from the pipelines where needed, delete parsed_page data unless requested to keep Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * pin docling-parse and update lock Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Reinstate pipeline_options.generate_parsed_page Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-08-20 10:46:31 +02:00
Cesar Berrospi Ramis	c5f2e2fdd6	fix(HTML): parse footer tag as a group in furniture content layer (#2106 ) * fix(HTML): parse footer tag as a section in furniture Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(HTML): add test for body vs furniture in HTML parser. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-08-20 08:42:25 +02:00
mohammed ahmed	8820b5558b	perf: speed up function `_parse_orientation` (#1934 ) * ⚡️ Speed up function `_parse_orientation` by 242% Here’s how you should rewrite the code for maximum speed based on your profiler. - The _bottleneck_ is the line ```python orientations = df_osd.loc[df_osd["key"] == "Orientation in degrees"].value.tolist() ``` This does a dataframe filtering (`loc`) and then materializes a list for every call, which is slow. - We can vectorize this search (avoid repeated boolean masking and conversion). - Instead of `.loc[df_osd["key"] == ...].value.tolist()`, use `.at[idx, 'value']` where `idx` is the first index where key matches, or better, `.values[0]` after a fast boolean mask. - Since you only use the first matching value, you don’t need the full filtered column. - You can optimize `parse_tesseract_orientation` by. - Storing `CLIPPED_ORIENTATIONS` as a set for O(1) lookup if it isn't already (can't change the global so just memoize locally). - Remove unnecessary steps. Here is your optimized code: Why is this faster? - `_fast_get_orientation_value`: - Avoids all index alignment overhead of `df.loc`. - Uses numpy arrays under the hood (thanks to `.values`) for direct boolean masking and fast nonzero lookup. - Fetches just the first match directly, skipping conversion to lists. - Only fetches and processes the single cell you actually want. If you’re sure there’s always exactly one match: You can simplify `_fast_get_orientation_value` to. Or, if always sorted and single. --- - No semantics changed. - Comments unchanged unless part modified. This approach should reduce the time spent in `_parse_orientation()` by almost two orders of magnitude, especially as the DataFrame grows. Let me know if you want further micro-optimizations (e.g., Cython, pre-fetched numpy conversions, etc.)! 
* fix: pandas vet error * DCO Remediation Commit for mohammed <mohammed18200118@gmail.com> I, mohammed <mohammed18200118@gmail.com>, hereby add my Signed-off-by to this commit: `d9824749bb` Signed-off-by: mohammed <mohammed18200118@gmail.com> * Dummy commit to trigger CI Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: mohammed <mohammed18200118@gmail.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-08-19 10:55:18 +02:00
Michele Dolfi	956f82f115	chore: upgrade dependencies in lock file (#2093 ) * chore: upgrade lock file Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix(markdown): update binary hash of a markdown backend ground truth file Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-08-19 10:11:44 +02:00
Matteo	d2494da8b8	feat: new code formula model (#2042 ) * new code formula model Signed-off-by: mao <mao@lenny.zuvela.ibm.com> * new model on hf Signed-off-by: mao <mao@tabby.zuvela.ibm.com> * pre-commits Signed-off-by: mao <mao@login-c.zuvela.ibm.com> * remove MPS Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: mao <mao@lenny.zuvela.ibm.com> Signed-off-by: mao <mao@tabby.zuvela.ibm.com> Signed-off-by: mao <mao@login-c.zuvela.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: mao <mao@lenny.zuvela.ibm.com> Co-authored-by: mao <mao@tabby.zuvela.ibm.com> Co-authored-by: mao <mao@login-c.zuvela.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-08-18 16:01:46 +02:00
github-actions[bot]	c3a7d1d999	chore: bump version to 2.45.0 [skip ci] v2.45.0	2025-08-18 10:25:51 +00:00
Michele Dolfi	31087f3fcc	feat: add backend for METS with Google Books profile (#1989 ) * add backend for METS with Google Books profile Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Fixes for cell indexing Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * use HTMLParser and add options from CLI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix typing and unloading Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * restore guess format Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * rename inputformat Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * use PdfDocumentBackend Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * use test file from test folder (still missing) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add test file Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-08-18 11:43:20 +02:00
krrome	9687297262	feat(html): Support in-line anchor tags in HTML texts (#1659 ) * re-implement links for html backend. Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch> * fix inline groups in list items. write specific test for find_parent_annotation of _extract_text_and_hyperlink_recursively. Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch> * implement hack for images. Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch> --------- Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>	2025-08-18 09:57:16 +02:00

704 Commits