docling

mirror of https://github.com/DS4SD/docling.git synced 2025-12-08 12:48:28 +00:00

Author	SHA1	Message	Date
Christoph Auer	3c660c0511	feat: batching support for VLMs in transformers backend, add initial VLLM backend (#2094 ) * Prepare existing codes for use with new multi-stage VLM pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add multithreaded VLM pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add VLM task interpreters Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add VLM task interpreters Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Remove prints Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix KeyboardInterrupt behaviour Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add VLLM backend support, optimize process_images Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Tweak defaults Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Implement proper batch inference for HuggingFaceTransformersVlmModel Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Small fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Cleanup hf_transformers_model batching impl Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Adjust example instantiation of multi-stage VLM pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add GoT OCR 2.0 Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Factor out changes without multi-stage pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Reset defaults for generation Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Cleanup Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add torch.compile, fix temperature setting in gen_kwargs Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Expose page_batch_size on CLI Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add torch_dtype bfloat16 to SMOLDOCLING and SMOLVLM model spec Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Clip off pad_token Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-08-22 13:17:33 +02:00
Nikhil Verma	3f03709885	fix: Improve numbered list detection for msword docs (#2100 ) * Improve numbered list detection for msword docs This fixes the list detection in MSWord docs by properly tracking and counting the list entries. It fixes https://github.com/docling-project/docling/issues/2090 * DCO Remediation Commit for Nikhil Verma <nikhilgotmail@gmail.com> I, Nikhil Verma <nikhilgotmail@gmail.com>, hereby add my Signed-off-by to this commit: `509da6658e` Signed-off-by: Nikhil Verma <nikhilgotmail@gmail.com> --------- Signed-off-by: Nikhil Verma <nikhilgotmail@gmail.com>	2025-08-22 10:38:34 +02:00
krrome	94fcc46aa9	feat(html): Support formatting tags in HTML texts (#2111 ) * add parsing for formatting tags in HTML backend Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch> fix latest tests + wiki_duck result files. Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch> * convert _collect_parent_format_tags to staticmethod Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch> --------- Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>	2025-08-22 10:37:34 +02:00
Maroun Touma	e76298c40d	docs: DPK pipeline example using docling library (#2112 ) * Notebook showing example on how to use docling transforms in DPK Signed-off-by: Maroun Touma <touma@us.ibm.com> * fix HF Token name Signed-off-by: Maroun Touma <touma@us.ibm.com> * use %pip instead of pip install jupyter lab Signed-off-by: Maroun Touma <touma@us.ibm.com> * run formatter Signed-off-by: Maroun Touma <touma@us.ibm.com> * add example to mkdocs and fix typo Signed-off-by: Maroun Touma <touma@us.ibm.com> --------- Signed-off-by: Maroun Touma <touma@us.ibm.com>	2025-08-21 10:14:36 +02:00
Panos Vagenas	8996d612aa	docs: add Getting Started page (#2113 ) * docs: add Getting Started page Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * refactor usage Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * minor renaming Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-08-21 08:44:53 +02:00
github-actions[bot]	555506d8e6	chore: bump version to 2.46.0 [skip ci] v2.46.0	2025-08-20 15:25:07 +00:00
Panos Vagenas	76d2cb76b3	chore: update docling-core lock (#2110 ) * chore: pre-check docling-core 2.45.0 Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * update -core pinning Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-08-20 16:41:48 +02:00
Christoph Auer	5f57ff2a45	perf: Clean up resources with docling-parse v4, no parsed_page output by default (#2105 ) * Call PdfDocument.unload_pages from the pipelines where needed, delete parsed_page data unless requested to keep Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * pin docling-parse and update lock Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Reinstate pipeline_options.generate_parsed_page Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-08-20 10:46:31 +02:00
Cesar Berrospi Ramis	c5f2e2fdd6	fix(HTML): parse footer tag as a group in furniture content layer (#2106 ) * fix(HTML): parse footer tag as a section in furniture Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(HTML): add test for body vs furniture in HTML parser. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-08-20 08:42:25 +02:00
mohammed ahmed	8820b5558b	perf: speed up function `_parse_orientation` (#1934 ) * ⚡️ Speed up function `_parse_orientation` by 242% Here’s how you should rewrite the code for maximum speed based on your profiler. - The _bottleneck_ is the line `orientations = df_osd.loc[df_osd["key"] == "Orientation in degrees"].value.tolist()` This does a dataframe filtering (`loc`) and then materializes a list for every call, which is slow. - We can vectorize this search (avoid repeated boolean masking and conversion). - Instead of `.loc[df_osd["key"] == ...].value.tolist()`, use `.at[idx, 'value']` where `idx` is the first index where key matches, or better, `.values[0]` after a fast boolean mask. - Since you only use the first matching value, you don’t need the full filtered column. - You can optimize `parse_tesseract_orientation` by: - Storing `CLIPPED_ORIENTATIONS` as a set for O(1) lookup if it isn't already (can't change the global so just memoize locally). - Removing unnecessary steps. Here is your optimized code: Why is this faster? - `_fast_get_orientation_value`: - Avoids all index alignment overhead of `df.loc`. - Uses numpy arrays under the hood (thanks to `.values`) for direct boolean masking and fast nonzero lookup. - Fetches just the first match directly, skipping conversion to lists. - Only fetches and processes the single cell you actually want. If you’re sure there’s always exactly one match: You can simplify `_fast_get_orientation_value` to. Or, if always sorted and single. --- - No semantics changed. - Comments unchanged unless part modified. This approach should reduce the time spent in `_parse_orientation()` by almost two orders of magnitude, especially as the DataFrame grows. Let me know if you want further micro-optimizations (e.g., Cython, pre-fetched numpy conversions, etc.)! * fix: pandas vet error * DCO Remediation Commit for mohammed <mohammed18200118@gmail.com> I, mohammed <mohammed18200118@gmail.com>, hereby add my Signed-off-by to this commit: `d9824749bb` Signed-off-by: mohammed <mohammed18200118@gmail.com> * Dummy commit to trigger CI Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: mohammed <mohammed18200118@gmail.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-08-19 10:55:18 +02:00
Michele Dolfi	956f82f115	chore: upgrade dependencies in lock file (#2093 ) * chore: upgrade lock file Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix(markdown): update binary hash of a markdown backend ground truth file Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-08-19 10:11:44 +02:00
Matteo	d2494da8b8	feat: new code formula model (#2042 ) * new code formula model Signed-off-by: mao <mao@lenny.zuvela.ibm.com> * new model on hf Signed-off-by: mao <mao@tabby.zuvela.ibm.com> * pre-commits Signed-off-by: mao <mao@login-c.zuvela.ibm.com> * remove MPS Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: mao <mao@lenny.zuvela.ibm.com> Signed-off-by: mao <mao@tabby.zuvela.ibm.com> Signed-off-by: mao <mao@login-c.zuvela.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: mao <mao@lenny.zuvela.ibm.com> Co-authored-by: mao <mao@tabby.zuvela.ibm.com> Co-authored-by: mao <mao@login-c.zuvela.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-08-18 16:01:46 +02:00
github-actions[bot]	c3a7d1d999	chore: bump version to 2.45.0 [skip ci] v2.45.0	2025-08-18 10:25:51 +00:00
Michele Dolfi	31087f3fcc	feat: add backend for METS with Google Books profile (#1989 ) * add backend for METS with Google Books profile Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Fixes for cell indexing Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * use HTMLParser and add options from CLI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix typing and unloading Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * restore guess format Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * rename inputformat Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * use PdfDocumentBackend Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * use test file from test folder (still missing) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add test file Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-08-18 11:43:20 +02:00
krrome	9687297262	feat(html): Support in-line anchor tags in HTML texts (#1659 ) * re-implement links for html backend. Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch> * fix inline groups in list items. write specific test for find_parent_annotation of _extract_text_and_hyperlink_recursively. Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch> * implement hack for images. Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch> --------- Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>	2025-08-18 09:57:16 +02:00
Eric Deandrea	76c1fbd6e8	docs: Add docling Quarkus integration (#2083 ) * Add docling Quarkus integration * DCO Remediation Commit for Eric Deandrea <eric.deandrea@ibm.com> I, Eric Deandrea <eric.deandrea@ibm.com>, hereby add my Signed-off-by to this commit: `86aa0b80f4` Signed-off-by: Eric Deandrea <eric.deandrea@ibm.com> --------- Signed-off-by: Eric Deandrea <eric.deandrea@ibm.com>	2025-08-18 06:55:51 +02:00
Shkarupa Alex	5f050f94e1	feat(vlm): Ability to preprocess VLM response (#1907 ) * Add ability to preprocess VLM response Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com> * Move response decoding to vlm options (requires inheritance to override). Per-page prompt formulation also moved to vlm options to keep api consistent. Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com> --------- Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>	2025-08-12 15:20:24 +02:00
github-actions[bot]	ccfee05847	chore: bump version to 2.44.0 [skip ci] v2.44.0	2025-08-12 09:51:35 +00:00
Peter W. J. Staar	b09033cb73	feat: add convert_string to document-converter (#2069 ) * feat: add convert_string to document-converter Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fix unsupported operand type(s) for \|: type and NoneType Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added tests for convert_string Signed-off-by: Peter Staar <taa@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com>	2025-08-12 11:02:38 +02:00
Panos Vagenas	e2cca931be	docs: add Langflow integration (#2068 ) * docs: add langflow integration Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * fix link Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-08-11 16:03:29 +02:00
Maroun Touma	ed56f2de5d	fix(html): Parse rowspan and colspan when they include non numerical values (#2048 ) * use re to stop at first non-digit Signed-off-by: Maroun Touma <touma@us.ibm.com> * Allow digit in first place followed by non numerical values Signed-off-by: Maroun Touma <touma@us.ibm.com> * refactor to match type checker Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Maroun Touma <touma@us.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-08-11 13:53:29 +02:00
Thomas Vitale	bfda6d34d8	docs: Add Arconia integration (#2061 ) Signed-off-by: Thomas Vitale <ThomasVitale@users.noreply.github.com>	2025-08-08 09:35:47 +02:00
Michele Dolfi	c5f49dc2db	chore: upgrade locked dependencies (#2024 ) lock new deps Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-07-31 16:05:27 +02:00
TwoLeaves	0130e3ae96	fix: support new mlx-vlm module (#2001 ) * fix stream_generate import statement Signed-off-by: TwoLeaves <ohneherren@gmail.com> * pin new mlx-vlm Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: TwoLeaves <ohneherren@gmail.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-07-31 14:13:17 +02:00
Michele Dolfi	2eb760d060	fix: extend error reporting when verbose logging is enabled (#2017 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-07-30 11:23:26 +02:00
Cesar Berrospi Ramis	86f70128aa	fix(HTML): replace non-standard Unicode characters (#2006 ) chore(HTML): replace non-standard Unicode characters for better downstream tasks Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-07-29 11:05:35 +02:00
github-actions[bot]	aae42b37a8	chore: bump version to 2.43.0 [skip ci] v2.43.0	2025-07-28 09:45:53 +00:00
Christoph Auer	aed772ab33	feat: Threaded PDF pipeline (#1951 ) * Initial async pdf pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * UpstreamAwareQueue Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Refactoring into async pipeline primitives and graph Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Cleanups and safety improvements Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Better threaded PDF pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pin docling-ibm-models Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Remove unused args Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add test Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Revise pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Unload doc backend Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Revert "Unload doc backend" This reverts commit `01066f0b6e`. * Remove redundant method Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update threaded test Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal> * Stop accumulating docs in test run Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix: don't starve on docs with > max_queue_size pages Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix: don't starve on docs with > max_queue_size pages Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * DCO Remediation Commit for Christoph Auer <cau@zurich.ibm.com> I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: `fa71cde950` I, Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>, hereby add my Signed-off-by to this commit: `d66da87d96` Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix: python3.9 compat Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Option to enable threadpool with doc_batch_concurrency setting Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Clean up unused code Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix settings defaults expectations Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Use released docling-ibm-models Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Remove ignores for typing/linting Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal> Co-authored-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>	2025-07-26 11:49:37 +02:00
Cesar Berrospi Ramis	aec29a7315	fix(markdown): ensure correct parsing of nested lists (#1995 ) * fix(markdown): ensure correct parsing of nested lists Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * chore: update dependencies in uv.lock file Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-07-25 15:17:57 +02:00
Christoph Auer	1985841a19	ci: Fixes for test GT (#1992 ) Fixes for test GT Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-07-25 12:28:06 +02:00
Cesar Berrospi Ramis	945721a15d	fix(HTML): remove an unnecessary print command (#1988 ) Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-07-25 08:45:15 +02:00
github-actions[bot]	8227841c1b	chore: bump version to 2.42.2 [skip ci] v2.42.2	2025-07-24 10:21:10 +00:00
Cesar Berrospi Ramis	5132f061a8	fix(HTML): concatenation of child strings in table cells and list items (#1981 ) fix(HTML): ensure correct concatenation of child strings in table cells and list items Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-07-24 11:19:25 +02:00
Michele Dolfi	7b5f86098d	docs: add chat with dosu (#1984 ) add chat with dosu Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-07-24 11:07:36 +02:00
Rafael Teixeira de Lima	0b83609531	fix(docx): Adding plain latex equations to table cells (#1986 ) * Adding plain latex equations to table cells Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Adding test files Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> --------- Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>	2025-07-24 11:02:24 +02:00
Copilot	98e2fcff63	fix: Preserve PARTIAL_SUCCESS status when document timeout hits (#1975 ) * Initial plan * Initial investigation: analyze ReadingOrderModel timeout issue Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com> * Complete timeout fix validation with tests and documentation Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com> * Fix timeout status preservation issue by extending _determine_status method Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com> * Fix the PARTIAL_SUCCESS case in _determine_status properly Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-07-23 13:50:40 +02:00
Copilot	8d50a59d48	fix: multi-page image support (tiff) (#1928 ) * Initial plan * Fix multi-page TIFF image support Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com> * add RGB conversion Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Remove pointless test Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add multi-page TIFF test data and verification tests Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com> * Revert "Add multi-page TIFF test data and verification tests" This reverts commit `130a10e2d9`. * Proper test for 2 page tiff file Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * DCO Remediation Commit for copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: `420df478f3` I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: `c1d722725f` I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: `6aa85cc933` I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: `130a10e2d9` I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: `d571f36299` I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: `2aab66288b` Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Proper test for 2 page tiff file (2) Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-07-23 09:55:40 +02:00
github-actions[bot]	ec971bbe68	chore: bump version to 2.42.1 [skip ci] v2.42.1	2025-07-22 16:45:48 +00:00
Christoph Auer	67441ca418	fix: Keep formula clusters also when empty (#1970 ) Keep formula clusters also when empty Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-07-22 17:02:12 +02:00
Michele Dolfi	90a7cc4bdd	docs: enrich existing DoclingDocument (#1969 ) add example for enriching an existing doclingdocument Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-07-22 16:20:15 +02:00
Cesar Berrospi Ramis	a069b1175b	refactor(HTML): handle text from styled html (#1960 ) * A new HTML backend that handles styled html (ignores it) as well as images. Images are parsed as placeholders with a caption, if it exists. Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Co-authored-by: vaaale <2428222+vaaale@users.noreply.github.com> Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com> Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com> * tests(HTML): re-enable test_ordered_lists Re-enable test_ordered_lists regression test for the HTML backend since docling-core now supports ordered lists with custom start value. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com> Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com> Co-authored-by: Alexander Vaagan <2428222+vaaale@users.noreply.github.com>	2025-07-22 13:16:31 +02:00
Fabiano Franz	5d98bcea1b	docs: add documentation for confidence scores (#1912 ) * docs: add documentation for confidence scores Signed-off-by: Fabiano Franz <contact@fabianofranz.com> * Increase focus on confidence grades, scores are informational only Signed-off-by: Fabiano Franz <contact@fabianofranz.com> * Update confidence_scores.md Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> --------- Signed-off-by: Fabiano Franz <contact@fabianofranz.com> Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>	2025-07-21 10:16:17 +02:00
github-actions[bot]	7561be537a	chore: bump version to 2.42.0 [skip ci] v2.42.0	2025-07-18 15:34:59 +00:00
Christoph Auer	cca05c45ea	fix: Safe pipeline init, use device_map in transformers models (#1917 ) * Use device_map for transformer models Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add accelerate Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Relax accelerate min version Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Make pipeline cache+init thread-safe Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-07-18 15:14:36 +02:00
Cesar Berrospi Ramis	e1e3053695	fix: fix HTML table parser and JATS backend bugs (#1948 ) Fix a bug in parsing HTML tables in HTML backend. Fix a bug in test file that prevented JATS backend tests. Ensure that the JATS backend creates headings with the right level. Remove unnecessary data files for testing JATS backend. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-07-16 10:49:24 +02:00
stephencox-ict	d6d2dbe2f9	docs: Fix typos (#1943 ) Fix typos Signed-off-by: stephencox-ict <scox@ict.co>	2025-07-15 09:51:56 +02:00
Christoph Auer	a436be7367	feat: Add option to control empty clusters in layout postprocessing (#1940 ) Add option to control empty clusters in layout postprocessing Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-07-14 18:32:01 +02:00
Copilot	95e70962f1	fix: KeyError: 'fPr' when processing latex fractions in DOCX files (#1926 ) * Initial plan * Initial analysis and fix for KeyError: 'fPr' in OMML fraction processing Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com> * Add comprehensive test for OMML fraction fPr fix Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com> * Use debug logging, remove unnecessary test Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-07-11 09:52:14 +02:00
Copilot	c5fb353f10	fix: Change granite vision model URL from preview to stable version (#1925 ) * Initial plan * Fix granite vision model URL from preview to stable version Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com> * Update to granite vision 3.3 Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> * Update to granite vision 3.3 (2) Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> --------- Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>	2025-07-11 08:46:03 +02:00
github-actions[bot]	6c4bf9d087	chore: bump version to 2.41.0 [skip ci] v2.41.0	2025-07-10 14:25:05 +00:00

619 Commits