docling

mirror of https://github.com/DS4SD/docling.git synced 2025-12-08 20:58:11 +00:00

Author	SHA1	Message	Date
Animesh	db985bb159	fix(asr): Implement robust status check in AsrPipeline (#2442 ) * test: Add failing test case for silent audio file * fix: Implement robust status check in AsrPipeline * DCO Remediation Commit for mastermaxx03 <srivastavaanimesh22@gmail.com> I, mastermaxx03 <srivastavaanimesh22@gmail.com>, hereby add my Signed-off-by to this commit: 5fc4d512b330bb0cd347da4cbcca0fbe9687898a I, mastermaxx03 <srivastavaanimesh22@gmail.com>, hereby add my Signed-off-by to this commit: `31a4e9a5f1` Signed-off-by: mastermaxx03 <srivastavaanimesh22@gmail.com> * DCO Remediation Commit for mastermaxx03 <srivastavaanimesh22@gmail.com> I, mastermaxx03 <srivastavaanimesh22@gmail.com>, hereby add my Signed-off-by to this commit: `5fc4d512b3` I, mastermaxx03 <srivastavaanimesh22@gmail.com>, hereby add my Signed-off-by to this commit: `31a4e9a5f1` Signed-off-by: mastermaxx03 <srivastavaanimesh22@gmail.com> * DCO Remediation Commit for mastermaxx03 <srivastavaanimesh22@gmail.com> I, mastermaxx03 <srivastavaanimesh22@gmail.com>, hereby add my Signed-off-by to this commit: `5fc4d512b3` I, mastermaxx03 <srivastavaanimesh22@gmail.com>, hereby add my Signed-off-by to this commit: `31a4e9a5f1` Signed-off-by: mastermaxx03 <srivastavaanimesh22@gmail.com> --------- Signed-off-by: mastermaxx03 <srivastavaanimesh22@gmail.com>	2025-10-13 09:51:31 +02:00
Cesar Berrospi Ramis	cce18b2ff7	fix: deal with chartsheets in workbooks (#2433 ) * fix(xlsx): deal with chartsheets in workbooks Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * tests(xlsx): align test file names Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-10-10 15:06:38 +02:00
Maxim Lysak	9705f4020c	fix: Proper heading support in rich tables for HTML backend (#2394 ) * Fix for the proper headers support in rich tables in HTML Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * cleaning up Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Compatibility with older Python versions Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Fixing Furniture before the first heading rule Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added minimalistic test case Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * added html for the test Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2025-10-07 15:57:32 +02:00
Matvei Smirnov	ee73ffae15	fix(markdown): Setext heading support (#2359 ) Signed-off-by: Matvei Smirnov <vdalekesmirnov@gmail.com> Co-authored-by: Matvei Smirnov <matvei.smirnov@vkteam.ru>	2025-10-03 10:32:53 +02:00
Michele Dolfi	9505202e38	ci: update docling-parse and remove pages.json (#2372 ) * update docling-parse and remove pages.json Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * ocr gt Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-10-03 09:53:13 +02:00
Christoph Auer	ca2be7ff3a	fix: Empty table handling (#2365 ) * add table raw cells when no table structure model was used Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Add RichTableCell instance for tables with missing structure. Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update test GT Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * update test results Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-10-02 19:35:16 +02:00
Michele Dolfi	4f295ed051	fix: add table raw content when no table structure model is used (#1815 ) * add table raw cells when no table structure model was used Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Add RichTableCell instance for tables with missing structure. Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update test GT Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-10-02 13:46:42 +02:00
Maxim Lysak	c803abed9a	feat: Rich tables support for HTML backend (#2324 ) * Rich tables support for HTML backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Decoupling JATS backend from HTML backend, ways of creating tables changed significantly Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * updated and added tests Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Refactored parse_table_data in html_backend into few smaller functions Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Changing scope of few functions in html_backend.py, making them static, when possible Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Fix for HTML tables that have tbody and/or thead, now these tables are also properly supported Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2025-09-29 18:12:16 +02:00
Lucas Morin	9d67bb9ed6	fix: support escaped characters in markdown backend (#2304 ) fix: improve markdown backend to support input documents with escaped characters Signed-off-by: Lucas Morin <lucas.morin222@gmail.com>	2025-09-23 18:00:16 +02:00
Maxim Lysak	e2482a2ada	feat: Rich tables for MSWord backend (#2291 ) * Adding support of rich table cells to MSWord backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Fixes for properly accounting lists, pictures and headers in rich table cells Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Cleaned up msword backend, re-generated docx tests Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added detection of simple table cells in word backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Cleaned up Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2025-09-22 16:41:59 +02:00
Cesar Berrospi Ramis	46efaaefee	feat: add a backend parser for WebVTT files (#2288 ) * feat: add a backend parser for WebVTT files Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * docs: update README with VTT support Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * docs: add description to supported formats Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore: upgrade docling-core to unescape WebVTT in markdown Pin the new release of docling-core 2.48.2. Do not escape HTML reserved characters when exporting WebVTT documents to markdown. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * test: add missing copyright notice Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-09-22 15:24:34 +02:00
Michele Dolfi	ad2f738231	chore: update lock (#2265 ) * update lock Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update changes from docling-core update Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-09-15 11:19:15 +02:00
Peter W. J. Staar	b3d7542061	feat: updated the backend for new docling-parse (#2187 ) * updated the backend and pyproject.toml Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the version and test files Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the lock Signed-off-by: Peter Staar <taa@zurich.ibm.com> * forgot to add 1 updated test-file Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the lock Signed-off-by: Peter Staar <taa@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com>	2025-09-05 10:42:31 +02:00
Nikos Livathinos	e38aa0f7f2	feat: Heron layout model as new default (#1971 ) * feat: Switch default layout model to DOCLING_LAYOUT_HERON. Update the unit test data. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * Use default layout model in model_downloader default args Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Use default layout model in model_downloader default args Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update docling-models tag for TableFormer Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update test GT Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update test GT (from linux CPU) Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal> * fix: Ensure that the visualisations happen on copies of the page image Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * chore: Pinpoint docling-ibm-models to the fix branch for the ReadingOrderPredictor Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * chore: Update uv.lock Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * chore: Update tests GT to match the Heron layout model and the improved reading order model in Linux Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * fix: Introduce the verify_doctags optional parameter in conversion tests to control if a doctags comparison should take place. Skip doctags comparisons for certain tests. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * chore: Generate tests GT on Mac Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * chore: Remove the pinning of the docling-ibm-models and use the release 3.9.1 Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> --------- Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal> Co-authored-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>	2025-09-03 12:45:22 +02:00
Qiefan Jiang	a283ccff25	feat(msexcel): set ContentLayer.INVISIBLE for invisible sheet (#1876 ) * feat(msexcel): ignore invisible sheet * DCO Remediation Commit for Qiefan Jiang <jiangqiefan@bytedance.com> I, Qiefan Jiang <jiangqiefan@bytedance.com>, hereby add my Signed-off-by to this commit: ca391f4908f44f301de54a97057f0b809f5ce66c Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com> * retain invisible sheet with ContentLayer.INVISIBLE Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com> * update UT Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com> * fix: use Optional for python3.9 Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com> * DCO Remediation Commit for Qiefan Jiang <jiangqiefan@bytedance.com> I, Qiefan Jiang <jiangqiefan@bytedance.com>, hereby add my Signed-off-by to this commit: `a34371a90e` Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com> --------- Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>	2025-09-01 13:53:45 +02:00
Panos Vagenas	be26044f14	chore: update docling-core lock (#2169 ) * chore: upgrade docling-core Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * upgrade lock Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-09-01 13:46:10 +02:00
Cesar Berrospi Ramis	fa3327e1a6	fix(html): preserve code blocks in list items (#2131 ) * chore(html): refactor parser to leverage context managers Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(html): parse inline code snippets, also from list items Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * chore(html): remove hidden tags Remove tags that are not meant to be displayed. Add regression tests for code blocks, inline code, and hidden tags. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-08-26 06:43:48 +02:00
Michele Dolfi	449bde0a6c	test: update docx reference results (#2122 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-08-22 14:26:36 +02:00
Nikhil Verma	3f03709885	fix: Improve numbered list detection for msword docs (#2100 ) * Improve numbered list detection for msword docs This fixes the list detection in MSWord docs by properly tracking and counting the list entries. It fixes https://github.com/docling-project/docling/issues/2090 * DCO Remediation Commit for Nikhil Verma <nikhilgotmail@gmail.com> I, Nikhil Verma <nikhilgotmail@gmail.com>, hereby add my Signed-off-by to this commit: `509da6658e` Signed-off-by: Nikhil Verma <nikhilgotmail@gmail.com> --------- Signed-off-by: Nikhil Verma <nikhilgotmail@gmail.com>	2025-08-22 10:38:34 +02:00
krrome	94fcc46aa9	feat(html): Support formatting tags in HTML texts (#2111 ) * add parsing for formatting tags in HTML backend Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch> fix latest tests + wiki_duck result files. Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch> * convert _collect_parent_format_tags to staticmethod Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch> --------- Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>	2025-08-22 10:37:34 +02:00
Panos Vagenas	76d2cb76b3	chore: update docling-core lock (#2110 ) * chore: pre-check docling-core 2.45.0 Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * update -core pinning Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-08-20 16:41:48 +02:00
Cesar Berrospi Ramis	c5f2e2fdd6	fix(HTML): parse footer tag as a group in furniture content layer (#2106 ) * fix(HTML): parse footer tag as a section in furniture Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * fix(HTML): add test for body vs furniture in HTML parser. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-08-20 08:42:25 +02:00
Michele Dolfi	956f82f115	chore: upgrade dependencies in lock file (#2093 ) * chore: upgrade lock file Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix(markdown): update binary hash of a markdown backend ground truth file Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>	2025-08-19 10:11:44 +02:00
Michele Dolfi	31087f3fcc	feat: add backend for METS with Google Books profile (#1989 ) * add backend for METS with Google Books profile Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Fixes for cell indexing Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * use HTMLParser and add options from CLI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix typing and unloading Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * restore guess format Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * rename inputformat Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * use PdfDocumentBackend Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * use test file from test folder (still missing) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add test file Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-08-18 11:43:20 +02:00
krrome	9687297262	feat(html): Support in-line anchor tags in HTML texts (#1659 ) * re-implement links for html backend. Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch> * fix inline groups in list items. write specific test for find_parent_annotation of _extract_text_and_hyperlink_recursively. Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch> * implement hack for images. Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch> --------- Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>	2025-08-18 09:57:16 +02:00
Cesar Berrospi Ramis	86f70128aa	fix(HTML): replace non-standard Unicode characters (#2006 ) chore(HTML): replace non-standard Unicode characters for beter downstream tasks Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-07-29 11:05:35 +02:00
Cesar Berrospi Ramis	aec29a7315	fix(markdown): ensure correct parsing of nested lists (#1995 ) * fix(markdown): ensure correct parsing of nested lists Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * chore: update dependencies in uv.lock file Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-07-25 15:17:57 +02:00
Christoph Auer	1985841a19	ci: Fixes for test GT (#1992 ) Fixes for test GT Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-07-25 12:28:06 +02:00
Cesar Berrospi Ramis	5132f061a8	fix(HTML): concatenation of child strings in table cells and list items (#1981 ) fix(HTML): ensure correct concatenation of child strings in table cells and list items Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-07-24 11:19:25 +02:00
Rafael Teixeira de Lima	0b83609531	fix(docx): Adding plain latex equations to table cells (#1986 ) * Adding plain latex equations to table cells Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Adding test files Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> --------- Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>	2025-07-24 11:02:24 +02:00
Copilot	8d50a59d48	fix: multi-page image support (tiff) (#1928 ) * Initial plan * Fix multi-page TIFF image support Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com> * add RGB conversion Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Remove pointless test Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add multi-page TIFF test data and verification tests Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com> * Revert "Add multi-page TIFF test data and verification tests" This reverts commit `130a10e2d9`. * Proper test for 2 page tiff file Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * DCO Remediation Commit for copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: `420df478f3` I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: `c1d722725f` I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: `6aa85cc933` I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: `130a10e2d9` I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: `d571f36299` I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: `2aab66288b` Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Proper test for 2 page tiff file (2) Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-07-23 09:55:40 +02:00
Cesar Berrospi Ramis	a069b1175b	refactor(HTML): handle text from styled html (#1960 ) * A new HTML backend that handles styled html (ignors it) as well as images. Images are parsed as placeholders with a caption, if it exists. Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Co-authored-by: vaaale <2428222+vaaale@users.noreply.github.com> Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com> Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com> * tests(HTML): re-enable test_ordered_lists Re-enable test_ordered_lists regression test for the HTML backend since docling-core now supports ordered lists with custom start value. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com> Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com> Co-authored-by: Alexander Vaagan <2428222+vaaale@users.noreply.github.com>	2025-07-22 13:16:31 +02:00
Cesar Berrospi Ramis	e1e3053695	fix: fix HTML table parser and JATS backend bugs (#1948 ) Fix a bug in parsing HTML tables in HTML backend. Fix a bug in test file that prevented JATS backend tests. Ensure that the JATS backend creates headings with the right level. Remove unnecessary data files for testing JATS backend. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-07-16 10:49:24 +02:00
Christoph Auer	cc6193b3b9	test: Update tests to use default PDF backend (DPv4) (#1923 ) * Update tests to use default PDF backend (DPv4) Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * OCR tests use DPv1 until rotation bugs are fixed Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-07-10 15:16:56 +02:00
Christoph Auer	2b8616d6d5	feat: Layout model specification and multiple choices (#1910 ) * Establish layout_model spec and example instantations Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Updated naming Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Back to uppercase constants Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix deps issue with openai-whipser>numba>llvmlite Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pull v1 changed test GT from main Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-07-10 06:37:27 +02:00
Panos Vagenas	ec588df971	feat: enable precision control in float serialization (#1914 ) * chore: propagate precision control in float serialization Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * parametrize float serialization, propagate core updates Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * update test float precision Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * repin docling-core Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-07-09 16:39:17 +02:00
Christoph Auer	56a0e104f7	feat: Integrate ListItemMarkerProcessor into document assembly (#1825 ) * Integrate ListItemMarkerProcessor into document assembly Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update to final version Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update all test cases Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Upgrade deps Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-07-01 10:04:58 +02:00
Panos Vagenas	0533da1923	feat: leverage new list modeling, capture default markers (#1856 ) * chore: update docling-core & regenerate test data Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * update backends to leverage new list modeling Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * repin docling-core Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * ensure availability of latest docling-core API Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-06-27 16:37:15 +02:00
Michael Honaker	e79e4f0ab6	fix(markdown): make parsing of rich table cells valid (#1821 ) * fix: update md table classification Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com> * Fix ground truth header changes Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com> * Fix merge issues Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com> * Fix minor ground truth errors Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com> --------- Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>	2025-06-26 19:50:45 +02:00
Panos Vagenas	7c5614a37a	fix(markdown): fix single-formatted headings & list items (#1820 ) * fix(markdown): fix formatting & inline edge cases (show behavior before change) Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * add change and updated test data Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * update lock Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * improve test case Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-06-25 13:05:06 +02:00
Peter W. J. Staar	1557e7ce3e	feat: Support audio input (#1763 ) * scaffolding in place Signed-off-by: Peter Staar <taa@zurich.ibm.com> * doing scaffolding for audio pipeline Signed-off-by: Peter Staar <taa@zurich.ibm.com> * WIP: got first transcription working Signed-off-by: Peter Staar <taa@zurich.ibm.com> * all working, time to start cleaning up Signed-off-by: Peter Staar <taa@zurich.ibm.com> * first working ASR pipeline Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added openai-whisper as a first transcription model Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updating with asr_options Signed-off-by: Peter Staar <taa@zurich.ibm.com> * finalised the first working ASR pipeline with Whisper Signed-off-by: Peter Staar <taa@zurich.ibm.com> * use whisper from the latest git commit Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Update docling/datamodel/pipeline_options.py Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com> * Update docling/datamodel/pipeline_options.py Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com> * updated comment Signed-off-by: Peter Staar <taa@zurich.ibm.com> * AudioBackend -> DummyBackend Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * file rename Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Rename to NoOpBackend, add test for ASR pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Support every format in NoOpBackend Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add missing audio file and test Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Install ffmpeg system dependency for ASR test Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-06-23 14:47:26 +02:00
Cesar Berrospi Ramis	d26dac61a8	fix(docx): ensure list items have a list parent (#1827 ) Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-06-20 14:47:25 +02:00
mkrssg	1350a8d3e5	fix(msword_backend): Identify text in the same line after an image #1425 (#1610 ) * fix(msword_backend): Identify text in the same line after an image / image anchor #1425 Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com> * test: add test file and case for fix(msword_backend): Identify text in the same line after an image / image anchor #1425 Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com> * test: added groundtruth test files for fix(msword_backend): Identify text in the same line after an image / image anchor #1425 Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com> * fix: extraneous empty paragraphs for test files Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com> --------- Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com> Co-authored-by: Michael Krissgau <michael.krissgau@ibm.com>	2025-06-20 10:55:30 +02:00
Panos Vagenas	861abcdcb0	feat(markdown): add formatting & improve inline support (#1804 ) feat(markdown): support formatting & hyperlinks Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-06-18 15:57:57 +02:00
Martin Wind	f28d23cf03	fix: pptx line break and space handling (#1664 ) Signed-off-by: Martin Wind <martin.wind@im-c.at>	2025-06-16 10:44:30 +02:00
Cesar Berrospi Ramis	b886e4df31	fix(asciidoc): set default size when missing in image directive (#1769 ) The AsciiDoc backend should not create an ImageRef with Size equal to None, instead use default size values. Refactor static methods as such and add the staticmethod decorator. Extend the regression test for this fix. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-06-16 10:38:46 +02:00
Christoph Auer	7d3302cb48	feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745 ) * Keep page.parsed_page.textline_cells and page.cells in sync, including OCR Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Make page.parsed_page the only source of truth for text cells Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Small fix Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Correctly compute PDF boxes from pymupdf Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Use different OCR engine order Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add type hints and fix mypy Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * One more test fix Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Remove with pypdfium2_lock from caller sites Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix typing Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-06-13 19:01:55 +02:00
Ayraf	df140227c3	feat: support xlsm files (#1520 ) * code for xlsm support * updated support for xlsm * updated code for xlsm support * Update docling_parse_v4_backend.py Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> * Update docling_parse_v4_backend.py Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> * Update test_backend_msexcel_xlsm.py updated the tests/test_backend_msexcel_xlsm.py: have a function starting with test removed all print statements ** To add an explicit assert {test}=={pred} Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> * Update base_models.py Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> * Update test_backend_msexcel.py Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> * Update test_backend_msexcel_xlsm.py Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> * Update document_converter.py Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> * Delete tests/test_backend_msexcel_xlsm.py Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> * xlsm file Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> * run tests * ran tests * Fix tests, upgrade XSLM example to a valid file Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-06-10 16:55:59 +02:00
Peter W. J. Staar	6613b9e98b	fix: prov for merged-elems (#1728 ) * fix: prov for merged-elems Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Reset pyproject.toml Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix tests Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-06-10 11:22:42 +02:00
Panos Vagenas	61d0d6c755	test: mark flaky test (#1698 ) * test: cleanse Word test file Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * mark textbox file test as flaky Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * fix path usage Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-06-03 13:13:44 +02:00

124 Commits