docling

mirror of https://github.com/DS4SD/docling.git synced 2025-08-02 15:32:30 +00:00

Author	SHA1	Message	Date
Michele Dolfi	8e5ecad9c9	use latest docling-core Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-02-06 16:33:25 +01:00
Michele Dolfi	9097f6d099	pin wheel of latest docling-parse release Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-02-06 16:20:40 +01:00
Michele Dolfi	81a6d16ae7	add test data results Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-02-06 16:18:30 +01:00
Michele Dolfi	23e82a5f49	fix example filepaths Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-02-06 16:12:52 +01:00
Michele Dolfi	6d801eff55	Merge remote-tracking branch 'origin/main' into multiple-updates Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-02-06 15:58:30 +01:00
Michele Dolfi	69e8a9d499	fix mypy reports Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-02-06 15:55:46 +01:00
Michele Dolfi	ed74fe2ec0	feat: new artifacts path and CLI utility (#876 ) * fix artifacts path Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add docling-models utility Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * missing formatting Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * rename utility to docling-tools Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * rename download methods and deprecation warnings Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * propagate artifacts path usage for ocr models Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * move function to utils Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * remove unused file Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update docs Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * simplify downloading specific model(s) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * minor refactor Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-02-06 15:46:32 +01:00
Michele Dolfi	fce6bb14db	Merge remote-tracking branch 'origin/dev/add-r2l-tests' into multiple-updates Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-02-06 14:33:18 +01:00
Michele Dolfi	6ccff9a299	update lock Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-02-06 14:21:27 +01:00
Vladimir Gurevich	722a6eb7b9	fix(msword_backend): handle conversion error in label parsing (#896 ) Updated label parsing to use `str_to_int` with a default value to prevent potential conversion errors. Signed-off-by: Vladimir Gurevich <vladimir@beaconcure.com> Co-authored-by: Vladimir Gurevich <vladimir@beaconcure.com>	2025-02-06 12:30:51 +01:00
Matteo-Omenetti	6fd8666dc1	new test file Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>	2025-02-05 22:21:31 +01:00
Christoph Auer	7bdd6868ed	Add code to expose text direction of cell Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-02-05 12:48:12 +01:00
Michele Dolfi	5ad6de0560	fix: enrichment models batch size and expose picture classifier (#878 ) * expose picture classifier in CLI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * use different batch size in each model Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * remove batch size from CLI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * cleanup imports Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-02-05 11:46:01 +01:00
Matteo-Omenetti	9f6aa036b1	added new gt for test_e2e_conversion Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>	2025-02-05 10:26:10 +01:00
Matteo-Omenetti	8040a4f19d	added new gt for test_e2e_conversion Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>	2025-02-04 15:24:49 +01:00
Matteo-Omenetti	297e837719	fix black Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>	2025-02-04 14:56:28 +01:00
Peter Staar	d7c9874a88	added three test-files for right-to-left Signed-off-by: Peter Staar <taa@zurich.ibm.com>	2025-02-04 14:49:19 +01:00
Matteo-Omenetti	68d1713802	switch to code formula model v1.0.1 and new test pdf Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>	2025-02-04 14:12:42 +01:00
Peter Staar	5db82d5b67	cleaned up the data folder in the tests Signed-off-by: Peter Staar <taa@zurich.ibm.com>	2025-02-04 13:50:19 +01:00
Matteo-Omenetti	89844a5725	switch to code formula model v1.0.1 and new test pdf Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>	2025-02-04 13:29:02 +01:00
Matteo-Omenetti	48c57144d2	switch to code formula model v1.0.1 and new test pdf Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>	2025-02-04 13:28:13 +01:00
Panos Vagenas	17448163e7	chore: fix docs search (#880 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-02-04 11:35:34 +01:00
Nikos Livathinos	6d3fea0196	docs: Introduce example with custom models for RapidOCR (#874 ) * docs: Introduce example with custom models for RapidOCR Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * chore: Exclude the example with custom RapidOCR models from the examples to run in github actions Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> --------- Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>	2025-02-04 10:07:00 +01:00
github-actions[bot]	b5da4080c9	chore: bump version to 2.18.0 [skip ci]	2025-02-03 14:58:50 +00:00
Panos Vagenas	5ac2887e4a	fix(markdown): fix parsing if doc ending with table (#873 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-02-03 14:38:38 +01:00
Panos Vagenas	a40544a546	chore: clean up top-level file (#872 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-02-03 14:10:12 +01:00
Panos Vagenas	94751a78f4	fix(markdown): add support for HTML content (#855 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-02-03 12:21:05 +01:00
Michele Dolfi	6a76b49a47	feat: Expose equation exports (#869 ) * pin new docling-core and exploit it via assembler changes Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update test results Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update with docling-core release Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-02-03 10:31:19 +01:00
Cesar Berrospi Ramis	0cd81a8122	fix(docx): merged table cells not properly converted (#857 ) * fix(docx): merged cells not properly converted Fix conversion issue of merged cells in Word tables leading to repeated text. Simplify Word table conversion code. Add docx file with several table formats for regression tests. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * chore: add type hinting to docx backend Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-02-03 10:20:03 +01:00
Maxim Lysak	eff16b62cc	fix: Processing of placeholder shapes in pptx that have text but no bbox (#868 ) Processing of placeholder shapes in pptx that have text but no bbox Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2025-02-03 09:33:33 +01:00
Maxim Lysak	b1cf796730	fix: KeyError in tableformer prediction (#854 ) * fix for KeyError in tableformer prediction Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * chore: rewrite cumbersome dictionary checking Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-01-31 17:00:14 +01:00
Christoph Auer	70d68b6164	feat: Add option to define page range (#852 ) Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-01-31 15:23:00 +01:00
Maxim Lysak	d727b04ad0	feat(docx): Support of SDTs in docx backend (#853 ) Support of table of content containers in docx backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2025-01-31 14:52:24 +01:00
Maxim Lysak	2c037ae62e	fix: Fixed docx import with headers that are also lists (#842 ) * Fix for docx when headers are also lists, now recorded as appropriate headers and subheaders, unit test included Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Update docling/backend/msword_backend.py Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com> * Update docling/backend/msword_backend.py Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-01-31 10:51:21 +01:00
Michele Dolfi	2a1f8afe7e	fix: use new add_code in html backend and add more typing hints (#850 ) fix add_code in html backend and add more typing hints Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-31 09:54:17 +01:00
Michele Dolfi	4df085aa6c	feat: Python 3.13 support (#841 ) * test: update results with new docling-core Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update all deps in the lock Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix table in test results Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix version for python3.13 Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * latest poetry version in CI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * activate py3.13 in CI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update docs about python 3.13 Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * test with rapidocr only on python <3.13 Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-30 17:26:42 +01:00
Panos Vagenas	bccb022fc8	fix(markdown): fix empty block handling (#843 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-30 16:22:29 +01:00
Maxim Lysak	fea0a99a95	fix: Fix for the crash when encountering WMF images in pptx and docx (#837 ) * Fix for the crash when encountering WMF images in pptx and docx backends on non Windows platforms Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated faq Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2025-01-30 14:58:27 +01:00
Michele Dolfi	d01a2e73ee	test: update results with new docling-core (#839 ) * test: update results with new docling-core Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix table output in 2203.01017v2.md Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-30 14:07:52 +01:00
Peter W. J. Staar	d7c082894e	docs: updated the readme with upcoming features (#831 ) * updated the readme with upcoming features Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the docs-index Signed-off-by: Peter Staar <taa@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com>	2025-01-30 09:52:54 +01:00
Christoph Auer	f9144f2bb6	docs: Add example for inspection of picture content (#624 ) * chore: Add example for inspection of picture content Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix: Test case re-generation Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix: Test case re-generation only on CPU Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix: Add missing GT files Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-01-29 10:39:00 +01:00
github-actions[bot]	4d11d87d06	chore: bump version to 2.17.0 [skip ci]	2025-01-28 18:37:26 +00:00
Panos Vagenas	5aed9f8aeb	fix: fix single newline handling in MD backend (#824 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-28 19:05:55 +01:00
Cesar Berrospi Ramis	adf6353483	fix: use file extension if filetype fails with PDF (#827 ) Filetype library may not identify some files as PDF. Leverage the file extension as a simple solution. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-01-28 19:03:54 +01:00
Panos Vagenas	ba521dd88f	chore: add missing imports to Office type tests (#826 ) * chore: add missing import to XLSX test Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * Update test_backend_msword.py [skip ci] Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * Update test_backend_pptx.py Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --------- Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-28 16:17:44 +01:00
Panos Vagenas	6875913e34	docs: document Docling JSON parsing (#819 ) * docs: document Docling JSON parsing Also: - factored out and expanded supported formats - reorged feature list Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * update feature list, minor fixes Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --------- Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-28 13:23:30 +01:00
Anastas Stoyanovsky	5139b48e4e	docs: Add SSL verification error mitigation (#821 ) Add SSL verification error mitigation Signed-off-by: Anastas Stoyanovsky <astoyano@redhat.com>	2025-01-28 07:22:43 +01:00
Michele Dolfi	6882e6c38d	feat(CLI): Expose code and formula models in the CLI (#820 ) feat: expose code and formula models in the CLI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-28 06:26:03 +01:00
Cesar Berrospi Ramis	4d41db3f7a	docs(backend XML): do not delete temp file in notebook (#817 ) Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-01-27 18:53:39 +01:00
Cesar Berrospi Ramis	a112d7a035	fix: parse html with omitted body tag (#818 ) * fix: parse HTML files without body tag Parse HTML files without 'body' tag, since it is optional in HTML5 specification. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * test: ensure docling converts HTML without body tag Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-01-27 16:59:00 +01:00

362 Commits