Commit Graph

208 Commits

Author SHA1 Message Date
Christoph Auer
57de8ad63a Fix generate_multimodal_pages
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-14 16:52:58 +02:00
Christoph Auer
3f0b01702b Update example export code
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-14 16:40:40 +02:00
Christoph Auer
a50ba57a1f Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction
2024-10-14 16:36:20 +02:00
Christoph Auer
497ddb34a8 Big refactoring for legacy_document support
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-14 16:36:11 +02:00
Maxim Lysak
e87bf9ae06 Updated pptx backend, fixes issues with lists, also added more different list cases to example
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-14 16:20:17 +02:00
Michele Dolfi
08ab628e75 use self.artifacts_path
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-14 09:03:49 +02:00
Michele Dolfi
ab8f71511b fix artifacts_path via pipeline_options
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-14 08:57:15 +02:00
Michele Dolfi
245b6c4c01 pin picture data with molecule
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-13 18:07:43 +02:00
Michele Dolfi
ddb509628e use do_ flag in pipeline_options
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-13 16:54:46 +02:00
Michele Dolfi
7c8d7e222e use new PictureData
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-13 16:48:16 +02:00
Michele Dolfi
c1ed447c21 propagate raises, add enrichment model, some renaming
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-13 16:03:19 +02:00
Michele Dolfi
941b51aa3e missing renamed files
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 18:10:45 +02:00
Michele Dolfi
7f10a546d3 Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 17:04:01 +02:00
Michele Dolfi
98f1a4597e rename and refactor *model*
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 16:57:40 +02:00
Christoph Auer
69f0ab419c Bump docling-core version
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-11 16:55:01 +02:00
Christoph Auer
2a259b9723 Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction
2024-10-11 16:47:20 +02:00
Christoph Auer
6efcf0a5a5 Add image format support to PdfBackend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-11 16:47:15 +02:00
Michele Dolfi
6c9f869dc7 fix default _enrich_document
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 16:38:45 +02:00
Michele Dolfi
5b5c99e9da Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction
2024-10-11 16:31:28 +02:00
Michele Dolfi
ca2a96d982 initial refactor iteration
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 16:31:13 +02:00
Christoph Auer
d0fccb9342 Merge from simplify-conv-api
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-11 15:57:08 +02:00
Christoph Auer
95c1f80087 Change code to use unordered/ordered list, robustifications
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-11 14:53:38 +02:00
Panos Vagenas
136f16e85a feat!: simplify conversion API (#139)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-11 14:52:37 +02:00
Michele Dolfi
753f67a434 fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 13:06:32 +02:00
Michele Dolfi
94b5e1532d add GlmOptions
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 13:03:38 +02:00
Michele Dolfi
786b89efd9 Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction
2024-10-11 12:59:11 +02:00
Michele Dolfi
c6e1471e02 use options objects
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 12:58:59 +02:00
Christoph Auer
3ee97c42b2 Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction
2024-10-11 12:57:56 +02:00
Christoph Auer
52713f0cf5 Optionally produce legacy_doc
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-11 12:57:47 +02:00
Michele Dolfi
cc9bcc424d fix generation enabled
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 11:49:38 +02:00
Michele Dolfi
331ab36f04 Merge remote-tracking branch 'origin/main' into cau/input-format-abstraction
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 11:23:04 +02:00
Christoph Auer
025983f07b Backend error handling fixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-11 11:18:47 +02:00
github-actions[bot]
2ec39636f0 chore: bump version to 1.19.1 [skip ci]
2024-10-11 08:52:09 +00:00
Christoph Auer
304d16029a More renaming, design enrichment interface
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-11 10:21:31 +02:00
Nikos Livathinos
dae2a3b667 fix: remove stderr from tesseract cli and introduce fuzziness in the text validation of OCR tests (#138)
* feat(OCR tests): Introduce fuzziness in the text validation of OCR tests

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix(TesseractOcrCliModel): Send the stderr to devnull to avoid polluting the console with messages from tesseract cmd

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-10-11 10:21:19 +02:00
Michele Dolfi
051beae203 use new interface in minimal example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 08:30:09 +02:00
Christoph Auer
7aad3dc946 Update test cases for v2
2024-10-10 18:51:19 +02:00
Christoph Auer
cd72ea2412 Added verify_conversion_result_v2, Regenerate v1 and v2 test data
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-10 18:30:54 +02:00
Michele Dolfi
1bcad334f2 pin docling-parse release
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-10 18:30:09 +02:00
Michele Dolfi
3794f8245e add example PNG
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-10 18:29:26 +02:00
Michele Dolfi
a84ba6ddec list all PIL supported mime types
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-10 18:28:56 +02:00
Michele Dolfi
bde8186700 update pinning
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-10 17:54:05 +02:00
Michele Dolfi
c31045754d Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction
2024-10-10 17:41:07 +02:00
Michele Dolfi
50c05b262a pin updates compatible with each other
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-10 17:40:32 +02:00
Christoph Auer
99cfea38d6 Added verify_conversion_result_v2, Regenerate v1 and v2 test data
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-10 15:37:59 +02:00
Christoph Auer
7cad290ceb Refactor test data, legacy usage and more
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-10 13:54:44 +02:00
Panos Vagenas
5f1bd9e9c8 docs: simplify LlamaIndex example using Docling extension (#135)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-09 22:17:56 +02:00
Maxim Lysak
da0700f959 Fixes for docx backend
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-09 16:52:44 +02:00
Christoph Auer
b5a27386c1 Merge from main, update OCR model and test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-09 16:04:19 +02:00
Christoph Auer
0dfbd0b6fc Update examples and test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-09 15:20:27 +02:00