Panos Vagenas
136f16e85a
feat!: simplify conversion API (#139)
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-11 14:52:37 +02:00
Michele Dolfi
753f67a434
fixes
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 13:06:32 +02:00
Michele Dolfi
94b5e1532d
add GlmOptions
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 13:03:38 +02:00
Michele Dolfi
786b89efd9
Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction
2024-10-11 12:59:11 +02:00
Michele Dolfi
c6e1471e02
use options objects
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 12:58:59 +02:00
Christoph Auer
3ee97c42b2
Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction
2024-10-11 12:57:56 +02:00
Christoph Auer
52713f0cf5
Optionally produce legacy_doc
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-11 12:57:47 +02:00
Michele Dolfi
cc9bcc424d
fix generation enabled
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 11:49:38 +02:00
Michele Dolfi
331ab36f04
Merge remote-tracking branch 'origin/main' into cau/input-format-abstraction
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 11:23:04 +02:00
Christoph Auer
025983f07b
Backend error handling fixes
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-11 11:18:47 +02:00
github-actions[bot]
2ec39636f0
chore: bump version to 1.19.1 [skip ci]
2024-10-11 08:52:09 +00:00
Christoph Auer
304d16029a
More renaming, design enrichment interface
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-11 10:21:31 +02:00
Nikos Livathinos
dae2a3b667
fix: remove stderr from tesseract cli and introduce fuzziness in the text validation of OCR tests (#138)
...
* feat(OCR tests): Introduce fuzziness in the text validation of OCR tests
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix(TesseractOcrCliModel): Send the stderr to devnull to avoid polluting the console with messages from tesseract cmd
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-10-11 10:21:19 +02:00
Michele Dolfi
051beae203
use new interface in minimal example
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 08:30:09 +02:00
Christoph Auer
7aad3dc946
Update test cases for v2
2024-10-10 18:51:19 +02:00
Christoph Auer
cd72ea2412
Added verify_conversion_result_v2, Regenerate v1 and v2 test data
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-10 18:30:54 +02:00
Michele Dolfi
1bcad334f2
pin docling-parse release
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-10 18:30:09 +02:00
Michele Dolfi
3794f8245e
add example PNG
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-10 18:29:26 +02:00
Michele Dolfi
a84ba6ddec
list all PIL supported mime types
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-10 18:28:56 +02:00
Michele Dolfi
bde8186700
update pinning
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-10 17:54:05 +02:00
Michele Dolfi
c31045754d
Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction
2024-10-10 17:41:07 +02:00
Michele Dolfi
50c05b262a
pin updates compatible with each other
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-10 17:40:32 +02:00
Christoph Auer
99cfea38d6
Added verify_conversion_result_v2, Regenerate v1 and v2 test data
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-10 15:37:59 +02:00
Christoph Auer
7cad290ceb
Refactor test data, legacy usage and more
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-10 13:54:44 +02:00
Panos Vagenas
5f1bd9e9c8
docs: simplify LlamaIndex example using Docling extension (#135)
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-09 22:17:56 +02:00
Maxim Lysak
da0700f959
Fixes for docx backend
...
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-09 16:52:44 +02:00
Christoph Auer
b5a27386c1
Merge from main, update OCR model and test cases
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-09 16:04:19 +02:00
Christoph Auer
0dfbd0b6fc
Update examples and test cases
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-09 15:20:27 +02:00
Panos Vagenas
6924999f1f
chore: explicitly manage pandas dependency (#134)
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-09 14:50:39 +02:00
github-actions[bot]
0ffc1708d2
chore: bump version to 1.19.0 [skip ci]
2024-10-08 17:42:29 +00:00
Michele Dolfi
f96ea86a00
feat: add options for choosing OCR engines (#118)
...
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
2024-10-08 19:07:08 +02:00
Christoph Auer
080042d06d
Merge from upstream
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-08 16:40:55 +02:00
Christoph Auer
203cf19b1b
Lots of improvements
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-08 16:38:42 +02:00
Maxim Lysak
07d952acf9
Improved backends
...
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-08 16:37:47 +02:00
Christoph Auer
c0447206af
Merge from main
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-08 14:42:33 +02:00
Christoph Auer
1d55cbdca9
Updates for Powerpoint backend
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-08 13:19:58 +02:00
Maxim Lysak
89e58ca730
Added HTML backend implementation, few improvements for other backends
...
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-08 11:14:44 +02:00
Fasal Shah
d412c363d7
fixed unload pdf backend resources (#129)
...
Signed-off-by: faisal shah <fashah@redhat.com>
Co-authored-by: faisal shah <fashah@redhat.com>
2024-10-08 10:46:43 +02:00
Maxim Lysak
f773d8a621
Improved demo code, that saves output mds to files
...
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-07 17:25:17 +02:00
Maxim Lysak
bea9fc22af
Added mspowerpoint backend first implementation, improvements on msword backend
...
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-07 14:55:21 +02:00
Maxim Lysak
1346843301
Improved docx parsing
...
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-07 13:00:50 +02:00
Christoph Auer
e613f7bc6c
Add comments
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-07 12:35:25 +02:00
Maxim Lysak
cefc34e8d8
Working on a first version of DOCX native backend
...
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-04 18:19:40 +02:00
github-actions[bot]
9b82ae3324
chore: bump version to 1.18.0 [skip ci]
2024-10-03 17:16:00 +00:00
Maxim Lysak
2422f706a1
feat: new torch-based docling models (#120)
...
---------
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-03 18:42:33 +02:00
github-actions[bot]
9ebbbc1245
chore: bump version to 1.17.0 [skip ci]
2024-10-03 13:44:52 +00:00
Rui Dias Gomes
dde0aff8bd
update examples (#123)
...
Signed-off-by: rmdg88 <rmdg88@gmail.com>
2024-10-03 14:28:25 +02:00
Michele Dolfi
d44c62d7ce
feat: windows support (#122)
...
* feat: windows support
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add Windows in README
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-03 14:23:47 +02:00
Christoph Auer
1fa7cd9855
Fundamental refactoring for multi-format support
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-01 16:54:09 +02:00
Christoph Auer
cd06d89c2a
Merge branch 'cau/experimental-format' of github.com:DS4SD/docling into cau/input-format-abstraction
2024-09-30 13:47:57 +02:00