Michele Dolfi
7532ede7f4
fix tessdata env
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-07 18:35:18 +02:00
Nikos Livathinos
bd1837f2f6
fix(CI/CD): Add envvar TESSDATA_PREFIX in the checks.yml to ensure that tesseract has the proper path for the language models.
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-10-07 17:26:58 +02:00
Nikos Livathinos
6faff146e0
fix(OCR): Skip zero area OCR cells for all OCR engines
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-10-07 17:26:58 +02:00
Nikos Livathinos
a9b22a8694
fix(BoundingBox): Fixing the BoundingBox.area() method to work for all values of CoordOrigin
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-10-07 17:26:58 +02:00
Michele Dolfi
9eb3afc16c
expose easyocr arguments
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-07 15:17:40 +02:00
Michele Dolfi
99dfbf6107
add tesseract language packages
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-07 15:14:10 +02:00
Nikos Livathinos
49652eec54
feat(tests): Introduce fuzzy text comparison for OCR tests based on Levenshtein edit distance
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-10-04 14:13:24 +02:00
Michele Dolfi
544f298fb4
add missing install
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-03 19:05:48 +02:00
Michele Dolfi
b3293ffc75
update test results
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-03 19:04:02 +02:00
Michele Dolfi
2784d9c3b5
Merge remote-tracking branch 'origin/main' into feat-multiple-ocr-engines
2024-10-03 19:02:01 +02:00
Michele Dolfi
f57e4b2afb
add tesseract in CI, improve error messages and allow specifying the tesseract cmd
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-03 18:59:29 +02:00
Maxim Lysak
2422f706a1
feat: new torch-based docling models ( #120 )
...
---------
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-03 18:42:33 +02:00
Nikos Livathinos
e571ab50ee
fix(tests): Extend test_e2e_ocr_conversion to cover all OCR engines (easyocr, tesserocr, tesseract)
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-10-03 16:49:23 +02:00
Nikos Livathinos
7ab3b62c18
chore(data_scanned): Simplify the OCR test images. Add GT for easyocr, tesserocr, tesseract
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-10-03 16:46:51 +02:00
github-actions[bot]
9ebbbc1245
chore: bump version to 1.17.0 [skip ci]
2024-10-03 13:44:52 +00:00
Rui Dias Gomes
dde0aff8bd
update examples ( #123 )
...
Signed-off-by: rmdg88 <rmdg88@gmail.com>
2024-10-03 14:28:25 +02:00
Michele Dolfi
d44c62d7ce
feat: windows support ( #122 )
...
* feat: windows support
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add Windows in README
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-03 14:23:47 +02:00
Nikos Livathinos
1d4517ffb4
fix(TesserOcrModel): Refactor code to catch exception in case of import error
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-10-03 14:23:07 +02:00
Michele Dolfi
81d176cd3d
add message for failed easyocr import
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-03 13:38:01 +02:00
Nikos Livathinos
c28846a866
feat: Implement the TesserOcrModel. Introduce the test_e2e_ocr_conversion.py unit test.
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-10-02 18:12:32 +02:00
Nikos Livathinos
a0e72655f7
chore: Update the data_scanned to have recognitions per ocr engine
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-10-02 18:12:32 +02:00
Peter Staar
fed3323e25
tesseract is working
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-02 17:23:50 +02:00
Peter Staar
a3e2cf5473
fixed conflicts
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-02 17:01:34 +02:00
Michele Dolfi
0b76211eed
add examples for switching OCR engine and CLI support
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-02 16:57:48 +02:00
Peter Staar
8d1c1d6dd5
added the tesseract_model.py
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-02 16:40:24 +02:00
Nikos Livathinos
bfdc4e32cc
chore: Add test data with scanned documents and their conversions using EasyOCR
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-10-02 13:35:38 +02:00
Nikos Livathinos
c211808742
feat: tesseract and tesserocr models. WIP.
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-10-02 13:35:00 +02:00
Nikos Livathinos
455d6ff70f
chore: Add tesserocr in poetry
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-10-02 13:27:34 +02:00
Michele Dolfi
bbfc0617f2
feat: add options for choosing OCR engine
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-02 10:47:20 +02:00
github-actions[bot]
cde671cf34
chore: bump version to 1.16.1 [skip ci]
2024-09-27 14:36:40 +00:00
Michele Dolfi
34bd887a7f
fix: allow usage of opencv 4.6.x ( #110 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-27 15:51:43 +02:00
Panos Vagenas
c05b692d69
docs: document chunking ( #111 )
...
[skip ci]
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-27 11:16:04 +02:00
github-actions[bot]
6760571fe1
chore: bump version to 1.16.0 [skip ci]
2024-09-27 06:21:15 +00:00
Christoph Auer
d6df76f90b
feat: Support tableformer model choice ( #90 )
...
* Support tableformer model choice
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update datamodel structure
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update docs
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Cleanup
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add test unit for table options
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Ensure import backwards-compatibility for PipelineOptions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update README
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Adjust parameters on custom_convert
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
* Update Dockerfile
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-09-26 21:37:08 +02:00
Panos Vagenas
39977b5631
chore: move examples extras to respective group ( #103 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-25 15:47:48 +02:00
github-actions[bot]
3dfd02a7e9
chore: bump version to 1.15.0 [skip ci]
2024-09-24 15:58:16 +00:00
Michele Dolfi
6a03c208ec
feat: add figure in markdown ( #98 )
...
* feat: add figures in markdown
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* update to new docling-core and update test results with figures
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* update with improved docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-24 17:28:23 +02:00
github-actions[bot]
001d214a13
chore: bump version to 1.14.0 [skip ci]
2024-09-24 13:38:23 +00:00
Panos Vagenas
d96b96c848
fix: fix OCR setting for pypdfium, minor refactor ( #102 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-24 14:36:00 +02:00
Panos Vagenas
f8f2303348
docs: document CLI, minor README revamp ( #100 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-24 09:21:28 +02:00
Panos Vagenas
f555815343
chore: add RAG notebook titles ( #101 )
...
[skip ci]
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-24 09:17:46 +02:00
Panos Vagenas
3c46e4266c
feat: add URL support to CLI ( #99 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-24 08:47:53 +02:00
github-actions[bot]
c65a01c9b7
chore: bump version to 1.13.1 [skip ci]
2024-09-23 19:04:01 +00:00
Peter W. J. Staar
4794ce460a
fix: updated the render_as_doctags with the new arguments from docling-core ( #93 )
...
* updated the render_as_doctags with the new arguments from docling-core
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* ensuring that docling-core is >1.5.0 to accommodate the latest export-to-doctags parameters
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the doctags tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the README
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fix poetry lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Fix formatting problems
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fixed the doctag export in docling/utils/export.py
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* propagate xsize and ysize
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-23 20:12:18 +02:00
Maxim Lysak
dce9934a0f
Updated to new, clean vector logo, svg and rendered png are provided ( #96 )
...
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
2024-09-23 15:31:21 +02:00
Michele Dolfi
1f4b224ab6
chore: switch to gh apps user ( #92 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-20 17:02:27 +02:00
github-actions[bot]
6dd1e91c4a
chore: bump version to 1.13.0 [skip ci]
2024-09-18 09:26:03 +00:00
Maxim Lysak
0da7519896
docs: updated Docling logo.png with transparent background ( #88 )
...
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
2024-09-18 10:39:11 +02:00
Michele Dolfi
f19bd43798
feat: add table exports ( #86 )
...
* feat: expose docling-core table exporters and add examples
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* remove temp internal implementation of html export
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* pin latest docling-core 1.4.0 with table exports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-18 08:44:13 +02:00
Peter W. J. Staar
442443a102
fix: bumped the glm version and adjusted the tests ( #83 )
...
* bumped the glm version and adjusted the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the poetry lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fix hooks
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fixed the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the tests for tables
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-18 07:43:49 +02:00