Michele Dolfi
|
3d66062db8
|
missing one part of the comment
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
|
2024-10-08 18:37:24 +02:00 |
|
Michele Dolfi
|
800b16beff
|
keep only one example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
|
2024-10-08 18:36:53 +02:00 |
|
Nikos Livathinos
|
bb8cd0f7fc
|
fix: Rename the tesseract OCR related classes and filenames
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-08 16:46:25 +02:00 |
|
Nikos Livathinos
|
70a8a2cc82
|
chore(OCR): Rename class names to use Tesseract for the tesserocr and TesseractCLI for the tesseract process
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-08 14:44:23 +02:00 |
|
Nikos Livathinos
|
074acd703c
|
feat(OCR): Introduce support for the language path in the pipelines of both Tesseract OCR engines.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-08 14:30:47 +02:00 |
|
Nikos Livathinos
|
118afee1f3
|
fix(TesserOcrModel): Fix cell coordinates
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-08 14:30:47 +02:00 |
|
Nikos Livathinos
|
29e65e911b
|
fix(test): Introduce parameter in verify_conversion_result() to allow skipping the verification of the cells. It is used in case of OCR tests.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-08 14:30:33 +02:00 |
|
Nikos Livathinos
|
072aaf6bb1
|
fix(test): Update test data for OCR
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-08 14:30:18 +02:00 |
|
Michele Dolfi
|
5bd64779d1
|
add docs for TESSDATA_PREFIX
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
|
2024-10-08 11:37:24 +02:00 |
|
Michele Dolfi
|
ea3f720ef5
|
remove pydantic warning for model_
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
|
2024-10-08 11:32:54 +02:00 |
|
Michele Dolfi
|
67746044a9
|
Merge remote-tracking branch 'origin/main' into feat-multiple-ocr-engines
|
2024-10-08 10:55:08 +02:00 |
|
Michele Dolfi
|
73108d597c
|
docs: explain OCR options
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
|
2024-10-08 10:54:43 +02:00 |
|
Fasal Shah
|
d412c363d7
|
fixed unload pdf backend resources (#129)
Signed-off-by: faisal shah <fashah@redhat.com>
Co-authored-by: faisal shah <fashah@redhat.com>
|
2024-10-08 10:46:43 +02:00 |
|
Michele Dolfi
|
471daee277
|
reorder sections in custom_convert
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
|
2024-10-08 09:53:52 +02:00 |
|
Nikos Livathinos
|
8ec8c38de8
|
fix(CI/CD): Use the eng language package location to set the TESSDATA_PREFIX envvar
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-08 07:08:51 +02:00 |
|
Nikos Livathinos
|
be6489bde0
|
fix(tests): Refactor the data_scanned with a very simple document that allows all OCR engines to produce the same result.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-08 07:07:28 +02:00 |
|
Michele Dolfi
|
7532ede7f4
|
fix tessdata env
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
|
2024-10-07 18:35:18 +02:00 |
|
Nikos Livathinos
|
bd1837f2f6
|
fix(CI/CD): Add envvar TESSDATA_PREFIX in the checks.yml to ensure that tesseract has the proper path for the language models.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-07 17:26:58 +02:00 |
|
Nikos Livathinos
|
6faff146e0
|
fix(OCR): Skip zero area OCR cells for all OCR engines
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-07 17:26:58 +02:00 |
|
Nikos Livathinos
|
a9b22a8694
|
fix(BoundingBox): Fixing the BoundingBox.area() method to work for all values of CoordOrigin
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-07 17:26:58 +02:00 |
|
Michele Dolfi
|
9eb3afc16c
|
expose easyocr arguments
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
|
2024-10-07 15:17:40 +02:00 |
|
Michele Dolfi
|
99dfbf6107
|
add tesseract language packages
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
|
2024-10-07 15:14:10 +02:00 |
|
Nikos Livathinos
|
49652eec54
|
feat(tests): Introduce fuzzy text comparison for OCR tests based on Levenshtein edit distance
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-04 14:13:24 +02:00 |
|
github-actions[bot]
|
9b82ae3324
|
chore: bump version to 1.18.0 [skip ci]
|
2024-10-03 17:16:00 +00:00 |
|
Michele Dolfi
|
544f298fb4
|
add missing install
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
|
2024-10-03 19:05:48 +02:00 |
|
Michele Dolfi
|
b3293ffc75
|
update test results
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
|
2024-10-03 19:04:02 +02:00 |
|
Michele Dolfi
|
2784d9c3b5
|
Merge remote-tracking branch 'origin/main' into feat-multiple-ocr-engines
|
2024-10-03 19:02:01 +02:00 |
|
Michele Dolfi
|
f57e4b2afb
|
add tesseract in CI, improve error messages and allow to specify the tesseract cmd
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
|
2024-10-03 18:59:29 +02:00 |
|
Maxim Lysak
|
2422f706a1
|
feat: new torch-based docling models (#120)
---------
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
|
2024-10-03 18:42:33 +02:00 |
|
Nikos Livathinos
|
e571ab50ee
|
fix(tests): Extend test_e2e_ocr_conversion to cover all OCR engines (easyocr, tesserocr, tesseract)
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-03 16:49:23 +02:00 |
|
Nikos Livathinos
|
7ab3b62c18
|
chore(data_scanned): Simplify the OCR test images. Add GT for easyocr, tesserocr, tesseract
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-03 16:46:51 +02:00 |
|
github-actions[bot]
|
9ebbbc1245
|
chore: bump version to 1.17.0 [skip ci]
|
2024-10-03 13:44:52 +00:00 |
|
Rui Dias Gomes
|
dde0aff8bd
|
update examples (#123)
Signed-off-by: rmdg88 <rmdg88@gmail.com>
|
2024-10-03 14:28:25 +02:00 |
|
Michele Dolfi
|
d44c62d7ce
|
feat: windows support (#122)
* feat: windows support
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add Windows in README
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
|
2024-10-03 14:23:47 +02:00 |
|
Nikos Livathinos
|
1d4517ffb4
|
fix(TesserOcrModel): Refactor code to catch exception in case of import error
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-03 14:23:07 +02:00 |
|
Michele Dolfi
|
81d176cd3d
|
add message for failed easyocr import
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
|
2024-10-03 13:38:01 +02:00 |
|
Nikos Livathinos
|
c28846a866
|
feat: Implement the TesserOcrModel. Introduce the test_e2e_ocr_conversion.py unit test.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-02 18:12:32 +02:00 |
|
Nikos Livathinos
|
a0e72655f7
|
chore: Update the data_scanned to have recognitions per ocr engine
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-02 18:12:32 +02:00 |
|
Peter Staar
|
fed3323e25
|
tesseract is working
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
|
2024-10-02 17:23:50 +02:00 |
|
Peter Staar
|
a3e2cf5473
|
fixed conflicts
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
|
2024-10-02 17:01:34 +02:00 |
|
Michele Dolfi
|
0b76211eed
|
add examples for switching OCR engine and CLI support
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
|
2024-10-02 16:57:48 +02:00 |
|
Peter Staar
|
8d1c1d6dd5
|
added the tesseract_model.py
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
|
2024-10-02 16:40:24 +02:00 |
|
Nikos Livathinos
|
bfdc4e32cc
|
chore: Add test data with scanned documents and their conversions using EasyOCR
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-02 13:35:38 +02:00 |
|
Nikos Livathinos
|
c211808742
|
feat: tesseract and tesserocr models. WIP.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-02 13:35:00 +02:00 |
|
Nikos Livathinos
|
455d6ff70f
|
chore: Add tesserocr in poetry
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
|
2024-10-02 13:27:34 +02:00 |
|
Michele Dolfi
|
bbfc0617f2
|
feat: add options for choosing OCR engine
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
|
2024-10-02 10:47:20 +02:00 |
|
github-actions[bot]
|
cde671cf34
|
chore: bump version to 1.16.1 [skip ci]
|
2024-09-27 14:36:40 +00:00 |
|
Michele Dolfi
|
34bd887a7f
|
fix: allow usage of opencv 4.6.x (#110)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
|
2024-09-27 15:51:43 +02:00 |
|
Panos Vagenas
|
c05b692d69
|
docs: document chunking (#111)
[skip ci]
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
|
2024-09-27 11:16:04 +02:00 |
|
github-actions[bot]
|
6760571fe1
|
chore: bump version to 1.16.0 [skip ci]
|
2024-09-27 06:21:15 +00:00 |
|