Commit Graph

  • 7532ede7f4 fix tessdata env Michele Dolfi 2024-10-07 18:35:18 +0200
  • bd1837f2f6 fix(CI/CD): Add envvar TESSDATA_PREFIX in the checks.yml to ensure that tesseract has the proper path for the language models. Nikos Livathinos 2024-10-07 17:18:18 +0200
  • 6faff146e0 fix(OCR): Skip zero area OCR cells for all OCR engines Nikos Livathinos 2024-10-07 17:01:47 +0200
  • a9b22a8694 fix(BoundingBox): Fixing the BoundingBox.area() method to work for all values of CoordOrigin Nikos Livathinos 2024-10-07 15:41:35 +0200
  • f773d8a621 Improved demo code, that saves output mds to files Maxim Lysak 2024-10-07 17:25:17 +0200
  • 9eb3afc16c expose easyocr arguments Michele Dolfi 2024-10-07 15:17:40 +0200
  • 99dfbf6107 add tesseract language packages Michele Dolfi 2024-10-07 15:14:10 +0200
  • bea9fc22af Added mspowerpoint backend first implementation, improvements on msword backend Maxim Lysak 2024-10-07 14:55:21 +0200
  • c28cc680ef Integrate docling-parse v2 backend Christoph Auer 2024-10-07 13:52:02 +0200
  • 1346843301 Improved docx parsing Maxim Lysak 2024-10-07 13:00:50 +0200
  • e613f7bc6c Add comments Christoph Auer 2024-10-07 12:35:25 +0200
  • 2cb097051f fixed unload pdf backend resources faisal shah 2024-10-06 10:51:08 +0530
  • cefc34e8d8 Working on a first version of DOCX native backend Maxim Lysak 2024-10-04 18:19:40 +0200
  • 86ead45aa1 align with isort extend-metadata-in-examples Panos Vagenas 2024-10-04 15:25:52 +0200
  • 86fd560cfd minor notebook updates Panos Vagenas 2024-10-04 14:50:38 +0200
  • 6e16a2464e add docling splitter to LC example, simplify & align QA output Panos Vagenas 2024-10-04 14:43:27 +0200
  • 49652eec54 feat(tests): Introduce fuzzy text comparison for OCR tests based on Levenshtein edit distance Nikos Livathinos 2024-10-04 14:13:24 +0200
  • f4ee76eaec chore: showcase extended metadata in LlamaIndex example Panos Vagenas 2024-09-27 19:31:43 +0200
  • 9b82ae3324 chore: bump version to 1.18.0 [skip ci] v1.18.0 github-actions[bot] 2024-10-03 17:16:00 +0000
  • 544f298fb4 add missing install Michele Dolfi 2024-10-03 19:05:48 +0200
  • b3293ffc75 update test results Michele Dolfi 2024-10-03 19:04:02 +0200
  • 2784d9c3b5 Merge remote-tracking branch 'origin/main' into feat-multiple-ocr-engines Michele Dolfi 2024-10-03 19:02:01 +0200
  • f57e4b2afb add tesseract in CI, improve error messages and allow to specify the tesseract cmd Michele Dolfi 2024-10-03 18:59:29 +0200
  • 2422f706a1 feat: new torch-based docling models (#120) Maxim Lysak 2024-10-03 18:42:33 +0200
  • 65ed754d37 Updated to docling-ibm-models v2.0.0 Maxim Lysak 2024-10-03 17:46:06 +0200
  • 5d5e3ed0ac Updated dependency on models Maxim Lysak 2024-10-03 16:52:25 +0200
  • 0be9285dac Updated formatting Maxim Lysak 2024-10-03 16:42:12 +0200
  • a614710aa3 Updated tests Maxim Lysak 2024-10-03 16:36:12 +0200
  • aba833ab56 Adapting label mapping for updated layout model Maxim Lysak 2024-10-03 16:04:40 +0200
  • 9b72a61914 Updating tests and layout_model_path for new torch-based layout model Maxim Lysak 2024-10-02 15:37:15 +0200
  • e571ab50ee fix(tests): Extend test_e2e_ocr_conversion to cover all OCR engines (easyocr, tesserocr, tesseract) Nikos Livathinos 2024-10-03 16:47:46 +0200
  • 7ab3b62c18 chore(data_scanned): Simplify the OCR test images. Add GT for easyocr, tesserocr, tesseract Nikos Livathinos 2024-10-03 16:39:16 +0200
  • 9ebbbc1245 chore: bump version to 1.17.0 [skip ci] v1.17.0 github-actions[bot] 2024-10-03 13:44:52 +0000
  • dde0aff8bd update examples (#123) Rui Dias Gomes 2024-10-03 13:28:25 +0100
  • d44c62d7ce feat: windows support (#122) Michele Dolfi 2024-10-03 14:23:47 +0200
  • 1d4517ffb4 fix(TesserOcrModel): Refactor code to catch exception in case of import error Nikos Livathinos 2024-10-03 14:23:07 +0200
  • c5f765aaf9 update examples rmdg88 2024-10-03 12:29:37 +0100
  • 3818fa6d51 add Windows in README Michele Dolfi 2024-10-03 13:48:34 +0200
  • f8cbf5df3c feat: windows support Michele Dolfi 2024-10-03 13:46:35 +0200
  • 81d176cd3d add message for failed easyocr import Michele Dolfi 2024-10-03 13:38:01 +0200
  • 3ad50afd45 apply black reformat rmdg88 2024-10-02 17:30:30 +0100
  • 3a6690ca75 update examples rmdg88 2024-10-02 16:34:22 +0100
  • c28846a866 feat: Implement the TesserOcrModel. Introduce the test_e2e_ocr_conversion.py unit test. Nikos Livathinos 2024-10-02 17:47:01 +0200
  • a0e72655f7 chore: Update the data_scanned to have recognitions per ocr engine Nikos Livathinos 2024-10-02 17:37:10 +0200
  • fed3323e25 tesseract is working Peter Staar 2024-10-02 17:23:50 +0200
  • a3e2cf5473 fixed conflicts Peter Staar 2024-10-02 17:01:34 +0200
  • 0b76211eed add examples for swtching OCR engine and CLI support Michele Dolfi 2024-10-02 16:57:48 +0200
  • 8d1c1d6dd5 added the tesseract_model.py Peter Staar 2024-10-02 16:40:24 +0200
  • bfdc4e32cc chore: Add test data with scanned documents and their conversions usinga EasyOCR nli/tesseract_ocr_models Nikos Livathinos 2024-10-02 13:35:38 +0200
  • c211808742 feat: tesseract and tesserocr models. WIP. Nikos Livathinos 2024-10-02 13:30:27 +0200
  • 455d6ff70f chore: Add tesserocr in poetry Nikos Livathinos 2024-10-02 13:27:34 +0200
  • bbfc0617f2 feat: add options for choosing OCR engine Michele Dolfi 2024-10-02 10:47:20 +0200
  • 1fa7cd9855 Fundamental refactoring for multi-format support Christoph Auer 2024-10-01 16:27:22 +0200
  • cd06d89c2a Merge branch 'cau/experimental-format' of github.com:DS4SD/docling into cau/input-format-abstraction Christoph Auer 2024-09-30 13:47:57 +0200
  • 0a86529afb Repinning Christoph Auer 2024-09-30 13:47:22 +0200
  • cde671cf34 chore: bump version to 1.16.1 [skip ci] v1.16.1 github-actions[bot] 2024-09-27 14:36:40 +0000
  • 34bd887a7f fix: allow usage of opencv 4.6.x (#110) Michele Dolfi 2024-09-27 15:51:43 +0200
  • 91ab382129 Renaming changes Christoph Auer 2024-09-27 15:19:35 +0200
  • c05b692d69 docs: document chunking (#111) Panos Vagenas 2024-09-27 11:16:04 +0200
  • c9b59ccd2a docs: document chunking Panos Vagenas 2024-09-26 22:57:42 +0200
  • ec453d2229 fix: allow usage of opencv 4.6.x Michele Dolfi 2024-09-27 09:58:38 +0200
  • 2461b56b84 Import rewrites, adapt to changes in docling-core Christoph Auer 2024-09-27 09:21:15 +0200
  • 6760571fe1 chore: bump version to 1.16.0 [skip ci] v1.16.0 github-actions[bot] 2024-09-27 06:21:15 +0000
  • d6df76f90b feat: Support tableformer model choice (#90) Christoph Auer 2024-09-26 21:37:08 +0200
  • 9ffd1dc396 Merge from main Christoph Auer 2024-09-26 18:06:08 +0200
  • ba1001d23c Update Dockerfile Christoph Auer 2024-09-26 17:45:17 +0200
  • d4283ccbd4 Adjust parameters on custom_convert Christoph Auer 2024-09-26 17:44:48 +0200
  • 78db47d02a Update README Christoph Auer 2024-09-26 13:48:30 +0200
  • 5843f1ee1f Ensure import backwards-compatibility for PipelineOptions Christoph Auer 2024-09-26 13:46:41 +0200
  • a30a520948 Merge branch 'main' of github.com:DS4SD/docling into cau/tableformer-configuration Christoph Auer 2024-09-26 13:37:42 +0200
  • 0ee82a5e78 Bump deepsearch-glm Christoph Auer 2024-09-25 16:05:54 +0200
  • ba9d115f64 Examples: Don't export experimental output by default Christoph Auer 2024-09-25 15:56:29 +0200
  • ad2bd714d4 Update GT test files for pages Christoph Auer 2024-09-25 15:54:55 +0200
  • 39977b5631 chore: move examples extras to respective group (#103) Panos Vagenas 2024-09-25 15:47:48 +0200
  • 48d8b7bf70 Sync test data from main Christoph Auer 2024-09-25 12:26:12 +0200
  • 3efc2bbbf4 Apply renamings to DocItemLabel Christoph Auer 2024-09-25 12:22:02 +0200
  • 95c539579d [WIP] introducting extra backend abstraction and input formats Christoph Auer 2024-09-25 11:17:49 +0200
  • 7a3ffabe1c chore: move examples extras to respective group Panos Vagenas 2024-09-25 11:09:42 +0200
  • 3dfd02a7e9 chore: bump version to 1.15.0 [skip ci] v1.15.0 github-actions[bot] 2024-09-24 15:58:16 +0000
  • 6a03c208ec feat: add figure in markdown (#98) Michele Dolfi 2024-09-24 17:28:23 +0200
  • 850a521195 Update lockfile Christoph Auer 2024-09-24 16:26:22 +0200
  • 33373ac0dd Switch everything to use label enum, and more Christoph Auer 2024-09-24 16:00:39 +0200
  • b1a3a7a56c Merge remote-tracking branch 'origin/main' into feat-figure-in-markdown Michele Dolfi 2024-09-24 15:41:22 +0200
  • 1571e1e17d update with improved docling-core Michele Dolfi 2024-09-24 15:40:09 +0200
  • 001d214a13 chore: bump version to 1.14.0 [skip ci] v1.14.0 github-actions[bot] 2024-09-24 13:38:23 +0000
  • d96b96c848 fix: fix OCR setting for pypdfium, minor refactor (#102) Panos Vagenas 2024-09-24 14:36:00 +0200
  • 867e06f9f2 Merge from main Christoph Auer 2024-09-24 12:05:17 +0200
  • b54956cce6 Add experimental output in glm_model Christoph Auer 2024-09-24 11:59:33 +0200
  • 7268747dcb fix: fix OCR setting for pypdfium, minor refactor Panos Vagenas 2024-09-24 10:28:56 +0200
  • f8f2303348 docs: document CLI, minor README revamp (#100) Panos Vagenas 2024-09-24 09:21:28 +0200
  • f555815343 chore: add RAG notebook titles (#101) Panos Vagenas 2024-09-24 09:17:46 +0200
  • 3c46e4266c feat: add URL support to CLI (#99) Panos Vagenas 2024-09-24 08:47:53 +0200
  • 4cd3a14835 docs: document CLI, minor README revamp Panos Vagenas 2024-09-24 07:20:30 +0200
  • c53e9b5cc6 feat: add URL support to CLI Panos Vagenas 2024-09-24 07:27:45 +0200
  • be45462853 chore: add RAG notebook titles Panos Vagenas 2024-09-24 07:24:17 +0200
  • c65a01c9b7 chore: bump version to 1.13.1 [skip ci] v1.13.1 github-actions[bot] 2024-09-23 19:04:01 +0000
  • ddb20be002 update to new docling-core and update test results with figures Michele Dolfi 2024-09-23 20:22:19 +0200
  • d0d1ac0957 feat: add figures in markdown Michele Dolfi 2024-09-23 20:13:03 +0200
  • 4794ce460a fix: updated the render_as_doctags with the new arguments from docling-core (#93) Peter W. J. Staar 2024-09-23 20:12:18 +0200
  • 43f8b9182d propagate xsize and ysize Michele Dolfi 2024-09-23 19:17:53 +0200