Commit Graph

273 Commits

Author SHA1 Message Date
Nikos Livathinos
6bc1bd2ec4 fix: Correct the way to set GPU for EasyOCR, RapidOCR
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 15:07:54 +00:00
Nikos Livathinos
99ccb69a47 fix: Do proper check to set the device in EasyOCR, RapidOCR.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 14:46:21 +00:00
Christoph Auer
ce82e23b66 Merge branch 'release_v3' into nli/performance
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 16:52:54 +01:00
Christoph Auer
d006b937ad Rebase from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 16:52:26 +01:00
Nikos Livathinos
c21ada4b22 fix: Introduce Image format options in CLI. Silence the tqdm downloading messages. (#544)
* fix: main: Introduce format options for Image with the same pdf pipeline_options.
Add RapidOcrOptions to the Union of ocr_options for PdfPipelineOptions

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Silence the tqdm messages during the downloading of model files

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Code styling

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Use the HF API to disable the tqdm progress bars

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 15:57:37 +01:00
Christoph Auer
fbb28b851d Updated test ground-truth (again), bugfix for empty layout
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:50:04 +01:00
Christoph Auer
840f5e15ed feat: docling-parse v2 as default PDF backend (#549)
* Move to_docling_document from ds-glm to this repo

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Upgrade to ds-glm 1.0 and docling-parse 3.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lock

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix DP2 backend code, change CLI default backend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:26:17 +01:00
Christoph Auer
731e48ea43 Updated test ground-truth
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:19:38 +01:00
Nikos Livathinos
1149d3ae08 fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 11:12:28 +01:00
github-actions[bot]
d15d656c39 chore: bump version to 2.9.0 [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 09:33:55 +00:00
Panos Vagenas
48d2cb3505 feat: expose new hybrid chunker, update docs (#384)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 08:28:29 +01:00
Maxim Lysak
dc71b8c004 fix: Correcting DefaultText ID for MS Word backend (#537)
Correcting DefaultText ID for MS Word backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 15:48:35 +01:00
Maxim Lysak
c31d9f032e feat(MS Word backend): Make detection of headers and other styles localization agnostic (#534)
Using style id instead of style names, which should be localization agnostic

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 15:17:56 +01:00
Nikos Livathinos
f63e5ef3b5 fix: Improve the pydantic objects in the pipeline_options and imports.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 14:56:35 +01:00
Michele Dolfi
a38f57efce ci: allow ! in conventionalcommits (#533)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 14:50:10 +01:00
Sander Maijers
ba32fb8637 fix: Add py.typed marker file (#531)
feat: add `py.typed` marker file

See: https://typing.readthedocs.io/en/latest/spec/distributing.html#packaging-type-information

Signed-off-by: Sander Maijers <3374183+sanmai-NL@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 13:42:14 +01:00
Panos Vagenas
6f7b128867 docs: document new integrations (#532)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 13:18:14 +01:00
Peter W. J. Staar
54b4daa2dd fix: Enable HTML export in CLI and add options for image mode (#513)
* updated README

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* removed duck in title

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the index.md

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the cli to export html

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added html to cli

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* removed the duck emoji, added the  in the cli. Currently, the reference seems broken

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* cleaning up the comments

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reference is now working

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Clean up styling and docs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pin docling-core>=2.7.1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:37:57 +01:00
Maxim Lysak
63f1125d5c fix: Missing text in docx (t tag) when embedded in a table (#528)
Fix for missing text in docx (t tag) when embedded in a table

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:37:25 +01:00
Christoph Auer
71f3a7ac3c Rebase from release_v3
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:33:38 +01:00
Christoph Auer
b0da1a2127 Merge pull request #504 from DS4SD/cau/layout-postprocessing
feat: Support hierarchical layout components, expose and group content in pictures, forms and key-value regions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:26:34 +01:00
Michele Dolfi
bed92b766f fix: restore pydantic version pin after fixes (#512)
* test: pin new docling-core changes and release pydantic pinning

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin docling-core release

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 09:33:39 +01:00
Nikos Livathinos
3bb7df66ca feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI.
- Introduce AcceleratorOptions and AcceleratorDevice, and use them to set the device on which the models run.
- Introduce accelerator_utils with a function that decides the device and resolves the AUTO setting.
- Refactor how the docling-ibm-models are called to match the models' new init signature.
- Translate the accelerator options into the specific inputs for third-party models.
- Extend the docling CLI with parameters to set num_threads and device.
- Add new unit tests.
- Add a new example showing how to use the accelerator options.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 17:29:09 +01:00
Christoph Auer
84f3548d30 Clean up imports again
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 15:22:43 +01:00
Michele Dolfi
e36f7d82f6 fix: folder input in cli (#511)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 14:22:00 +01:00
Christoph Auer
e97688cd3d Merge branch 'release_v3' of github.com:DS4SD/docling into cau/layout-postprocessing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 14:21:09 +01:00
Christoph Auer
11c7c43bad Move to_docling_document from ds-glm to this repo
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 13:11:41 +01:00
github-actions[bot]
78fad801fe chore: bump version to 2.8.3 [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 15:16:47 +00:00
Christoph Auer
0240ae2930 Pass nested clusters through GLM as payload
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 13:58:27 +01:00
Christoph Auer
4dcc738b6d Pass nested cluster processing through full pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 13:08:45 +01:00
Christoph Auer
0be736227f fix: improve handling of disallowed formats (#429)
* fix: Fixes and tests for StopIteration on .convert()

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Remove unnecessary case handling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Other test fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* improve handling of unsupported types

- Introduced new explicit exception types instead of `RuntimeError`
- Introduced new `ConversionStatus` value for unsupported formats
- Tidied up converter member typing & removed asserts

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* robustify & simplify format option resolution

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* rename new status, populate ConversionResult errors

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 12:45:32 +01:00
github-actions[bot]
25a0fa38d1 chore: bump version to 2.8.2 [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 10:47:29 +00:00
Michele Dolfi
9f35e368f6 chore: update numpy lock (#500)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 11:21:31 +01:00
guglie
a7e3f713bb fix: ParserError EOF inside string (#470) (#472)
Signed-off-by: guglie <gdguglie@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 11:21:18 +01:00
Michele Dolfi
a01cedbb69 docs: add styling for faq (#502)
* docs: add styling to faq

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove torchaudio

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 11:20:49 +01:00
Panos Vagenas
418d8159bd perf: prevent temp file leftovers, reuse core type (#487)
* chore: reuse DocumentStream from docling-core

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* update docling-core version

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* [skip ci] document  import line

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* fix: use new resolve_source_to_x functions to avoid tempfile leftovers (#490)

use new resolve_source_to_x functions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 10:40:28 +01:00
Christoph Auer
7245cc6080 Implement hierarchical cluster layout processing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 10:28:36 +01:00
Gaspard Petit
32e9b4a2cf fix: PermissionError when using tesseract_ocr_cli_model (#496)
Signed-off-by: Gaspard Petit <gaspardpetit@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 10:22:03 +01:00
Christoph Auer
e0cf80a919 Upgraded Layout Postprocessing, sending old code back to ERZ
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-02 16:46:51 +01:00
Álvaro Huertas
6ca85993f4 docs: typo in faq (#484)
Typo faq.md

Signed-off-by: Álvaro Huertas <123009293+huertin03@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-02 10:35:24 +01:00
Michele Dolfi
048031d32b docs: add automatic api reference (#475)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-02 09:55:52 +01:00
Michele Dolfi
0e0360a37b docs: introduce faq section (#468)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-29 22:34:56 +01:00
github-actions[bot]
1d81b85443 chore: bump version to 2.8.1 [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-29 13:04:48 +00:00
Michele Dolfi
7bd432496a fix(cli): expose debug options (#467)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-29 13:25:58 +01:00
Michele Dolfi
861b6a6499 fix: remove unused deps (#466)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-29 13:18:06 +01:00
Panos Vagenas
9d8d698921 docs: extend integration docs & README (#456)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-28 09:41:21 +01:00
github-actions[bot]
20a2cd0f53 chore: bump version to 2.8.0 [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-27 13:29:32 +00:00
Swaymaw
85b29990be feat(ocr): added support for RapidOCR engine (#415)
* adding rapidocr engine for ocr in docling

Signed-off-by: swayam-singhal <swayam.singhal@inito.com>

* fixing styling format

Signed-off-by: Swaymaw <swaymaw@gmail.com>

* updating pyproject.toml and poetry.lock to fix ci bugs

Signed-off-by: Swaymaw <swaymaw@gmail.com>

* help poetry pinning for python3.9

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* simplifying rapidocr options so that device can be changed using a single option for all models

Signed-off-by: Swaymaw <swaymaw@gmail.com>

* fix styling issues and small bug in rapidOcrOptions

Signed-off-by: Swaymaw <swaymaw@gmail.com>

* use default device until we enable global management

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: swayam-singhal <swayam.singhal@inito.com>
Signed-off-by: Swaymaw <swaymaw@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: swayam-singhal <swayam.singhal@inito.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-27 13:57:41 +01:00
Manuel030
767563bf8b fix: use correct image index in word backend (#442)
* fix image index in word backend

Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com>

* fix: Fixes for wordx (#432)

* fixes for referencing drawing blip in wordx

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated lxml dependency version

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com>

* sign dco

Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com>

* correct rebase error

Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com>

---------

Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-27 13:45:07 +01:00
Christoph Auer
29807a2d68 fix: Update tests and examples for docling-core 2.5.1 (#449)
* Update tests for docling-core 2.5.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add export with referenced images to export_figures example

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix OCR tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Revert "Fix OCR tests"

This reverts commit 12b575946f51950fcacece99d4d6eb682125d779.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lockfile for docling-core 2.5.1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-27 13:07:00 +01:00