Commit Graph

145 Commits

Author SHA1 Message Date
Christoph Auer
efc25225ac Introduce OCR confidence, propagate to orphan in post-processing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-16 14:42:01 +01:00
Christoph Auer
c020f2cba3 Rebase from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-16 11:26:24 +01:00
github-actions[bot]
31184ad516 chore: bump version to 2.12.0 [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-13 18:27:00 +00:00
Nikos Livathinos
16bd38cbf4 feat: Introduce support for GPU Accelerators (#593)
* Upgraded Layout Postprocessing, sending old code back to ERZ

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Implement hierarchical cluster layout processing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pass nested cluster processing through full pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pass nested clusters through GLM as payload

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Move to_docling_document from ds-glm to this repo

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Clean up imports again

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI.
- Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run.
- Introduce the accelerator_utils with function to decide the device and resolve the AUTO setting.
- Refactor the way how the docling-ibm-models are called to match the new init signature of models.
- Translate the accelerator options to the specific inputs for third-party models.
- Extend the docling CLI with parameters to set the num_threads and device.
- Add new unit tests.
- Write new example how to use the accelerator options.

* fix: Improve the pydantic objects in the pipeline_options and imports.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Updated test ground-truth

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updated test ground-truth (again), bugfix for empty layout

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Do proper check to set the device in EasyOCR, RapidOCR.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Rollback changes from main

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test gt

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove unused debug settings

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Review fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Nail the accelerator defaults for MPS

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-13 17:45:22 +01:00
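Commit 16bd38cbf4 above mentions AcceleratorOptions, AcceleratorDevice, and a helper that resolves the AUTO setting. The sketch below is a minimal, hypothetical illustration of that idea, not the project's actual code. The enum members other than AUTO and MPS, the default of 4 threads, the name resolve_device, and the preference order (CUDA, then MPS, then CPU) are all assumptions. Device availability is passed in as plain booleans so the example runs without any third-party library.

```python
from dataclasses import dataclass
from enum import Enum


class AcceleratorDevice(str, Enum):
    # AUTO and MPS appear in the commit message; CPU and CUDA are assumed here.
    AUTO = "auto"
    CPU = "cpu"
    CUDA = "cuda"
    MPS = "mps"


@dataclass
class AcceleratorOptions:
    # Field names follow the commit message; the default values are assumptions.
    num_threads: int = 4
    device: AcceleratorDevice = AcceleratorDevice.AUTO


def resolve_device(
    options: AcceleratorOptions,
    cuda_available: bool,
    mps_available: bool,
) -> AcceleratorDevice:
    """Hypothetical helper: turn the requested device into a concrete one."""
    requested = options.device
    if requested is AcceleratorDevice.AUTO:
        # Assumed preference order: CUDA first, then MPS, then CPU.
        if cuda_available:
            return AcceleratorDevice.CUDA
        if mps_available:
            return AcceleratorDevice.MPS
        return AcceleratorDevice.CPU
    # Fall back to CPU when an explicitly requested device is unavailable.
    if requested is AcceleratorDevice.CUDA and not cuda_available:
        return AcceleratorDevice.CPU
    if requested is AcceleratorDevice.MPS and not mps_available:
        return AcceleratorDevice.CPU
    return requested
```

In this sketch, an explicit request that cannot be satisfied falls back to CPU rather than raising an error. That is a design choice made for the illustration only.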
Christoph Auer
1aaf34056f Merge from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-12 20:17:24 +01:00
Christoph Auer
ccab2db1d4 Update pinnings to docling-core
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-12 20:15:15 +01:00
github-actions[bot]
d1d0ddd924 chore: bump version to 2.11.0 [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-12 08:16:05 +00:00
Christoph Auer
d094c4990a Repin to release package versions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-11 13:16:35 +01:00
Christoph Auer
b66fb830c9 Merge pull request #556 from DS4SD/cau/layout-processing-improvement
feat: layout processing improvements and bugfixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 16:29:07 +01:00
github-actions[bot]
ca83a1f0c9 chore: bump version to 2.10.0 [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 16:28:46 +00:00
Christoph Auer
440c16ff20 fix: Call into docling-core for legacy document transform (#551)
Call into docling-core for legacy document transform

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 17:06:47 +01:00
Christoph Auer
d006b937ad Rebase from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 16:52:26 +01:00
Christoph Auer
840f5e15ed feat: docling-parse v2 as default PDF backend (#549)
* Move to_docling_document from ds-glm to this repo

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Upgrade to ds-glm 1.0 and docling-parse 3.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lock

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix DP2 backend code, change CLI default backend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:26:17 +01:00
github-actions[bot]
d15d656c39 chore: bump version to 2.9.0 [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 09:33:55 +00:00
Panos Vagenas
48d2cb3505 feat: expose new hybrid chunker, update docs (#384)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 08:28:29 +01:00
Peter W. J. Staar
54b4daa2dd fix: Enable HTML export in CLI and add options for image mode (#513)
* updated README

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* removed duck in title

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the index.md

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the cli to export html

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added html to cli

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* removed the duck emoji, added the  in the cli. Currently, the reference seems broken

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* cleaning up the comments

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reference is now working

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Clean up styling and docs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pin docling-core>=2.7.1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:37:57 +01:00
Michele Dolfi
bed92b766f fix: restore pydantic version pin after fixes (#512)
* test: pin new docling-core changes and release pydantic pinning

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin docling-core release

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 09:33:39 +01:00
Christoph Auer
84f3548d30 Clean up imports again
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 15:22:43 +01:00
github-actions[bot]
78fad801fe chore: bump version to 2.8.3 [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 15:16:47 +00:00
github-actions[bot]
25a0fa38d1 chore: bump version to 2.8.2 [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 10:47:29 +00:00
Michele Dolfi
9f35e368f6 chore: update numpy lock (#500)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 11:21:31 +01:00
Panos Vagenas
418d8159bd perf: prevent temp file leftovers, reuse core type (#487)
* chore: reuse DocumentStream from docling-core

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* update docling-core version

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* [skip ci] document  import line

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* fix: use new resolve_source_to_x functions to avoid tempfile leftovers (#490)

use new resolve_source_to_x functions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 10:40:28 +01:00
Michele Dolfi
048031d32b docs: add automatic api reference (#475)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-02 09:55:52 +01:00
github-actions[bot]
1d81b85443 chore: bump version to 2.8.1 [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-29 13:04:48 +00:00
Michele Dolfi
861b6a6499 fix: remove unused deps (#466)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-29 13:18:06 +01:00
github-actions[bot]
20a2cd0f53 chore: bump version to 2.8.0 [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-27 13:29:32 +00:00
Swaymaw
85b29990be feat(ocr): added support for RapidOCR engine (#415)
* adding rapidocr engine for ocr in docling

Signed-off-by: swayam-singhal <swayam.singhal@inito.com>

* fixing styling format

Signed-off-by: Swaymaw <swaymaw@gmail.com>

* updating pyproject.toml and poetry.lock to fix ci bugs

Signed-off-by: Swaymaw <swaymaw@gmail.com>

* help poetry pinning for python3.9

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* simplifying rapidocr options so that device can be changed using a single option for all models

Signed-off-by: Swaymaw <swaymaw@gmail.com>

* fix styling issues and small bug in rapidOcrOptions

Signed-off-by: Swaymaw <swaymaw@gmail.com>

* use default device until we enable global management

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: swayam-singhal <swayam.singhal@inito.com>
Signed-off-by: Swaymaw <swaymaw@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: swayam-singhal <swayam.singhal@inito.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-27 13:57:41 +01:00
Christoph Auer
29807a2d68 fix: Update tests and examples for docling-core 2.5.1 (#449)
* Update tests for docling-core 2.5.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add export with referenced images to export_figures example

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix OCR tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Revert "Fix OCR tests"

This reverts commit 12b575946f51950fcacece99d4d6eb682125d779.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lockfile for docling-core 2.5.1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-27 13:07:00 +01:00
github-actions[bot]
6666d9ec07 chore: bump version to 2.7.1 [skip ci]
2024-11-26 15:01:33 +00:00
Maxim Lysak
d0a1180478 fix: Fixes for wordx (#432)
* fixes for referencing drawing blip in wordx

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated lxml dependency version

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-26 14:44:43 +01:00
Michele Dolfi
d7072b4b56 fix: force pydantic < 2.10.0 (#407)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-22 08:23:11 +01:00
github-actions[bot]
eb64f6d368 chore: bump version to 2.7.0 [skip ci]
2024-11-20 15:36:51 +00:00
Michele Dolfi
7b013abcf3 fix: python3.9 support (#396)
* fixes for python3.9

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin docling-parse with python3.9 wheels

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-20 15:21:40 +01:00
nuridol
6efa96c983 feat: add support for ocrmac OCR engine on macOS (#276)
* feat: add support for `ocrmac` OCR engine on macOS

- Integrates `ocrmac` as an OCR engine option for macOS users.
- Adds configuration options and dependencies for `ocrmac`.
- Updates documentation to reflect new engine support.

This change allows macOS users to utilize `ocrmac` for improved OCR performance and compatibility.

Signed-off-by: Suhwan Seo <nuridol@gmail.com>

* updated the poetry lock

Signed-off-by: Suhwan Seo <nuridol@gmail.com>

* Fix linting issues, update CLI docs, and add error for ocrmac use on non-Mac systems

- Resolved formatting and linting issues
- Updated `--ocr-engine` CLI option documentation for `ocrmac`
- Added RuntimeError for attempts to use `ocrmac` on non-Mac platforms

Signed-off-by: Suhwan Seo <nuridol@gmail.com>

* feat: add support for `ocrmac` OCR engine on macOS

- Integrates `ocrmac` as an OCR engine option for macOS users.
- Adds configuration options and dependencies for `ocrmac`.
- Updates documentation to reflect new engine support.

This change allows macOS users to utilize `ocrmac` for improved OCR performance and compatibility.

Signed-off-by: Suhwan Seo <nuridol@gmail.com>

* docs: update examples and installation for ocrmac support

- Added `OcrMacOptions` to `custom_convert.py` and `full_page_ocr.py` examples.
- Included usage comments and examples for `OcrMacOptions` in OCR pipelines.
- Updated installation guide to include instructions for installing `ocrmac`, noting macOS version requirements (10.15+).
- Highlighted that `ocrmac` leverages Apple's Vision framework as an OCR backend.

This enhances documentation for users working on macOS to leverage `ocrmac` effectively.

Signed-off-by: Suhwan Seo <nuridol@gmail.com>

* fix: update `ocrmac` dependency with macOS-specific marker

- Added `sys_platform == 'darwin'` marker to the `ocrmac` dependency in `pyproject.toml` to specify macOS compatibility.
- Updated the content hash in `poetry.lock` to reflect the changes.

This ensures the `ocrmac` dependency is only installed on macOS systems.

Signed-off-by: Suhwan Seo <nuridol@gmail.com>

---------

Signed-off-by: Suhwan Seo <nuridol@gmail.com>
Co-authored-by: Suhwan Seo <nuridol@gmail.com>
2024-11-20 12:51:19 +01:00
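Commit 6efa96c983 above describes raising a RuntimeError when `ocrmac` is requested on a non-Mac platform, and restricting the dependency with a `sys_platform == 'darwin'` marker. The following is a rough sketch of such a guard, under stated assumptions: the function name ensure_ocrmac_supported and the error text are hypothetical and are not taken from the repository. Only `sys.platform` is a real standard-library attribute.

```python
import sys


def ensure_ocrmac_supported(platform: str = sys.platform) -> None:
    """Hypothetical guard: raise RuntimeError unless running on macOS.

    The platform argument defaults to sys.platform and exists only so the
    check can be exercised on any machine.
    """
    # sys.platform is "darwin" on macOS, matching the dependency marker
    # sys_platform == 'darwin' mentioned in the commit message.
    if platform != "darwin":
        raise RuntimeError(
            f"The ocrmac OCR engine is only available on macOS (got {platform!r})."
        )
```

Calling `ensure_ocrmac_supported()` with no argument checks the current interpreter's platform. Passing a string such as `"linux"` exercises the failure path.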
github-actions[bot]
2cfaceb787 chore: bump version to 2.6.0 [skip ci]
2024-11-19 16:07:34 +00:00
Shubham Gupta
3f91e7d3f1 feat: added support for exporting DocItem to an image when page image is available (#379)
* Updated minimum docling-core version to 2.4.0

Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com>

* Deprecated the generate_table_images option

Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com>

* Updated examples to use get_image instead of element.image

Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com>

---------

Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com>
2024-11-19 16:28:52 +01:00
Peter W. J. Staar
926dfd29d5 feat: added excel backend (#334)
* feat: added excel backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* first msexcel backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added tooling for the cli

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* first working version for excel parsing of tables

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added proper typing for mypy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added proper typing for mypy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* refactor EXCEL to XLSX

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the unit tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ran poetry lock

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* adding images to output [WIP]

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the mypy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the msexcel

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the msexcel (2)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the mypy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added tests for merged cells in excel

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-19 12:21:17 +01:00
Michele Dolfi
ca8524ecae docs: add automatic generation of CLI reference (#325)
* docs: add automatic generation of CLI reference

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* install deps for building CLI ref

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-15 13:18:17 +01:00
github-actions[bot]
5a44236ac2 chore: bump version to 2.5.2 [skip ci]
2024-11-13 08:19:09 +00:00
github-actions[bot]
2c0c439a44 chore: bump version to 2.5.1 [skip ci]
2024-11-12 14:56:34 +00:00
github-actions[bot]
777237ebc9 chore: bump version to 2.5.0 [skip ci]
2024-11-12 10:19:55 +00:00
github-actions[bot]
be8aa17291 chore: bump version to 2.4.2 [skip ci]
2024-11-08 16:31:47 +00:00
github-actions[bot]
118f162e64 chore: bump version to 2.4.1 [skip ci]
2024-11-08 12:37:36 +00:00
github-actions[bot]
e30a9c25a2 chore: bump version to 2.4.0 [skip ci]
2024-11-04 15:11:09 +00:00
Panos Vagenas
862d78d271 chore: update pyproject.toml metadata (#229)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-04 15:48:00 +01:00
github-actions[bot]
9d8865856d chore: bump version to 2.3.1 [skip ci]
2024-10-30 18:23:53 +00:00
Michele Dolfi
eb679ccbb4 fix: simplify torch dependencies and update pinned docling deps (#190)
* fix: simplify torch dependencies and update pinned docling deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update docling-ibm-models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-30 18:44:08 +01:00
github-actions[bot]
43349865d0 chore: bump version to 2.3.0 [skip ci]
2024-10-30 14:47:37 +00:00
Peter W. J. Staar
f542460af3 fix: fix duplicate title and heading + add e2e tests for html and docx (#186)
* add real e2e tests for html and docx

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the output of itxt

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the text

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the tests (2)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the examples (1)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the output of the test

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the tests, moved the ground-truth

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* moved the ground-truth data

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the html tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* restructure title fix (#187)

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-30 13:14:56 +01:00
github-actions[bot]
dda2645d4c chore: bump version to 2.2.1 [skip ci]
2024-10-28 17:18:41 +00:00