Christoph Auer
b66fb830c9
Merge pull request #556 from DS4SD/cau/layout-processing-improvement
...
feat: layout processing improvements and bugfixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 16:29:07 +01:00
Christoph Auer
184eed4095
Merge pull request #514 from DS4SD/nli/performance
...
feat(Accelerator): Introduce AI runtime configuration scheme
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 16:26:27 +01:00
Nikos Livathinos
5c69081453
fix: Ocr AcceleratorDevice
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 15:23:56 +00:00
Nikos Livathinos
6bc1bd2ec4
fix: Correct the way to set GPU for EasyOCR, RapidOCR
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 15:07:54 +00:00
Nikos Livathinos
99ccb69a47
fix: Do proper check to set the device in EasyOCR, RapidOCR.
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 14:46:21 +00:00
Christoph Auer
ce82e23b66
Merge branch 'release_v3' into nli/performance
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 16:52:54 +01:00
Christoph Auer
d006b937ad
Rebase from main
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 16:52:26 +01:00
Nikos Livathinos
c21ada4b22
fix: Introduce Image format options in CLI. Silence the tqdm downloading messages. ( #544 )
...
* fix: main: Introduce format options for Image with the same pdf pipeline_options.
Add RapidOcrOptions to the Union of ocr_options for PdfPipelineOptions
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Silence the tqdm messages during the downloading of model files
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Code styling
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Use the HF API to disable the tqdm progress bars
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 15:57:37 +01:00
Christoph Auer
fbb28b851d
Updated test ground-truth (again), bugfix for empty layout
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:50:04 +01:00
Christoph Auer
840f5e15ed
feat: docling-parse v2 as default PDF backend ( #549 )
...
* Move to_docling_document from ds-glm to this repo
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Upgrade to ds-glm 1.0 and docling-parse 3.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lock
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix DP2 backend code, change CLI default backend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:26:17 +01:00
Christoph Auer
731e48ea43
Updated test ground-truth
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:19:38 +01:00
Nikos Livathinos
1149d3ae08
fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 11:12:28 +01:00
github-actions[bot]
d15d656c39
chore: bump version to 2.9.0 [skip ci]
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 09:33:55 +00:00
Panos Vagenas
48d2cb3505
feat: expose new hybrid chunker, update docs ( #384 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 08:28:29 +01:00
Maxim Lysak
dc71b8c004
fix: Correcting DefaultText ID for MS Word backend ( #537 )
...
Correcting DefaultText ID for MS Word backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 15:48:35 +01:00
Maxim Lysak
c31d9f032e
feat(MS Word backend): Make detection of headers and other styles localization agnostic ( #534 )
...
Using style id instead of style names, which should be localization agnostic
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 15:17:56 +01:00
Nikos Livathinos
f63e5ef3b5
fix: Improve the pydantic objects in the pipeline_options and imports.
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 14:56:35 +01:00
Michele Dolfi
a38f57efce
ci: allow ! in conventionalcommits ( #533 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 14:50:10 +01:00
Sander Maijers
ba32fb8637
fix: Add py.typed marker file ( #531 )
...
feat: add `py.typed` marker file
See: https://typing.readthedocs.io/en/latest/spec/distributing.html#packaging-type-information
Signed-off-by: Sander Maijers <3374183+sanmai-NL@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 13:42:14 +01:00
Panos Vagenas
6f7b128867
docs: document new integrations ( #532 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 13:18:14 +01:00
Peter W. J. Staar
54b4daa2dd
fix: Enable HTML export in CLI and add options for image mode ( #513 )
...
* updated README
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* removed duck in title
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the index.md
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the cli to export html
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added html to cli
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* removed the duck emoji, added the in the cli. Currently, the referenced seems broken
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* cleaning up the comments
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reference is now working
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Clean up styling and docs
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pin docling-core>=2.7.1
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:37:57 +01:00
Maxim Lysak
63f1125d5c
fix: Missing text in docx (t tag) when embedded in a table ( #528 )
...
Fix for missing text in docx (t tag) when embedded in a table
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:37:25 +01:00
Christoph Auer
71f3a7ac3c
Rebase from release_v3
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:33:38 +01:00
Christoph Auer
b0da1a2127
Merge pull request #504 from DS4SD/cau/layout-postprocessing
...
feat: Support hierarchical layout components, expose and group content in pictures, forms and key-value regions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:26:34 +01:00
Michele Dolfi
bed92b766f
fix: restore pydantic version pin after fixes ( #512 )
...
* test: pin new docling-core changes and release pydantic pinning
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* pin docling-core release
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 09:33:39 +01:00
Nikos Livathinos
3bb7df66ca
feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI.
...
- Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run.
- Introduce the accelerator_utils with function to decide the device and resolve the AUTO setting.
- Refactor how the docling-ibm-models are called to match the new init signature of the models.
- Translate the accelerator options to the specific inputs for third-party models.
- Extend the docling CLI with parameters to set the num_threads and device.
- Add new unit tests.
- Add a new example showing how to use the accelerator options.
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 17:29:09 +01:00
Christoph Auer
84f3548d30
Clean up imports again
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 15:22:43 +01:00
Michele Dolfi
e36f7d82f6
fix: folder input in cli ( #511 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 14:22:00 +01:00
Christoph Auer
e97688cd3d
Merge branch 'release_v3' of github.com:DS4SD/docling into cau/layout-postprocessing
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 14:21:09 +01:00
Christoph Auer
11c7c43bad
Move to_docling_document from ds-glm to this repo
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 13:11:41 +01:00
github-actions[bot]
78fad801fe
chore: bump version to 2.8.3 [skip ci]
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 15:16:47 +00:00
Christoph Auer
0240ae2930
Pass nested clusters through GLM as payload
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 13:58:27 +01:00
Christoph Auer
4dcc738b6d
Pass nested cluster processing through full pipeline
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 13:08:45 +01:00
Christoph Auer
0be736227f
fix: improve handling of disallowed formats ( #429 )
...
* fix: Fixes and tests for StopIteration on .convert()
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Remove unnecessary case handling
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Other test fixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* improve handling of unsupported types
- Introduced new explicit exception types instead of `RuntimeError`
- Introduced new `ConversionStatus` value for unsupported formats
- Tidied up converter member typing & removed asserts
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* robustify & simplify format option resolution
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* rename new status, populate ConversionResult errors
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 12:45:32 +01:00
github-actions[bot]
25a0fa38d1
chore: bump version to 2.8.2 [skip ci]
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 10:47:29 +00:00
Michele Dolfi
9f35e368f6
chore: update numpy lock ( #500 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 11:21:31 +01:00
guglie
a7e3f713bb
fix: ParserError EOF inside string ( #470 ) ( #472 )
...
Signed-off-by: guglie <gdguglie@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 11:21:18 +01:00
Michele Dolfi
a01cedbb69
docs: add styling for faq ( #502 )
...
* docs: add styling to faq
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* remove torchaudio
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 11:20:49 +01:00
Panos Vagenas
418d8159bd
perf: prevent temp file leftovers, reuse core type ( #487 )
...
* chore: reuse DocumentStream from docling-core
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* update docling-core version
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* [skip ci] document import line
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* fix: use new resolve_source_to_x functions to avoid tempfile leftovers ( #490 )
use new resolve_source_to_x functions
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 10:40:28 +01:00
Christoph Auer
7245cc6080
Implement hierarchical cluster layout processing
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 10:28:36 +01:00
Gaspard Petit
32e9b4a2cf
fix: PermissionError when using tesseract_ocr_cli_model ( #496 )
...
Signed-off-by: Gaspard Petit <gaspardpetit@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 10:22:03 +01:00
Christoph Auer
e0cf80a919
Upgraded Layout Postprocessing, sending old code back to ERZ
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-02 16:46:51 +01:00
Álvaro Huertas
6ca85993f4
docs: typo in faq ( #484 )
...
Typo faq.md
Signed-off-by: Álvaro Huertas <123009293+huertin03@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-02 10:35:24 +01:00
Michele Dolfi
048031d32b
docs: add automatic api reference ( #475 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-02 09:55:52 +01:00
Michele Dolfi
0e0360a37b
docs: introduce faq section ( #468 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-29 22:34:56 +01:00
github-actions[bot]
1d81b85443
chore: bump version to 2.8.1 [skip ci]
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-29 13:04:48 +00:00
Michele Dolfi
7bd432496a
fix(cli): expose debug options ( #467 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-29 13:25:58 +01:00
Michele Dolfi
861b6a6499
fix: remove unused deps ( #466 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-29 13:18:06 +01:00
Panos Vagenas
9d8d698921
docs: extend integration docs & README ( #456 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-28 09:41:21 +01:00
github-actions[bot]
20a2cd0f53
chore: bump version to 2.8.0 [skip ci]
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-27 13:29:32 +00:00