Christoph Auer
cd579fd28e
Merge pull request #556 from DS4SD/cau/layout-processing-improvement
...
feat: layout processing improvements and bugfixes
2024-12-10 16:29:07 +01:00
Christoph Auer
e282bfd8c8
Merge pull request #514 from DS4SD/nli/performance
...
feat(Accelerator): Introduce AI runtime configuration scheme
2024-12-10 16:26:27 +01:00
Nikos Livathinos
f46fd9c0a6
fix: Ocr AcceleratorDevice
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-12-10 15:23:56 +00:00
Nikos Livathinos
94caee3fb5
fix: Correct the way to set GPU for EasyOCR, RapidOCR
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-12-10 15:07:54 +00:00
Nikos Livathinos
accb7b4481
fix: Do proper check to set the device in EasyOCR, RapidOCR.
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-12-10 14:46:21 +00:00
Christoph Auer
bb1774dd6b
Merge branch 'release_v3' into nli/performance
2024-12-09 16:52:54 +01:00
Christoph Auer
9e99e242dc
Rebase from main
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 16:52:26 +01:00
Nikos Livathinos
78f61a8522
fix: Introduce Image format options in CLI. Silence the tqdm downloading messages. ( #544 )
...
* fix: main: Introduce format options for Image with the same pdf pipeline_options.
Add RapidOcrOptions to the Union of ocr_options for PdfPipelineOptions
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Silence the tqdm messages during the downloading of model files
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Code styling
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Use the HF API to disable the tqdm progress bars
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-12-09 15:57:37 +01:00
Christoph Auer
46ae215b68
Updated test ground-truth (again), bugfix for empty layout
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:50:04 +01:00
Christoph Auer
aca57f0527
feat: docling-parse v2 as default PDF backend ( #549 )
...
* Move to_docling_document from ds-glm to this repo
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Upgrade to ds-glm 1.0 and docling-parse 3.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lock
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix DP2 backend code, change CLI default backend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:26:17 +01:00
Christoph Auer
03f8690c62
Updated test ground-truth
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:19:38 +01:00
Nikos Livathinos
5d5d14d00c
fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-12-09 11:12:28 +01:00
github-actions[bot]
9fd2cf847a
chore: bump version to 2.9.0 [skip ci]
2024-12-09 09:33:55 +00:00
Panos Vagenas
c8ecdd987e
feat: expose new hybrid chunker, update docs ( #384 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-09 08:28:29 +01:00
Maxim Lysak
eb7ffcdd1c
fix: Correcting DefaultText ID for MS Word backend ( #537 )
...
Correcting DefaultText ID for MS Word backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-12-06 15:48:35 +01:00
Maxim Lysak
3e073dfbeb
feat(MS Word backend): Make detection of headers and other styles localization agnostic ( #534 )
...
Using style id instead of style names, which should be localization agnostic
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-12-06 15:17:56 +01:00
Nikos Livathinos
975fe076f4
fix: Improve the pydantic objects in the pipeline_options and imports.
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-12-06 14:56:35 +01:00
Michele Dolfi
53039a8367
ci: allow ! in conventionalcommits ( #533 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-06 14:50:10 +01:00
Sander Maijers
9102fe1adc
fix: Add py.typed marker file ( #531 )
...
feat: add `py.typed` marker file
See: https://typing.readthedocs.io/en/latest/spec/distributing.html#packaging-type-information
Signed-off-by: Sander Maijers <3374183+sanmai-NL@users.noreply.github.com>
2024-12-06 13:42:14 +01:00
Panos Vagenas
e780333440
docs: document new integrations ( #532 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-06 13:18:14 +01:00
Peter W. J. Staar
0d11e30dd8
fix: Enable HTML export in CLI and add options for image mode ( #513 )
...
* updated README
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* removed duck in title
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the index.md
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the cli to export html
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added html to cli
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* removed the duck emoji, added the in the cli. Currently, the referenced seems broken
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* cleaning up the comments
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reference is now working
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Clean up styling and docs
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pin docling-core>=2.7.1
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:37:57 +01:00
Maxim Lysak
b730b2d7a0
fix: Missing text in docx (t tag) when embedded in a table ( #528 )
...
Fix for missing text in docx (t tag) when embedded in a table
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-12-06 12:37:25 +01:00
Christoph Auer
6f0b91287c
Rebase from release_v3
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:33:38 +01:00
Christoph Auer
40d7a8e293
Merge pull request #504 from DS4SD/cau/layout-postprocessing
...
feat: Support hierarchical layout components, expose and group content in pictures, forms and key-value regions
2024-12-06 12:26:34 +01:00
Michele Dolfi
c830b92b2e
fix: restore pydantic version pin after fixes ( #512 )
...
* test: pin new docling-core changes and release pydantic pinning
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* pin docling-core release
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-06 09:33:39 +01:00
Nikos Livathinos
ddb8ad9227
feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI.
...
- Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run.
- Introduce the accelerator_utils with function to decide the device and resolve the AUTO setting.
- Refactor how the docling-ibm-models are called to match the new init signature of models.
- Translate the accelerator options to the specific inputs for third-party models.
- Extend the docling CLI with parameters to set the num_threads and device.
- Add new unit tests.
- Add a new example showing how to use the accelerator options.
2024-12-04 17:29:09 +01:00
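The device resolution described in the commit above (deciding the device and resolving the AUTO setting) might look roughly like the sketch below. It is a standard-library-only illustration, not the code from this commit: the `resolve_device` helper, its availability flags, and every enum member other than AUTO are assumptions made for the example.

```python
from enum import Enum


class AcceleratorDevice(str, Enum):
    """Illustrative device choices; only AUTO is named in the commit message."""

    AUTO = "auto"
    CPU = "cpu"
    CUDA = "cuda"
    MPS = "mps"


def resolve_device(requested, cuda_available=False, mps_available=False):
    """Hypothetical helper: map a requested device to one that can actually be used."""
    if requested is AcceleratorDevice.AUTO:
        # AUTO picks the first available accelerator, then falls back to CPU.
        if cuda_available:
            return AcceleratorDevice.CUDA
        if mps_available:
            return AcceleratorDevice.MPS
        return AcceleratorDevice.CPU
    # An explicit request for an unavailable accelerator falls back to CPU.
    if requested is AcceleratorDevice.CUDA and not cuda_available:
        return AcceleratorDevice.CPU
    if requested is AcceleratorDevice.MPS and not mps_available:
        return AcceleratorDevice.CPU
    return requested
```

In a real setup the availability flags would come from the underlying framework; they are passed in here so the sketch stays self-contained.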
Christoph Auer
8b04edd177
Clean up imports again
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 15:22:43 +01:00
Michele Dolfi
8ada0bccc7
fix: folder input in cli ( #511 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-04 14:22:00 +01:00
Christoph Auer
e8266425ac
Merge branch 'release_v3' of github.com:DS4SD/docling into cau/layout-postprocessing
2024-12-04 14:21:09 +01:00
Christoph Auer
a1ac0c66ef
Move to_docling_document from ds-glm to this repo
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 13:11:41 +01:00
github-actions[bot]
9c788ae778
chore: bump version to 2.8.3 [skip ci]
2024-12-03 15:16:47 +00:00
Christoph Auer
65fa584a1a
Pass nested clusters through GLM as payload
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 13:58:27 +01:00
Christoph Auer
db70916f57
Pass nested cluster processing through full pipeline
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 13:08:45 +01:00
Christoph Auer
34c7c79858
fix: improve handling of disallowed formats ( #429 )
...
* fix: Fixes and tests for StopIteration on .convert()
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Remove unnecessary case handling
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Other test fixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* improve handling of unsupported types
- Introduced new explicit exception types instead of `RuntimeError`
- Introduced new `ConversionStatus` value for unsupported formats
- Tidied up converter member typing & removed asserts
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* robustify & simplify format option resolution
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* rename new status, populate ConversionResult errors
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-03 12:45:32 +01:00
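The pattern named in the commit above (an explicit exception type instead of `RuntimeError`, plus a dedicated status value for unsupported formats) can be sketched as follows. All names here are hypothetical: the commit message does not give the exception class or the final status name, so `UnsupportedFormatError`, `SKIPPED`, and `check_format` are stand-ins for illustration only.

```python
from enum import Enum
from pathlib import Path


class ConversionStatus(str, Enum):
    SUCCESS = "success"
    SKIPPED = "skipped"  # hypothetical name for the unsupported-format status


class UnsupportedFormatError(Exception):
    """Hypothetical explicit exception, raised instead of a bare RuntimeError."""


def check_format(path, allowed_suffixes, raises_on_error=True):
    """Return a status for the file, or raise if its format is not allowed."""
    suffix = Path(path).suffix.lower()
    if suffix in allowed_suffixes:
        return ConversionStatus.SUCCESS
    if raises_on_error:
        raise UnsupportedFormatError(f"Format not allowed: {path}")
    # Callers that opt out of exceptions get a distinct status instead.
    return ConversionStatus.SKIPPED
```

A dedicated exception lets callers catch the unsupported-format case without also swallowing unrelated runtime failures.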
github-actions[bot]
2254845da3
chore: bump version to 2.8.2 [skip ci]
2024-12-03 10:47:29 +00:00
Michele Dolfi
672962a8b2
chore: update numpy lock ( #500 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-03 11:21:31 +01:00
guglie
c90c41c391
fix: ParserError EOF inside string ( #470 ) ( #472 )
...
Signed-off-by: guglie <gdguglie@gmail.com>
2024-12-03 11:21:18 +01:00
Michele Dolfi
5ba3807f31
docs: add styling for faq ( #502 )
...
* docs: add styling to faq
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* remove torchaudio
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-03 11:20:49 +01:00
Panos Vagenas
051789d017
perf: prevent temp file leftovers, reuse core type ( #487 )
...
* chore: reuse DocumentStream from docling-core
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* update docling-core version
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* [skip ci] document import line
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* fix: use new resolve_source_to_x functions to avoid tempfile leftovers (#490 )
use new resolve_source_to_x functions
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2024-12-03 10:40:28 +01:00
Christoph Auer
05bffd38f3
Implement hierarchical cluster layout processing
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 10:28:36 +01:00
Gaspard Petit
d3f84b2457
fix: PermissionError when using tesseract_ocr_cli_model ( #496 )
...
Signed-off-by: Gaspard Petit <gaspardpetit@gmail.com>
2024-12-03 10:22:03 +01:00
Christoph Auer
b9f8f5ac7b
Upgraded Layout Postprocessing, sending old code back to ERZ
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-02 16:46:51 +01:00
Álvaro Huertas
33cff98d36
docs: typo in faq ( #484 )
...
Typo faq.md
Signed-off-by: Álvaro Huertas <123009293+huertin03@users.noreply.github.com>
2024-12-02 10:35:24 +01:00
Michele Dolfi
d4872103b8
docs: add automatic api reference ( #475 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-02 09:55:52 +01:00
Michele Dolfi
8ccb3c6db6
docs: introduce faq section ( #468 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-29 22:34:56 +01:00
github-actions[bot]
cc46c938b6
chore: bump version to 2.8.1 [skip ci]
2024-11-29 13:04:48 +00:00
Michele Dolfi
dd8de46267
fix(cli): expose debug options ( #467 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-29 13:25:58 +01:00
Michele Dolfi
af63818df5
fix: remove unused deps ( #466 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-29 13:18:06 +01:00
Panos Vagenas
84c46fdeb3
docs: extend integration docs & README ( #456 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-28 09:41:21 +01:00
github-actions[bot]
211f4f7570
chore: bump version to 2.8.0 [skip ci]
2024-11-27 13:29:32 +00:00