Christoph Auer
5a82f2b51e
Rebase from main
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-11 12:30:45 +01:00
Christoph Auer
aee9c0b324
fix: Do not import python modules from deepsearch-glm ( #569 )
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-11 12:29:06 +01:00
Christoph Auer
f4512d0e97
Update HF model ref, reset test generate
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 20:05:42 +01:00
Christoph Auer
c8b59151d7
Update tests
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 20:05:42 +01:00
Christoph Auer
bd30b46356
Update lockfile
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 20:05:42 +01:00
Christoph Auer
586abd58ec
Rebase from main
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 16:32:48 +01:00
Christoph Auer
cd579fd28e
Merge pull request #556 from DS4SD/cau/layout-processing-improvement
...
feat: layout processing improvements and bugfixes
2024-12-10 16:29:07 +01:00
Christoph Auer
e282bfd8c8
Merge pull request #514 from DS4SD/nli/performance
...
feat(Accelerator): Introduce AI runtime configuration scheme
2024-12-10 16:26:27 +01:00
Christoph Auer
f45499ce93
fix: Handle no result from RapidOcr reader ( #558 )
...
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-12-10 16:25:05 +01:00
Nikos Livathinos
f46fd9c0a6
fix: Ocr AcceleratorDevice
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-12-10 15:23:56 +00:00
Nikos Livathinos
94caee3fb5
fix: Correct the way to set GPU for EasyOCR, RapidOCR
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-12-10 15:07:54 +00:00
Panos Vagenas
d0c9e8e508
docs: update chunking usage docs, minor reorg ( #550 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-10 16:03:02 +01:00
Nikos Livathinos
accb7b4481
fix: Do proper check to set the device in EasyOCR, RapidOCR.
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-12-10 14:46:21 +00:00
Michele Dolfi
a7df337654
fix: make enum serializable with human-readable value ( #555 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-10 13:12:44 +01:00
github-actions[bot]
eb30c4f763
chore: bump version to 2.10.0 [skip ci]
2024-12-09 16:28:46 +00:00
Christoph Auer
7972d47f88
fix: Call into docling-core for legacy document transform ( #551 )
...
Call into docling-core for legacy document transform
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 17:06:47 +01:00
Christoph Auer
bb1774dd6b
Merge branch 'release_v3' into nli/performance
2024-12-09 16:52:54 +01:00
Christoph Auer
9e99e242dc
Rebase from main
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 16:52:26 +01:00
Nikos Livathinos
78f61a8522
fix: Introduce Image format options in CLI. Silence the tqdm downloading messages. ( #544 )
...
* fix: main: Introduce format options for Image with the same pdf pipeline_options.
Add RapidOcrOptions to the Union of ocr_options for PdfPipelineOptions
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Silence the tqdm messages during the downloading of model files
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Code styling
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Use the HF API to disable the tqdm progress bars
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-12-09 15:57:37 +01:00
Christoph Auer
46ae215b68
Updated test ground-truth (again), bugfix for empty layout
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:50:04 +01:00
Christoph Auer
aca57f0527
feat: docling-parse v2 as default PDF backend ( #549 )
...
* Move to_docling_document from ds-glm to this repo
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Upgrade to ds-glm 1.0 and docling-parse 3.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lock
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix DP2 backend code, change CLI default backend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:26:17 +01:00
Christoph Auer
03f8690c62
Updated test ground-truth
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:19:38 +01:00
Nikos Livathinos
5d5d14d00c
fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-12-09 11:12:28 +01:00
github-actions[bot]
9fd2cf847a
chore: bump version to 2.9.0 [skip ci]
2024-12-09 09:33:55 +00:00
Panos Vagenas
c8ecdd987e
feat: expose new hybrid chunker, update docs ( #384 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-09 08:28:29 +01:00
Maxim Lysak
eb7ffcdd1c
fix: Correcting DefaultText ID for MS Word backend ( #537 )
...
Correcting DefaultText ID for MS Word backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-12-06 15:48:35 +01:00
Maxim Lysak
3e073dfbeb
feat(MS Word backend): Make detection of headers and other styles localization agnostic ( #534 )
...
Using style id instead of style names, which should be localization agnostic
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-12-06 15:17:56 +01:00
Nikos Livathinos
975fe076f4
fix: Improve the pydantic objects in the pipeline_options and imports.
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-12-06 14:56:35 +01:00
Michele Dolfi
53039a8367
ci: allow ! in conventionalcommits ( #533 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-06 14:50:10 +01:00
Sander Maijers
9102fe1adc
fix: Add py.typed marker file ( #531 )
...
feat: add `py.typed` marker file
See: https://typing.readthedocs.io/en/latest/spec/distributing.html#packaging-type-information
Signed-off-by: Sander Maijers <3374183+sanmai-NL@users.noreply.github.com>
2024-12-06 13:42:14 +01:00
Panos Vagenas
e780333440
docs: document new integrations ( #532 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-06 13:18:14 +01:00
Peter W. J. Staar
0d11e30dd8
fix: Enable HTML export in CLI and add options for image mode ( #513 )
...
* updated README
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* removed duck in title
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the index.md
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the cli to export html
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added html to cli
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* removed the duck emoji, added the in the cli. Currently, the referenced seems broken
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* cleaning up the comments
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reference is now working
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Clean up styling and docs
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pin docling-core>=2.7.1
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:37:57 +01:00
Maxim Lysak
b730b2d7a0
fix: Missing text in docx (t tag) when embedded in a table ( #528 )
...
Fix for missing text in docx (t tag) when embedded in a table
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-12-06 12:37:25 +01:00
Christoph Auer
6f0b91287c
Rebase from release_v3
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:33:38 +01:00
Christoph Auer
40d7a8e293
Merge pull request #504 from DS4SD/cau/layout-postprocessing
...
feat: Support hierarchical layout components, expose and group content in pictures, forms and key-value regions
2024-12-06 12:26:34 +01:00
Michele Dolfi
c830b92b2e
fix: restore pydantic version pin after fixes ( #512 )
...
* test: pin new docling-core changes and release pydantic pinning
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* pin docling-core release
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-06 09:33:39 +01:00
Nikos Livathinos
ddb8ad9227
feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI.
...
- Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run.
- Introduce the accelerator_utils with a function to decide the device and resolve the AUTO setting.
- Refactor how the docling-ibm-models are called to match the new init signature of the models.
- Translate the accelerator options to the specific inputs for third-party models.
- Extend the docling CLI with parameters to set the num_threads and device.
- Add new unit tests.
- Write a new example showing how to use the accelerator options.
2024-12-04 17:29:09 +01:00
Christoph Auer
8b04edd177
Clean up imports again
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 15:22:43 +01:00
Michele Dolfi
8ada0bccc7
fix: folder input in cli ( #511 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-04 14:22:00 +01:00
Christoph Auer
e8266425ac
Merge branch 'release_v3' of github.com:DS4SD/docling into cau/layout-postprocessing
2024-12-04 14:21:09 +01:00
Christoph Auer
a1ac0c66ef
Move to_docling_document from ds-glm to this repo
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 13:11:41 +01:00
github-actions[bot]
9c788ae778
chore: bump version to 2.8.3 [skip ci]
2024-12-03 15:16:47 +00:00
Christoph Auer
65fa584a1a
Pass nested clusters through GLM as payload
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 13:58:27 +01:00
Christoph Auer
db70916f57
Pass nested cluster processing through full pipeline
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 13:08:45 +01:00
Christoph Auer
34c7c79858
fix: improve handling of disallowed formats ( #429 )
...
* fix: Fixes and tests for StopIteration on .convert()
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Remove unnecessary case handling
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Other test fixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* improve handling of unsupported types
- Introduced new explicit exception types instead of `RuntimeError`
- Introduced new `ConversionStatus` value for unsupported formats
- Tidied up converter member typing & removed asserts
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* robustify & simplify format option resolution
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* rename new status, populate ConversionResult errors
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-03 12:45:32 +01:00
github-actions[bot]
2254845da3
chore: bump version to 2.8.2 [skip ci]
2024-12-03 10:47:29 +00:00
Michele Dolfi
672962a8b2
chore: update numpy lock ( #500 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-03 11:21:31 +01:00
guglie
c90c41c391
fix: ParserError EOF inside string ( #470 ) ( #472 )
...
Signed-off-by: guglie <gdguglie@gmail.com>
2024-12-03 11:21:18 +01:00
Michele Dolfi
5ba3807f31
docs: add styling for faq ( #502 )
...
* docs: add styling to faq
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* remove torchaudio
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-03 11:20:49 +01:00
Panos Vagenas
051789d017
perf: prevent temp file leftovers, reuse core type ( #487 )
...
* chore: reuse DocumentStream from docling-core
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* update docling-core version
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* [skip ci] document import line
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* fix: use new resolve_source_to_x functions to avoid tempfile leftovers ( #490 )
use new resolve_source_to_x functions
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2024-12-03 10:40:28 +01:00