Commit Graph

285 Commits

Author SHA1 Message Date
Christoph Auer
05c8cb0fba Update HF model ref, reset test generate
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 20:05:42 +01:00
Christoph Auer
1de42bef6a Update tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 20:05:42 +01:00
Christoph Auer
5e013294f9 Update lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 20:05:42 +01:00
Christoph Auer
76a6b13a92 Rebase from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 16:32:48 +01:00
Christoph Auer
b66fb830c9 Merge pull request #556 from DS4SD/cau/layout-processing-improvement
feat: layout processing improvements and bugfixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 16:29:07 +01:00
Christoph Auer
184eed4095 Merge pull request #514 from DS4SD/nli/performance
feat(Accelerator): Introduce AI runtime configuration scheme
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 16:26:27 +01:00
Christoph Auer
861e6fa90c fix: Handle no result from RapidOcr reader (#558)
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 16:25:05 +01:00
Nikos Livathinos
5c69081453 fix: Ocr AcceleratorDevice
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 15:23:56 +00:00
Nikos Livathinos
6bc1bd2ec4 fix: Correct the way to set GPU for EasyOCR, RapidOCR
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 15:07:54 +00:00
Panos Vagenas
6f986d26e1 docs: update chunking usage docs, minor reorg (#550)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 16:03:02 +01:00
Nikos Livathinos
99ccb69a47 fix: Do proper check to set the device in EasyOCR, RapidOCR.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 14:46:21 +00:00
Michele Dolfi
1a3daf2ffb fix: make enum serializable with human-readable value (#555)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 13:12:44 +01:00
github-actions[bot]
ca83a1f0c9 chore: bump version to 2.10.0 [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 16:28:46 +00:00
Christoph Auer
440c16ff20 fix: Call into docling-core for legacy document transform (#551)
Call into docling-core for legacy document transform

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 17:06:47 +01:00
Christoph Auer
ce82e23b66 Merge branch 'release_v3' into nli/performance
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 16:52:54 +01:00
Christoph Auer
d006b937ad Rebase from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 16:52:26 +01:00
Nikos Livathinos
c21ada4b22 fix: Introduce Image format options in CLI. Silence the tqdm downloading messages. (#544)
* fix: main: Introduce format options for Image with the same pdf pipeline_options.
Add RapidOcrOptions to the Union of ocr_options for PdfPipelineOptions

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Silence the tqdm messages during the downloading of model files

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Code styling

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Use the HF API to disable the tqdm progress bars

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 15:57:37 +01:00
Christoph Auer
fbb28b851d Updated test ground-truth (again), bugfix for empty layout
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:50:04 +01:00
Christoph Auer
840f5e15ed feat: docling-parse v2 as default PDF backend (#549)
* Move to_docling_document from ds-glm to this repo

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Upgrade to ds-glm 1.0 and docling-parse 3.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lock

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix DP2 backend code, change CLI default backend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:26:17 +01:00
Christoph Auer
731e48ea43 Updated test ground-truth
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:19:38 +01:00
Nikos Livathinos
1149d3ae08 fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 11:12:28 +01:00
github-actions[bot]
d15d656c39 chore: bump version to 2.9.0 [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 09:33:55 +00:00
Panos Vagenas
48d2cb3505 feat: expose new hybrid chunker, update docs (#384)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 08:28:29 +01:00
Maxim Lysak
dc71b8c004 fix: Correcting DefaultText ID for MS Word backend (#537)
Correcting DefaultText ID for MS Word backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 15:48:35 +01:00
Maxim Lysak
c31d9f032e feat(MS Word backend): Make detection of headers and other styles localization agnostic (#534)
Using style id instead of style names, which should be localization agnostic

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 15:17:56 +01:00
Nikos Livathinos
f63e5ef3b5 fix: Improve the pydantic objects in the pipeline_options and imports.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 14:56:35 +01:00
Michele Dolfi
a38f57efce ci: allow ! in conventionalcommits (#533)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 14:50:10 +01:00
Sander Maijers
ba32fb8637 fix: Add py.typed marker file (#531)
feat: add `py.typed` marker file

See: https://typing.readthedocs.io/en/latest/spec/distributing.html#packaging-type-information

Signed-off-by: Sander Maijers <3374183+sanmai-NL@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 13:42:14 +01:00
Panos Vagenas
6f7b128867 docs: document new integrations (#532)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 13:18:14 +01:00
Peter W. J. Staar
54b4daa2dd fix: Enable HTML export in CLI and add options for image mode (#513)
* updated README

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* removed duck in title

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the index.md

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the cli to export html

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added html to cli

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* removed the duck emoji, added the  in the cli. Currently, the reference seems broken

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* cleaning up the comments

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reference is now working

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Clean up styling and docs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pin docling-core>=2.7.1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:37:57 +01:00
Maxim Lysak
63f1125d5c fix: Missing text in docx (t tag) when embedded in a table (#528)
Fix for missing text in docx (t tag) when embedded in a table

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:37:25 +01:00
Christoph Auer
71f3a7ac3c Rebase from release_v3
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:33:38 +01:00
Christoph Auer
b0da1a2127 Merge pull request #504 from DS4SD/cau/layout-postprocessing
feat: Support hierarchical layout components, expose and group content in pictures, forms and key-value regions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:26:34 +01:00
Michele Dolfi
bed92b766f fix: restore pydantic version pin after fixes (#512)
* test: pin new docling-core changes and release pydantic pinning

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin docling-core release

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 09:33:39 +01:00
Nikos Livathinos
3bb7df66ca feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI.
- Introduce AcceleratorOptions and AcceleratorDevice, and use them to set the device on which the models run.
- Introduce accelerator_utils with a function that decides the device and resolves the AUTO setting.
- Refactor how the docling-ibm-models are called to match the new init signature of the models.
- Translate the accelerator options into the specific inputs for third-party models.
- Extend the docling CLI with parameters to set num_threads and device.
- Add new unit tests.
- Add a new example showing how to use the accelerator options.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 17:29:09 +01:00
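
The device selection described in commit 3bb7df66ca can be sketched roughly as follows. This is a minimal illustration and not the code from the commit: the enum members, the `resolve_device` and `num_threads_from_env` helpers, the availability flags, the CPU fallback, and the `EXAMPLE_NUM_THREADS` variable name are all assumptions made for the example.

```python
import os
from enum import Enum


class AcceleratorDevice(str, Enum):
    """Illustrative device choices; AUTO defers the decision to runtime."""
    AUTO = "auto"
    CPU = "cpu"
    CUDA = "cuda"
    MPS = "mps"


def resolve_device(requested, cuda_available=False, mps_available=False):
    """Hypothetical helper: map a requested device to a concrete one.

    AUTO picks the first available accelerator and otherwise uses the CPU.
    An explicit request for an unavailable accelerator also falls back to
    the CPU (an assumption of this sketch).
    """
    requested = AcceleratorDevice(requested)
    available = {
        AcceleratorDevice.CUDA: cuda_available,
        AcceleratorDevice.MPS: mps_available,
    }
    if requested is AcceleratorDevice.AUTO:
        for device, is_available in available.items():
            if is_available:
                return device
        return AcceleratorDevice.CPU
    if requested in available and not available[requested]:
        return AcceleratorDevice.CPU
    return requested


def num_threads_from_env(default=4, var="EXAMPLE_NUM_THREADS"):
    """Hypothetical helper: read a thread count from an environment variable."""
    raw = os.environ.get(var)
    try:
        value = int(raw) if raw is not None else default
    except ValueError:
        # Ignore values that are not integers and keep the default.
        value = default
    return max(1, value)
```

With these definitions, `resolve_device("auto", cuda_available=True)` returns `AcceleratorDevice.CUDA`, while `resolve_device("auto")` with no accelerator available returns `AcceleratorDevice.CPU`.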
Christoph Auer
84f3548d30 Clean up imports again
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 15:22:43 +01:00
Michele Dolfi
e36f7d82f6 fix: folder input in cli (#511)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 14:22:00 +01:00
Christoph Auer
e97688cd3d Merge branch 'release_v3' of github.com:DS4SD/docling into cau/layout-postprocessing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 14:21:09 +01:00
Christoph Auer
11c7c43bad Move to_docling_document from ds-glm to this repo
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-04 13:11:41 +01:00
github-actions[bot]
78fad801fe chore: bump version to 2.8.3 [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 15:16:47 +00:00
Christoph Auer
0240ae2930 Pass nested clusters through GLM as payload
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 13:58:27 +01:00
Christoph Auer
4dcc738b6d Pass nested cluster processing through full pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 13:08:45 +01:00
Christoph Auer
0be736227f fix: improve handling of disallowed formats (#429)
* fix: Fixes and tests for StopIteration on .convert()

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Remove unnecessary case handling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Other test fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* improve handling of unsupported types

- Introduced new explicit exception types instead of `RuntimeError`
- Introduced new `ConversionStatus` value for unsupported formats
- Tidied up converter member typing & removed asserts

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* robustify & simplify format option resolution

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* rename new status, populate ConversionResult errors

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 12:45:32 +01:00
github-actions[bot]
25a0fa38d1 chore: bump version to 2.8.2 [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 10:47:29 +00:00
Michele Dolfi
9f35e368f6 chore: update numpy lock (#500)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 11:21:31 +01:00
guglie
a7e3f713bb fix: ParserError EOF inside string (#470) (#472)
Signed-off-by: guglie <gdguglie@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 11:21:18 +01:00
Michele Dolfi
a01cedbb69 docs: add styling for faq (#502)
* docs: add styling to faq

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove torchaudio

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 11:20:49 +01:00
Panos Vagenas
418d8159bd perf: prevent temp file leftovers, reuse core type (#487)
* chore: reuse DocumentStream from docling-core

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* update docling-core version

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* [skip ci] document  import line

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* fix: use new resolve_source_to_x functions to avoid tempfile leftovers (#490)

use new resolve_source_to_x functions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 10:40:28 +01:00
Christoph Auer
7245cc6080 Implement hierarchical cluster layout processing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 10:28:36 +01:00
Gaspard Petit
32e9b4a2cf fix: PermissionError when using tesseract_ocr_cli_model (#496)
Signed-off-by: Gaspard Petit <gaspardpetit@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-03 10:22:03 +01:00