Commit Graph

291 Commits

Author SHA1 Message Date
lucas-morin
c024275f24 (feat): Create a XML backend for PubMed documents based on the pubmed_parser library (merge conflicts) 2024-12-10 13:35:29 +01:00
github-actions[bot]
4db16aa82b chore: bump version to 2.10.0 [skip ci] 2024-12-10 13:27:38 +01:00
Christoph Auer
c5b7c2f510 fix: Call into docling-core for legacy document transform (#551)
Call into docling-core for legacy document transform

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 13:27:38 +01:00
Nikos Livathinos
5c4f84a4bf fix: Introduce Image format options in CLI. Silence the tqdm downloading messages. (#544)
* fix: main: Introduce format options for Image with the same pdf pipeline_options.
Add RapidOcrOptions to the Union of ocr_options for PdfPipelineOptions

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Silence the tqdm messages during the downloading of model files

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Code styling

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Use the HF API to disable the tqdm progress bars

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-12-10 13:27:38 +01:00
Christoph Auer
eff7970002 feat: docling-parse v2 as default PDF backend (#549)
* Move to_docling_document from ds-glm to this repo

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Upgrade to ds-glm 1.0 and docling-parse 3.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lock

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix DP2 backend code, change CLI default backend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 13:27:38 +01:00
github-actions[bot]
df4c9fd27b chore: bump version to 2.9.0 [skip ci] 2024-12-10 13:27:38 +01:00
Panos Vagenas
8e6f7c2305 feat: expose new hybrid chunker, update docs (#384)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-10 13:27:38 +01:00
Maxim Lysak
b15d71ba6f fix: Correcting DefaultText ID for MS Word backend (#537)
Correcting DefaultText ID for MS Word backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-12-10 13:27:38 +01:00
Maxim Lysak
27c9476e52 feat(MS Word backend): Make detection of headers and other styles localization agnostic (#534)
Using style id instead of style names, which should be localization agnostic

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-12-10 13:27:38 +01:00
Michele Dolfi
b3515f89ce ci: allow ! in conventionalcommits (#533)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-10 13:27:38 +01:00
Sander Maijers
db580b4959 fix: Add py.typed marker file (#531)
feat: add `py.typed` marker file

See: https://typing.readthedocs.io/en/latest/spec/distributing.html#packaging-type-information

Signed-off-by: Sander Maijers <3374183+sanmai-NL@users.noreply.github.com>
2024-12-10 13:27:38 +01:00
Panos Vagenas
b2a430a833 docs: document new integrations (#532)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-10 13:27:38 +01:00
Peter W. J. Staar
3e91514d2e fix: Enable HTML export in CLI and add options for image mode (#513)
* updated README

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* removed duck in title

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the index.md

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the cli to export html

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added html to cli

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* removed the duck emoji, added the  in the cli. Currently, the referenced seems broken

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* cleaning up the comments

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reference is now working

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Clean up styling and docs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pin docling-core>=2.7.1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-10 13:27:28 +01:00
Maxim Lysak
38a3e8decf fix: Missing text in docx (t tag) when embedded in a table (#528)
Fix for missing text in docx (t tag) when embedded in a table

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-12-10 13:27:12 +01:00
Michele Dolfi
592179630d fix: restore pydantic version pin after fixes (#512)
* test: pin new docling-core changes and release pydantic pinning

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin docling-core release

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-10 13:27:09 +01:00
Michele Dolfi
228c3d107e fix: folder input in cli (#511)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-10 13:26:22 +01:00
github-actions[bot]
6fc1710cb8 chore: bump version to 2.8.3 [skip ci] 2024-12-10 13:26:22 +01:00
Christoph Auer
319a7efe16 fix: improve handling of disallowed formats (#429)
* fix: Fixes and tests for StopIteration on .convert()

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Remove unnecessary case handling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Other test fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* improve handling of unsupported types

- Introduced new explicit exception types instead of `RuntimeError`
- Introduced new `ConversionStatus` value for unsupported formats
- Tidied up converter member typing & removed asserts

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* robustify & simplify format option resolution

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* rename new status, populate ConversionResult errors

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-10 13:26:20 +01:00
github-actions[bot]
90c708c89e chore: bump version to 2.8.2 [skip ci] 2024-12-10 13:25:05 +01:00
Michele Dolfi
b6b9817429 chore: update numpy lock (#500)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-10 13:25:02 +01:00
guglie
d1244a5c31 fix: ParserError EOF inside string (#470) (#472)
Signed-off-by: guglie <gdguglie@gmail.com>
2024-12-10 13:24:43 +01:00
Michele Dolfi
756005e271 docs: add styling for faq (#502)
* docs: add styling to faq

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove torchaudio

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-10 13:24:43 +01:00
Panos Vagenas
b80b35c7c9 perf: prevent temp file leftovers, reuse core type (#487)
* chore: reuse DocumentStream from docling-core

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* update docling-core version

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* [skip ci] document  import line

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* fix: use new resolve_source_to_x functions to avoid tempfile leftovers (#490)

use new resolve_source_to_x functions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2024-12-10 13:24:36 +01:00
Gaspard Petit
2f4d38f4da fix: PermissionError when using tesseract_ocr_cli_model (#496)
Signed-off-by: Gaspard Petit <gaspardpetit@gmail.com>
2024-12-10 13:24:04 +01:00
Álvaro Huertas
7c195829f3 docs: typo in faq (#484)
Typo faq.md

Signed-off-by: Álvaro Huertas <123009293+huertin03@users.noreply.github.com>
2024-12-10 13:24:04 +01:00
Michele Dolfi
6a8ad8a3eb docs: add automatic api reference (#475)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-10 13:24:00 +01:00
Michele Dolfi
5aec43397d docs: introduce faq section (#468)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-10 13:22:33 +01:00
github-actions[bot]
b52fce3f27 chore: bump version to 2.8.1 [skip ci] 2024-12-10 13:22:33 +01:00
Michele Dolfi
d4c5d9a893 fix(cli): expose debug options (#467)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-10 13:22:33 +01:00
Michele Dolfi
76e6d93ce2 fix: remove unused deps (#466)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-10 13:22:16 +01:00
lucas-morin
71231790dc feat: Create XML backend for PubMed documents and resolve conflicts 2024-12-10 13:18:40 +01:00
lucas-morin
a06ee134dc Merge remote-tracking branch 'origin/main' into dev/xml-backend 2024-12-09 16:24:49 +01:00
lucas-morin
dd214b2b6e fix conflicts 2024-12-09 16:24:20 +01:00
Nikos Livathinos
78f61a8522 fix: Introduce Image format options in CLI. Silence the tqdm downloading messages. (#544)
* fix: main: Introduce format options for Image with the same pdf pipeline_options.
Add RapidOcrOptions to the Union of ocr_options for PdfPipelineOptions

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Silence the tqdm messages during the downloading of model files

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Code styling

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Use the HF API to disable the tqdm progress bars

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-12-09 15:57:37 +01:00
lucas-morin
3240955db9 Create a XML backend for PubMed documents based on the pubmed_parser library 2024-12-09 15:50:10 +01:00
Christoph Auer
aca57f0527 feat: docling-parse v2 as default PDF backend (#549)
* Move to_docling_document from ds-glm to this repo

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Upgrade to ds-glm 1.0 and docling-parse 3.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lock

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix DP2 backend code, change CLI default backend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:26:17 +01:00
github-actions[bot]
9fd2cf847a chore: bump version to 2.9.0 [skip ci] 2024-12-09 09:33:55 +00:00
Panos Vagenas
c8ecdd987e feat: expose new hybrid chunker, update docs (#384)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-09 08:28:29 +01:00
Maxim Lysak
eb7ffcdd1c fix: Correcting DefaultText ID for MS Word backend (#537)
Correcting DefaultText ID for MS Word backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-12-06 15:48:35 +01:00
Maxim Lysak
3e073dfbeb feat(MS Word backend): Make detection of headers and other styles localization agnostic (#534)
Using style id instead of style names, which should be localization agnostic

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-12-06 15:17:56 +01:00
Michele Dolfi
53039a8367 ci: allow ! in conventionalcommits (#533)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-06 14:50:10 +01:00
Sander Maijers
9102fe1adc fix: Add py.typed marker file (#531)
feat: add `py.typed` marker file

See: https://typing.readthedocs.io/en/latest/spec/distributing.html#packaging-type-information

Signed-off-by: Sander Maijers <3374183+sanmai-NL@users.noreply.github.com>
2024-12-06 13:42:14 +01:00
Panos Vagenas
e780333440 docs: document new integrations (#532)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-06 13:18:14 +01:00
Peter W. J. Staar
0d11e30dd8 fix: Enable HTML export in CLI and add options for image mode (#513)
* updated README

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* removed duck in title

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the index.md

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the cli to export html

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added html to cli

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* removed the duck emoji, added the  in the cli. Currently, the referenced seems broken

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* cleaning up the comments

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reference is now working

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Clean up styling and docs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pin docling-core>=2.7.1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:37:57 +01:00
Maxim Lysak
b730b2d7a0 fix: Missing text in docx (t tag) when embedded in a table (#528)
Fix for missing text in docx (t tag) when embedded in a table

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-12-06 12:37:25 +01:00
Michele Dolfi
c830b92b2e fix: restore pydantic version pin after fixes (#512)
* test: pin new docling-core changes and release pydantic pinning

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin docling-core release

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-06 09:33:39 +01:00
lucas-morin
7867014d0b Create a XML backend for PubMed documents based on the pubmed_parser library 2024-12-05 13:20:00 +01:00
lucas-morin
6c818d0926 Create a XML backend for PubMed documents based on the pubmed_parser library 2024-12-05 13:18:22 +01:00
Michele Dolfi
8ada0bccc7 fix: folder input in cli (#511)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-04 14:22:00 +01:00
github-actions[bot]
9c788ae778 chore: bump version to 2.8.3 [skip ci] 2024-12-03 15:16:47 +00:00