Peter Staar
7368013669
reformatted the code
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-19 06:31:57 +01:00
Peter Staar
8c42f760a2
merged with main and resolved all conflicts
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-19 06:26:42 +01:00
Maxim Lysak
7a97d7119f
feat: Extracting picture data for raster images found in PPTX ( #349 )
...
* Added picture data for pptx pictures
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added tests for pptx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Inferring image DPI from pptx file
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-18 15:22:28 +01:00
Michele Dolfi
7dbdbdeaf3
ci: fix mergify ( #350 )
...
* no conv commit message
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix mergify rules
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-15 17:13:01 +01:00
Michele Dolfi
364d37ca96
ci(Mergify): configuration update ( #339 )
...
* ci(Mergify): configuration update
Signed-off-by: Michele Dolfi <null>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* remove conventionalcommits from the checklist
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <null>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-15 13:18:33 +01:00
Michele Dolfi
ca8524ecae
docs: add automatic generation of CLI reference ( #325 )
...
* docs: add automatic generation of CLI reference
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* install deps for building CLI ref
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-15 13:18:17 +01:00
Panos Vagenas
25fd149c38
docs: add architecture outline ( #341 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-15 12:52:41 +01:00
Carl
835e077b02
docs: fix parameter in usage.md ( #332 )
...
Signed-off-by: Carl Senze <carl.senze@aleph-alpha.com>
Co-authored-by: Carl Senze <carl.senze@aleph-alpha.com>
2024-11-15 09:24:15 +01:00
Maxim Lysak
8533039b0c
fix: Fixing images in the input Word files ( #330 )
...
* Fixing images identification in the input Word files
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Populating extracted image data into docling picture for wordx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Updated tests
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* removed base64 dependency in msword_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-14 13:33:34 +01:00
Panos Vagenas
bf2a85f1d4
chore: fix Qdrant notebook Colab link ( #319 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-14 10:42:02 +01:00
Peter Staar
f4fc6cfd4a
added TableFormerMode.ACCURATE as default in cli
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-14 07:45:36 +01:00
Michele Dolfi
8b437adcde
fix: reduce logging by keeping option for more verbose ( #323 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-13 10:08:24 +01:00
github-actions[bot]
5a44236ac2
chore: bump version to 2.5.2 [skip ci]
2024-11-13 08:19:09 +00:00
Michele Dolfi
c9341bf22e
fix: skip glm model downloads ( #322 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-13 08:45:28 +01:00
github-actions[bot]
2c0c439a44
chore: bump version to 2.5.1 [skip ci]
2024-11-12 14:56:34 +00:00
Maxim Lysak
fb8ba861e2
fix: Handling of single-cell tables in DOCX backend ( #314 )
...
* Handling of single-cell tables in DOCX backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* returned try-catch on tables handling
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* cleaned
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* proceed with processing the content of a single-cell table as if it's just part of the body
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added example of tricky 1-cell table docx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-12 15:20:55 +01:00
Anush
7f5d35ea3c
docs: Hybrid RAG with Qdrant ( #312 )
...
Signed-off-by: Anush008 <anushshetty90@gmail.com>
2024-11-12 15:18:14 +01:00
Panos Vagenas
93fc1be61a
docs: add Data Prep Kit integration ( #316 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-12 12:21:48 +01:00
github-actions[bot]
777237ebc9
chore: bump version to 2.5.0 [skip ci]
2024-11-12 10:19:55 +00:00
Christoph Auer
5d4a10b121
fix: Configure env prefix for docling settings ( #315 )
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-12 10:57:16 +01:00
Nikos Livathinos
c6b3763ecb
feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces full-page OCR scanning ( #290 )
...
- When the OCR is forced, any existing PDF cells are rejected.
- Introduce the force-ocr cmd parameter in docling CLI.
- Update unit tests.
- Add the full_page_ocr.py example in mkdocs.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-11-12 09:46:14 +01:00
Maxim Lysak
81c8243a8b
fix: Added handling of grouped elements in pptx backend ( #307 )
...
* Added handling of grouped elements in pptx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* updated log.warn to warning
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-11 16:38:21 +01:00
Maxim Lysak
53bf2d1790
Added handling of code blocks in html with <pre> tag ( #302 )
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-11 15:00:11 +01:00
Panos Vagenas
1239ade275
docs: add navigation indices ( #305 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-11 14:49:06 +01:00
Michele Dolfi
97f214efdd
fix: allow mps usage for easyocr ( #286 )
...
* fix: allow mps usage for easyocr
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add example for cpu-only
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* comment out example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-10 14:26:17 +01:00
github-actions[bot]
be8aa17291
chore: bump version to 2.4.2 [skip ci]
2024-11-08 16:31:47 +00:00
Nikos Livathinos
0eb065e9b6
fix(EasyOcrModel): Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr ( #282 )
...
fix(EasyOcrModel): Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr without GPU if MPS is available.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-11-08 16:48:41 +01:00
github-actions[bot]
118f162e64
chore: bump version to 2.4.1 [skip ci]
2024-11-08 12:37:36 +00:00
Nikos Livathinos
704d792a79
fix(tesserocr): Raise Exception if tesserocr has not loaded any languages ( #279 )
...
fix(TesseractOcrModel): Raise Exception if tesserocr has not loaded any languages. Provide a descriptive error message.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-11-08 13:03:09 +01:00
Peter Staar
9e54a74410
another fix to the tests
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-08 12:48:53 +01:00
Peter Staar
311640fb9d
reformatted the code
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-08 05:41:09 +01:00
Peter Staar
5c82ff9890
fixed the tests
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-07 05:15:13 +01:00
Peter Staar
b154d4f2d7
updated ground-truth
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-06 10:55:18 +01:00
Peter Staar
0a5817a36e
updated the html tests (2)
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-06 05:46:09 +01:00
Peter Staar
c7b9792d6b
updated the html tests
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-06 05:44:50 +01:00
Panos Vagenas
6c22cba0a7
chore: add issue templates ( #251 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-05 23:18:20 +01:00
Ikko Eltociear Ashimine
c3098e3c12
chore: fix typo ( #241 )
...
* chore: update pypdfium2_backend.py
occured -> occurred
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
* chore: update docling_parse_backend.py
occured -> occurred
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
* chore: update docling_parse_v2_backend.py
occured -> occurred
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
---------
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
2024-11-05 16:20:04 +01:00
Panos Vagenas
a84ec276b0
docs: update badges & credits ( #248 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-05 13:57:06 +01:00
Anthony R
90836db90a
fix: Dockerfile example copy command ( #234 )
...
Signed-off-by: Anthony R <anthonyringoet@gmail.com>
2024-11-05 12:48:27 +01:00
Panos Vagenas
5ce02c5c59
docs: add coming-soon section ( #235 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-05 08:53:02 +01:00
Panos Vagenas
d5e65aedac
docs: add artifacts-path param to CLI ( #233 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-05 08:51:21 +01:00
Peter Staar
ddd1474c8d
reformatted the code
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-05 07:25:21 +01:00
Peter Staar
3257034631
replace new lines and double spaces in list-items with single spaces
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-05 07:24:31 +01:00
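The whitespace cleanup described in the commit above might look roughly like the following. This is a minimal sketch, not the code from the commit: the function name `normalize_list_item_text` is hypothetical, and it assumes that collapsing every run of whitespace (newlines, tabs, repeated spaces) into one space is acceptable for list-item text.

```python
import re


def normalize_list_item_text(text: str) -> str:
    """Collapse newlines and repeated spaces into single spaces.

    Hypothetical helper for illustration; the real backend may treat
    tabs or leading/trailing whitespace differently.
    """
    # \s+ matches any run of whitespace, including newlines and tabs.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize_list_item_text("first line\nsecond  line"))
```

A single regular expression is used here instead of chained `str.replace` calls so that runs of three or more spaces are also reduced to one.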
Peter Staar
f276c0cc90
updated the html backend to add svg, remove empty list-items and use data-content fields
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-05 06:37:43 +01:00
github-actions[bot]
e30a9c25a2
chore: bump version to 2.4.0 [skip ci]
2024-11-04 15:11:09 +00:00
Panos Vagenas
862d78d271
chore: update pyproject.toml metadata ( #229 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-04 15:48:00 +01:00
Panos Vagenas
eeee3b4371
docs: add explicit artifacts path example ( #224 )
...
* docs: add explicit artifacts path example
[skip ci]
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* minor docs fix
[skip ci]
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* touch to trigger needed checks
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
---------
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-04 14:27:56 +01:00
Michele Dolfi
5f5fea90a9
docs: update custom convert and dockerfile ( #226 )
...
* docs: remove old code from custom_convert.py
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* docs: update example Dockerfile
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-04 14:27:40 +01:00
Vicky Sekhon
41acaa9e2e
docs: correct spelling of 'individual' ( #219 )
...
Signed-off-by: Vicky Sekhon <114193273+VickySekhon@users.noreply.github.com>
2024-11-04 14:27:02 +01:00
Michele Dolfi
40ad987303
feat: pdf backend, table mode as options and artifacts path ( #203 )
...
* feat: add more options in the CLI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* update CLI docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* expose artifacts-path as argument
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-04 14:26:05 +01:00