Panos Vagenas
bf2a85f1d4
chore: fix Qdrant notebook Colab link ( #319 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-14 10:42:02 +01:00
Peter Staar
f4fc6cfd4a
added TableFormerMode.ACCURATE as default in cli
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-14 07:45:36 +01:00
Michele Dolfi
8b437adcde
fix: reduce logging by keeping option for more verbose ( #323 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-13 10:08:24 +01:00
github-actions[bot]
5a44236ac2
chore: bump version to 2.5.2 [skip ci]
2024-11-13 08:19:09 +00:00
Michele Dolfi
c9341bf22e
fix: skip glm model downloads ( #322 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-13 08:45:28 +01:00
github-actions[bot]
2c0c439a44
chore: bump version to 2.5.1 [skip ci]
2024-11-12 14:56:34 +00:00
Maxim Lysak
fb8ba861e2
fix: Handling of single-cell tables in DOCX backend ( #314 )
...
* Handling of single-cell tables in DOCX backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* returned try-catch on tables handling
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* cleaned
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* proceed processing the content of a single-cell table as if it's just part of the body
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added example of tricky 1 cell table docx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-12 15:20:55 +01:00
Anush
7f5d35ea3c
docs: Hybrid RAG with Qdrant ( #312 )
...
Signed-off-by: Anush008 <anushshetty90@gmail.com>
2024-11-12 15:18:14 +01:00
Panos Vagenas
93fc1be61a
docs: add Data Prep Kit integration ( #316 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-12 12:21:48 +01:00
github-actions[bot]
777237ebc9
chore: bump version to 2.5.0 [skip ci]
2024-11-12 10:19:55 +00:00
Christoph Auer
5d4a10b121
fix: Configure env prefix for docling settings ( #315 )
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-12 10:57:16 +01:00
Nikos Livathinos
c6b3763ecb
feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning ( #290 )
...
- When the OCR is forced, any existing PDF cells are rejected.
- Introduce the force-ocr cmd parameter in docling CLI.
- Update unit tests.
- Add the full_page_ocr.py example in mkdocs.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-11-12 09:46:14 +01:00
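The cell-rejection rule described in the commit above can be illustrated with a minimal sketch. Only the parameter name force_full_page_ocr comes from the commit message; the Cell class and the select_cells helper are hypothetical stand-ins and are not docling's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """Hypothetical text cell; not a docling class."""
    text: str
    source: str  # "pdf" for programmatic PDF text, "ocr" for OCR output


def select_cells(
    pdf_cells: List[Cell],
    ocr_cells: List[Cell],
    force_full_page_ocr: bool,
) -> List[Cell]:
    """Hypothetical helper showing the rule from the commit message."""
    if force_full_page_ocr:
        # Forced OCR: reject every existing PDF cell, keep only OCR output
        return list(ocr_cells)
    # Assumption for this sketch: otherwise both sources are kept
    return list(pdf_cells) + list(ocr_cells)


pdf = [Cell("embedded text", "pdf")]
ocr = [Cell("scanned text", "ocr")]
forced = select_cells(pdf, ocr, force_full_page_ocr=True)
default = select_cells(pdf, ocr, force_full_page_ocr=False)
```

With the flag set, forced holds only the OCR cell; without it, this sketch keeps both, which is an assumption about the non-forced path rather than something the commit states.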
Maxim Lysak
81c8243a8b
fix: Added handling of grouped elements in pptx backend ( #307 )
...
* Added handling of grouped elements in pptx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* updated log.warn to warning
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-11 16:38:21 +01:00
Maxim Lysak
53bf2d1790
Added handling of code blocks in html with <pre> tag ( #302 )
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-11 15:00:11 +01:00
Panos Vagenas
1239ade275
docs: add navigation indices ( #305 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-11 14:49:06 +01:00
Michele Dolfi
97f214efdd
fix: allow mps usage for easyocr ( #286 )
...
* fix: allow mps usage for easyocr
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add example for cpu-only
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* comment out example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-10 14:26:17 +01:00
github-actions[bot]
be8aa17291
chore: bump version to 2.4.2 [skip ci]
2024-11-08 16:31:47 +00:00
Nikos Livathinos
0eb065e9b6
fix(EasyOcrModel): Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr ( #282 )
...
fix(EasyOcrModel): Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr without GPU if MPS is available.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-11-08 16:48:41 +01:00
github-actions[bot]
118f162e64
chore: bump version to 2.4.1 [skip ci]
2024-11-08 12:37:36 +00:00
Nikos Livathinos
704d792a79
fix(tesserocr): Raise Exception if tesserocr has not loaded any languages ( #279 )
...
fix(TesseractOcrModel): Raise Exception if tesserocr has not loaded any languages. Provide a descriptive error message.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-11-08 13:03:09 +01:00
Peter Staar
9e54a74410
another fix to the tests
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-08 12:48:53 +01:00
Peter Staar
311640fb9d
reformatted the code
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-08 05:41:09 +01:00
Peter Staar
5c82ff9890
fixed the tests
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-07 05:15:13 +01:00
Peter Staar
b154d4f2d7
updated ground-truth
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-06 10:55:18 +01:00
Peter Staar
0a5817a36e
updated the html tests (2)
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-06 05:46:09 +01:00
Peter Staar
c7b9792d6b
updated the html tests
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-06 05:44:50 +01:00
Panos Vagenas
6c22cba0a7
chore: add issue templates ( #251 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-05 23:18:20 +01:00
Ikko Eltociear Ashimine
c3098e3c12
chore: fix typo ( #241 )
...
* chore: update pypdfium2_backend.py
occured -> occurred
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
* chore: update docling_parse_backend.py
occured -> occurred
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
* chore: update docling_parse_v2_backend.py
occured -> occurred
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
---------
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
2024-11-05 16:20:04 +01:00
Panos Vagenas
a84ec276b0
docs: update badges & credits ( #248 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-05 13:57:06 +01:00
Anthony R
90836db90a
fix: Dockerfile example copy command ( #234 )
...
Signed-off-by: Anthony R <anthonyringoet@gmail.com>
2024-11-05 12:48:27 +01:00
Panos Vagenas
5ce02c5c59
docs: add coming-soon section ( #235 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-05 08:53:02 +01:00
Panos Vagenas
d5e65aedac
docs: add artifacts-path param to CLI ( #233 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-05 08:51:21 +01:00
Peter Staar
ddd1474c8d
reformatted the code
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-05 07:25:21 +01:00
Peter Staar
3257034631
replace new lines and double spaces in list-items with single spaces
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-05 07:24:31 +01:00
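The normalization named in the commit above could look roughly like the following. This is a sketch only: the function name normalize_list_item_text is hypothetical, and trimming the ends of the string is an added assumption that the commit message does not mention.

```python
import re


def normalize_list_item_text(text: str) -> str:
    """Hypothetical helper: flatten whitespace inside a list item."""
    # Replace every newline with a single space
    text = text.replace("\n", " ")
    # Collapse runs of two or more spaces into one space
    text = re.sub(r" {2,}", " ", text)
    # Assumption: also trim leading and trailing spaces
    return text.strip()


flattened = normalize_list_item_text("first line\nsecond  line")
```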
Peter Staar
f276c0cc90
updated the html backend to add svg, remove empty list-items and use data-content fields
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-05 06:37:43 +01:00
github-actions[bot]
e30a9c25a2
chore: bump version to 2.4.0 [skip ci]
2024-11-04 15:11:09 +00:00
Panos Vagenas
862d78d271
chore: update pyproject.toml metadata ( #229 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-04 15:48:00 +01:00
Panos Vagenas
eeee3b4371
docs: add explicit artifacts path example ( #224 )
...
* docs: add explicit artifacts path example
[skip ci]
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* minor docs fix
[skip ci]
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* touch to trigger needed checks
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
---------
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-04 14:27:56 +01:00
Michele Dolfi
5f5fea90a9
docs: update custom convert and dockerfile ( #226 )
...
* docs: remove old code from custom_convert.py
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* docs: update example Dockerfile
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-04 14:27:40 +01:00
Vicky Sekhon
41acaa9e2e
docs: correct spelling of 'individual' ( #219 )
...
Signed-off-by: Vicky Sekhon <114193273+VickySekhon@users.noreply.github.com>
2024-11-04 14:27:02 +01:00
Michele Dolfi
40ad987303
feat: pdf backend, table mode as options and artifacts path ( #203 )
...
* feat: add more options in the CLI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* update CLI docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* expose artifacts-path as argument
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-04 14:26:05 +01:00
Johnny Salazar
af323c04ef
fix: Specify encoding when writing output file ( #214 )
...
Specify the encoding when writing the output file to avoid errors when the default target encoding cannot represent all characters. UTF-8 seems like the most universal and widely supported encoding. Otherwise, the CLI fails with encoding errors when the input file contains Unicode text (as most files do nowadays) and the target system's default encoding is a single-byte charset such as cp1252.
Signed-off-by: Johnny Salazar <cepera.ang@gmail.com>
2024-11-04 14:24:13 +01:00
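The failure mode described in the commit above can be reproduced with the standard library alone. The sketch below is not the CLI's actual code; write_text_output and the sample string are hypothetical, and the point is only that passing encoding="utf-8" explicitly removes the dependence on the platform default.

```python
import tempfile
from pathlib import Path


def write_text_output(path: Path, text: str) -> None:
    """Hypothetical helper: write text with an explicit encoding."""
    # Without encoding=..., open() falls back to the locale's default
    # encoding, which may be a single-byte charset such as cp1252.
    with path.open("w", encoding="utf-8") as fp:
        fp.write(text)


def round_trip(text: str) -> str:
    """Write the text to a temporary file and read it back as UTF-8."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "output.md"
        write_text_output(out, text)
        return out.read_text(encoding="utf-8")


sample = "Zürich, naïve, 数据"
restored = round_trip(sample)

# The same text cannot be encoded in cp1252, which is what makes a
# write that relies on such a default fail.
try:
    sample.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False
```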
Panos Vagenas
8fb445f46c
chore: make tests lighter ( #228 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-04 14:02:28 +01:00
Peter Staar
5fc4d5bd3d
work-in-progress: dealing with in attributes of html elements
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-02 09:27:07 +01:00
Panos Vagenas
244ca69cfd
docs: update LlamaIndex docs ( #196 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-01 20:55:28 +01:00
Peter Staar
473ad9a032
add the skip_furniture parameter
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-01 11:32:56 +01:00
Peter Staar
ebe0b203c8
added the detection of h1 and the skip_furniture parameter
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-31 16:06:41 +01:00
Peter Staar
c52e68c52b
feat: add ability to detect h1 and filter from there-on
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-31 15:50:26 +01:00
github-actions[bot]
9d8865856d
chore: bump version to 2.3.1 [skip ci]
2024-10-30 18:23:53 +00:00
Michele Dolfi
eb679ccbb4
fix: simplify torch dependencies and update pinned docling deps ( #190 )
...
* fix: simplify torch dependencies and update pinned docling deps
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* update docling-ibm-models
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-30 18:44:08 +01:00