Nikos Livathinos
dae2a3b667
fix: remove stderr from tesseract cli and introduce fuzziness in the text validation of OCR tests ( #138 )
...
* feat(OCR tests): Introduce fuzziness in the text validation of OCR tests
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
* fix(TesseractOcrCliModel): Send the stderr to devnull to avoid polluting the console with messages from tesseract cmd
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
2024-10-11 10:21:19 +02:00
Panos Vagenas
5f1bd9e9c8
docs: simplify LlamaIndex example using Docling extension ( #135 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-10-09 22:17:56 +02:00
Panos Vagenas
6924999f1f
chore: explicitly manage pandas dependency ( #134 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-10-09 14:50:39 +02:00
github-actions[bot]
0ffc1708d2
chore: bump version to 1.19.0 [skip ci]
v1.19.0
2024-10-08 17:42:29 +00:00
Michele Dolfi
f96ea86a00
feat: add options for choosing OCR engines ( #118 )
...
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com >
Co-authored-by: Peter Staar <taa@zurich.ibm.com >
2024-10-08 19:07:08 +02:00
Fasal Shah
d412c363d7
fixed unload pdf backend resources ( #129 )
...
Signed-off-by: faisal shah <fashah@redhat.com >
Co-authored-by: faisal shah <fashah@redhat.com >
2024-10-08 10:46:43 +02:00
github-actions[bot]
9b82ae3324
chore: bump version to 1.18.0 [skip ci]
v1.18.0
2024-10-03 17:16:00 +00:00
Maxim Lysak
2422f706a1
feat: new torch-based docling models ( #120 )
...
---------
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com >
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com >
2024-10-03 18:42:33 +02:00
github-actions[bot]
9ebbbc1245
chore: bump version to 1.17.0 [skip ci]
v1.17.0
2024-10-03 13:44:52 +00:00
Rui Dias Gomes
dde0aff8bd
update examples ( #123 )
...
Signed-off-by: rmdg88 <rmdg88@gmail.com >
2024-10-03 14:28:25 +02:00
Michele Dolfi
d44c62d7ce
feat: windows support ( #122 )
...
* feat: windows support
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add Windows in README
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-10-03 14:23:47 +02:00
github-actions[bot]
cde671cf34
chore: bump version to 1.16.1 [skip ci]
v1.16.1
2024-09-27 14:36:40 +00:00
Michele Dolfi
34bd887a7f
fix: allow usage of opencv 4.6.x ( #110 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-27 15:51:43 +02:00
Panos Vagenas
c05b692d69
docs: document chunking ( #111 )
...
[skip ci]
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-09-27 11:16:04 +02:00
github-actions[bot]
6760571fe1
chore: bump version to 1.16.0 [skip ci]
v1.16.0
2024-09-27 06:21:15 +00:00
Christoph Auer
d6df76f90b
feat: Support tableformer model choice ( #90 )
...
* Support tableformer model choice
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update datamodel structure
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update docs
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Cleanup
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add test unit for table options
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Ensure import backwards-compatibility for PipelineOptions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update README
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Adjust parameters on custom_convert
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
* Update Dockerfile
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
2024-09-26 21:37:08 +02:00
Panos Vagenas
39977b5631
chore: move examples extras to respective group ( #103 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-09-25 15:47:48 +02:00
github-actions[bot]
3dfd02a7e9
chore: bump version to 1.15.0 [skip ci]
v1.15.0
2024-09-24 15:58:16 +00:00
Michele Dolfi
6a03c208ec
feat: add figure in markdown ( #98 )
...
* feat: add figures in markdown
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update to new docling-core and update test results with figures
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update with improved docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-24 17:28:23 +02:00
github-actions[bot]
001d214a13
chore: bump version to 1.14.0 [skip ci]
v1.14.0
2024-09-24 13:38:23 +00:00
Panos Vagenas
d96b96c848
fix: fix OCR setting for pypdfium, minor refactor ( #102 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-09-24 14:36:00 +02:00
Panos Vagenas
f8f2303348
docs: document CLI, minor README revamp ( #100 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-09-24 09:21:28 +02:00
Panos Vagenas
f555815343
chore: add RAG notebook titles ( #101 )
...
[skip ci]
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-09-24 09:17:46 +02:00
Panos Vagenas
3c46e4266c
feat: add URL support to CLI ( #99 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-09-24 08:47:53 +02:00
github-actions[bot]
c65a01c9b7
chore: bump version to 1.13.1 [skip ci]
v1.13.1
2024-09-23 19:04:01 +00:00
Peter W. J. Staar
4794ce460a
fix: updated the render_as_doctags with the new arguments from docling-core ( #93 )
...
* updated the render_as_doctags with the new arguments from docling-core
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* ensuring that docling-core is >1.5.0 to accommodate the latest export-to-doctags parameters
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added the doctags tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the README
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fix poetry lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* Fix formatting problems
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* fixed the doctag export in docling/utils/export.py
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* propagate xsize and ysize
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2024-09-23 20:12:18 +02:00
Maxim Lysak
dce9934a0f
Updated to new, clean vector logo, svg and rendered png are provided ( #96 )
...
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com >
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com >
2024-09-23 15:31:21 +02:00
Michele Dolfi
1f4b224ab6
chore: switch to gh apps user ( #92 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-20 17:02:27 +02:00
github-actions[bot]
6dd1e91c4a
chore: bump version to 1.13.0 [skip ci]
v1.13.0
2024-09-18 09:26:03 +00:00
Maxim Lysak
0da7519896
docs: updated Docling logo.png with transparent background ( #88 )
...
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com >
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com >
2024-09-18 10:39:11 +02:00
Michele Dolfi
f19bd43798
feat: add table exports ( #86 )
...
* feat: expose docling-core table exporters and add examples
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* remove temp internal implementation of html export
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* pin latest docling-core 1.4.0 with table exports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-18 08:44:13 +02:00
Peter W. J. Staar
442443a102
fix: bumped the glm version and adjusted the tests ( #83 )
...
* bumped the glm version and adjusted the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the poetry lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fix hooks
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* fixed the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added the tests for tables
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-18 07:43:49 +02:00
github-actions[bot]
8242bce4fa
chore: bump version to 1.12.2 [skip ci]
v1.12.2
2024-09-17 16:01:34 +00:00
Nikos Livathinos
fa9699fa3c
fix(tests): Adjust the test data to match the new version of LayoutPredictor ( #82 )
...
* fix(tests): Adjust the test data to match the new version of LayoutPredictor from docling-ibm-models
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
* chore: Update poetry to use `docling-ibm-models` at version `v1.2.0`
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
2024-09-17 15:50:35 +02:00
Michele Dolfi
30a0ef69b4
chore: Add PR template ( #81 )
...
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
2024-09-16 18:36:26 +02:00
github-actions[bot]
f1932fd8c5
chore: bump version to 1.12.1 [skip ci]
v1.12.1
2024-09-16 10:58:09 +00:00
Michele Dolfi
2870fdc857
fix: CLI compatibility with python 3.10 and 3.11 ( #79 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-16 12:32:45 +02:00
github-actions[bot]
34b2772a2e
chore: bump version to 1.12.0 [skip ci]
v1.12.0
2024-09-13 12:34:15 +00:00
Peter W. J. Staar
98990784df
feat: add docling cli ( #75 )
...
* chore: add simple convert script
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted all
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted all
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added default arg
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* use typer for the docling CLI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* describe output when saving
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add tests for CLI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add export options
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-13 14:03:09 +02:00
Michele Dolfi
8aa476ccd3
test: improve typing definitions (part 1) ( #72 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-12 15:56:29 +02:00
Panos Vagenas
53569a1023
docs: showcase RAG with LlamaIndex and LangChain ( #71 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-09-11 15:07:08 +02:00
Michele Dolfi
79932b7d69
test: check for stable obj_type ( #70 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-11 12:53:59 +02:00
github-actions[bot]
e66dc53765
chore: bump version to 1.11.0 [skip ci]
v1.11.0
2024-09-10 16:18:59 +00:00
Peter W. J. Staar
bdfdfbf092
feat: adding txt and doctags output ( #68 )
...
* feat: adding txt and doctags output
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* cleaned up the export
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* Fix datamodel usage for Figure
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* updated all the examples to deal with new rendering
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2024-09-10 17:30:52 +02:00
github-actions[bot]
cd5b6293cc
chore: bump version to 1.10.0 [skip ci]
v1.10.0
2024-09-10 14:38:07 +00:00
Michele Dolfi
27a7a152e1
feat: linux arm64 support and reducing dependencies ( #69 )
...
* feat: linux arm64 support and reducing dependencies
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* downgrade pyarrow for wider support
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-10 15:43:27 +02:00
Panos Vagenas
1051eb9465
chore: update README ( #65 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-09-09 12:03:04 +02:00
Michele Dolfi
6f1811e050
chore: fix placeholders in license ( #63 )
...
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
2024-09-06 17:10:07 +02:00
github-actions[bot]
d3711437f6
chore: bump version to 1.9.0 [skip ci]
v1.9.0
2024-09-03 13:33:40 +00:00
Michele Dolfi
1de2e4f924
feat: export document pages as multimodal output ( #54 )
...
* feat: export document pages as multimodal output
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* create a single parquet output
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add loading into HF datasets library
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* renaming
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* cleanup
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-03 15:05:35 +02:00