Commit Graph

39 Commits

Author SHA1 Message Date
Panos Vagenas
5f1bd9e9c8
docs: simplify LlamaIndex example using Docling extension (#135)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-09 22:17:56 +02:00
Panos Vagenas
6924999f1f
chore: explicitly manage pandas dependency (#134)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-09 14:50:39 +02:00
Michele Dolfi
f96ea86a00
feat: add options for choosing OCR engines (#118)
---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
2024-10-08 19:07:08 +02:00
Maxim Lysak
2422f706a1
feat: new torch-based docling models (#120)
---------

Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-03 18:42:33 +02:00
Michele Dolfi
d44c62d7ce
feat: windows support (#122)
* feat: windows support

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add Windows in README

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-03 14:23:47 +02:00
Michele Dolfi
34bd887a7f
fix: allow usage of opencv 4.6.x (#110)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-27 15:51:43 +02:00
Christoph Auer
d6df76f90b
feat: Support tableformer model choice (#90)
* Support tableformer model choice

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update datamodel structure

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update docs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add test unit for table options

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Ensure import backwards-compatibility for PipelineOptions

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update README

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Adjust parameters on custom_convert

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Update Dockerfile

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-09-26 21:37:08 +02:00
Panos Vagenas
39977b5631
chore: move examples extras to respective group (#103)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-25 15:47:48 +02:00
Michele Dolfi
6a03c208ec
feat: add figure in markdown (#98)
* feat: add figures in markdown

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update to new docling-core and update test results with figures

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update with improved docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-24 17:28:23 +02:00
Peter W. J. Staar
4794ce460a
fix: updated the render_as_doctags with the new arguments from docling-core (#93)
* updated the render_as_doctags with the new arguments from docling-core

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ensuring that docling-core is >1.5.0 to accomodate with the latest export-to-doctags parameters

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the doctags tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the README

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fix poetry lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Fix formatting problems

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fixed the doctag export in docling/utils/export.py

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* propagate xsize and ysize

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-23 20:12:18 +02:00
Michele Dolfi
f19bd43798
feat: add table exports (#86)
* feat: expose docling-core table exporters and add examples

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove temp internal implementation of html export

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin latest docling-core 1.4.0 with table exports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-18 08:44:13 +02:00
Peter W. J. Staar
442443a102
fix: bumped the glm version and adjusted the tests (#83)
* bumped the glm version and adjusted the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the poetry lock

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fix hooks

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fixed the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the tests for tables

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-18 07:43:49 +02:00
Nikos Livathinos
fa9699fa3c
fix(tests): Adjust the test data to match the new version of LayoutPredictor (#82)
* fix(tests): Adjust the test data to match the new version of LayoutPredictor from docling-ibm-models

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update poetry to use `docling-ibm-models` at version `v1.2.0`

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-09-17 15:50:35 +02:00
Peter W. J. Staar
98990784df
feat: add docling cli (#75)
* chore: add simple convert script

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted all

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted all

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added default arg

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* use typer for the docling CLI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* describe output when saving

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add tests for CLI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add export options

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-13 14:03:09 +02:00
Michele Dolfi
8aa476ccd3
test: improve typing definitions (part 1) (#72)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-12 15:56:29 +02:00
Panos Vagenas
53569a1023
docs: showcase RAG with LlamaIndex and LangChain (#71)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-11 15:07:08 +02:00
Peter W. J. Staar
bdfdfbf092
feat: adding txt and doctags output (#68)
* feat: adding txt and doctags output

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* cleaned up the export

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Fix datamodel usage for Figure

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* updated all the examples to deal with new rendering

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-10 17:30:52 +02:00
Michele Dolfi
27a7a152e1
feat: linux arm64 support and reducing dependencies (#69)
* feat: linux arm64 support and reducing dependencies

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* downgrade pyarrow for wider support

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-10 15:43:27 +02:00
Michele Dolfi
1de2e4f924
feat: export document pages as multimodal output (#54)
* feat: export document pages as multimodal output

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* create a single parquet output

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add loading into HF datasets library

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* renaming

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* cleanup

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-03 15:05:35 +02:00
Peter W. J. Staar
48f4d1ba52
fix: Add unit tests (#51)
* add the pytests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* renamed the test folder and added the toplevel test

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the toplevel function test

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* need to start running all tests successfully

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the reference converted documents

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added first test for json and md output

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ran pre-commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* replaced deprecated json function with model_dump_json

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* replaced deprecated json function with model_dump_json

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Fix backend tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* commented out the drawing

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ci: avoid duplicate runs

Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

* commented out json verification for now

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added verification of input cells

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformat code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added test to verify the cells in the pages

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added test to verify the cells in the pages (2)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added test to verify the cells in the pages (3)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* run all examples in CI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* make sure examples return failures

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* raise a failure if examples fail

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix examples

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* run examples after tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Add tests and update top_level_tests using only datamodels

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove unnecessary code

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Validate conversion status on e2e test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* package verify utils and add more tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* reduce docs in example, since they are already in the tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* skip batch_convert

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin docling-parse 1.1.2

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* updated the error messages

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* commented out the json verification for now

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* bumped GLM version

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Fix lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pin new docling-parse v1.1.3

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-30 14:08:20 +02:00
Michele Dolfi
f49ee825c3
fix: table cells overlap and model warnings (#53)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-28 12:30:42 +02:00
Christoph Auer
7e84533299
fix: Upgrade docling-parse to 1.1.1, safety checks for failed parse on pages (#45)
* Put safety-checks for failed parse of pages

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Bump to docling-parse 1.1.1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-23 12:51:02 +02:00
Christoph Auer
a8c6b29a67
feat: Upgrade docling-parse PDF backend and interface to use page-by-page parsing (#44)
* Use docling-parse page-by-page

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Propagate document_hash to PDF backends, use docling-parse 1.0.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Upgrade lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* repin after more packages on pypi

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-22 13:49:37 +02:00
Michele Dolfi
69952682ed
fix: remove [ocr] extra to fix wheel install (#42)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-22 09:25:19 +02:00
Christoph Auer
f19871a5a1
fix: Add scipy as dependency (#40)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-21 17:21:02 +02:00
Christoph Auer
4a1ceaf65c
Update docling-ibm-models to v1.1.2 (#39)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-21 17:12:38 +02:00
Christoph Auer
e94d317c02
feat: Add adaptive OCR, factor out treatment of OCR areas and cell filtering (#38)
* Introduce adaptive OCR

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Factor out BaseOcrModel, add docling-parse backend tests, fixes

* Make easyocr default dep

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-20 15:28:03 +02:00
Michele Dolfi
349b0e914f
fix: allow newer torch versions (#34)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-14 13:37:36 +02:00
Michele Dolfi
90dd676422
feat: update parser with bytesio interface and set as new default backend (#32)
* update parser with bytesio interface

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* change default backend

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update DEFAULT_BACKEND

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-14 12:30:00 +02:00
Michele Dolfi
79ef8d2f2f
fix: update (vuln) deps (#29)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-07 17:29:36 +02:00
Maxim Lysak
b8f5e38a8c
feat: introducing docling_backend (#26)
Uses our own docling_parse to reliably get PDF cells
To get page images, this backend uses pypdfium2

Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
2024-08-07 16:22:36 +02:00
Panos Vagenas
d2d9543415
fix: set page number using 1-based indexing (#22)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-07-31 14:28:44 +02:00
Panos Vagenas
d603137383
feat: add simplified single-doc conversion (#20)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-07-26 16:55:33 +02:00
Michele Dolfi
54b3dda141
fix: add easyocr to main deps for valid extra (#19)
* fix: add easyocr to main deps for valid extra

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove group

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-07-24 14:11:26 +02:00
Michele Dolfi
b0725e0aa6
fix: expose ocr as extra (#18)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-07-24 11:14:17 +02:00
Michele Dolfi
7bc20adc16
pin docling-ibm-models 1.1.0 with python 3.10 support (#15)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-07-18 17:27:48 +02:00
Panos Vagenas
eb0b208272
chore: switch to docling-core Markdown export (#14)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-07-18 16:10:05 +02:00
Michele Dolfi
fb72688ff7
feat: enable python 3.12 support by updating glm (#8)
* update deepsearch-glm for python 3.12 support

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* enable python 3.12 in ci tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-07-17 14:03:26 +02:00
Christoph Auer
e2d996753b Initial commit 2024-07-15 09:42:42 +02:00