Commit Graph

55 Commits

Author SHA1 Message Date
Christoph Auer
497ddb34a8 Big refactoring for legacy_document support
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-14 16:36:11 +02:00
Michele Dolfi
ddb509628e use do_ flag in pipeline_options
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-13 16:54:46 +02:00
Michele Dolfi
c1ed447c21 propagate raises, add enrichment model, some renaming
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-13 16:03:19 +02:00
Michele Dolfi
7f10a546d3 Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 17:04:01 +02:00
Michele Dolfi
98f1a4597e rename and refactor *model*
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 16:57:40 +02:00
Christoph Auer
6efcf0a5a5 Add image format support to PdfBackend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-11 16:47:15 +02:00
Christoph Auer
d0fccb9342 Merge from simplify-conv-api
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-11 15:57:08 +02:00
Christoph Auer
95c1f80087 Change code to use unordered/ordered list, robustifications
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-11 14:53:38 +02:00
Panos Vagenas
136f16e85a
feat!: simplify conversion API (#139)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-11 14:52:37 +02:00
Christoph Auer
3ee97c42b2 Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction 2024-10-11 12:57:56 +02:00
Christoph Auer
52713f0cf5 Optionally produce legacy_doc
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-11 12:57:47 +02:00
Michele Dolfi
331ab36f04 Merge remote-tracking branch 'origin/main' into cau/input-format-abstraction
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 11:23:04 +02:00
Christoph Auer
025983f07b Backend error handling fixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-11 11:18:47 +02:00
Christoph Auer
304d16029a More renaming, design enrichment interface
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-11 10:21:31 +02:00
Michele Dolfi
051beae203 use new interface in minimal example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 08:30:09 +02:00
Michele Dolfi
3794f8245e add example PNG
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-10 18:29:26 +02:00
Christoph Auer
7cad290ceb Refactor test data, legacy usage and more
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-10 13:54:44 +02:00
Panos Vagenas
5f1bd9e9c8
docs: simplify LlamaIndex example using Docling extension (#135)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-09 22:17:56 +02:00
Christoph Auer
b5a27386c1 Merge from main, update OCR model and test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-09 16:04:19 +02:00
Christoph Auer
0dfbd0b6fc Update examples and test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-09 15:20:27 +02:00
Michele Dolfi
f96ea86a00
feat: add options for choosing OCR engines (#118)
---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
2024-10-08 19:07:08 +02:00
Christoph Auer
080042d06d Merge from upstream
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-08 16:40:55 +02:00
Christoph Auer
203cf19b1b Lots of improvements
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-08 16:38:42 +02:00
Maxim Lysak
07d952acf9 Improved backends
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-08 16:37:47 +02:00
Christoph Auer
c0447206af Merge from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-08 14:42:33 +02:00
Maxim Lysak
89e58ca730 Added HTML backend implementation, few improvements for other backends
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-08 11:14:44 +02:00
Maxim Lysak
f773d8a621 Improved demo code, that saves output mds to files
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-07 17:25:17 +02:00
Maxim Lysak
bea9fc22af Added mspowerpoint backend first implementation, improvements on msword backend
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-07 14:55:21 +02:00
Maxim Lysak
cefc34e8d8 Working on a first version of DOCX native backend
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-04 18:19:40 +02:00
Rui Dias Gomes
dde0aff8bd
update examples (#123)
Signed-off-by: rmdg88 <rmdg88@gmail.com>
2024-10-03 14:28:25 +02:00
Christoph Auer
1fa7cd9855 Fundamental refactoring for multi-format support
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-01 16:54:09 +02:00
Christoph Auer
d6df76f90b
feat: Support tableformer model choice (#90)
* Support tableformer model choice

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update datamodel structure

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update docs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add test unit for table options

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Ensure import backwards-compatibility for PipelineOptions

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update README

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Adjust parameters on custom_convert

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Update Dockerfile

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-09-26 21:37:08 +02:00
Christoph Auer
ba9d115f64 Examples: Don't export experimental output by default
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-25 15:56:29 +02:00
Christoph Auer
33373ac0dd Switch everything to use label enum, and more
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-24 16:00:39 +02:00
Christoph Auer
867e06f9f2 Merge from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-24 12:05:17 +02:00
Panos Vagenas
f555815343
chore: add RAG notebook titles (#101)
[skip ci]

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-24 09:17:46 +02:00
Christoph Auer
12477c8cac Lots of import refactoring
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-23 12:22:49 +02:00
Christoph Auer
ac51a09065 Put stub for experimental format export
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-20 11:08:30 +02:00
Michele Dolfi
f19bd43798
feat: add table exports (#86)
* feat: expose docling-core table exporters and add examples

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove temp internal implementation of html export

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin latest docling-core 1.4.0 with table exports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-18 08:44:13 +02:00
Panos Vagenas
53569a1023
docs: showcase RAG with LlamaIndex and LangChain (#71)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-11 15:07:08 +02:00
Peter W. J. Staar
bdfdfbf092
feat: adding txt and doctags output (#68)
* feat: adding txt and doctags output

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* cleaned up the export

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Fix datamodel usage for Figure

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* updated all the examples to deal with new rendering

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-10 17:30:52 +02:00
Michele Dolfi
1de2e4f924
feat: export document pages as multimodal output (#54)
* feat: export document pages as multimodal output

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* create a single parquet output

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add loading into HF datasets library

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* renaming

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* cleanup

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-03 15:05:35 +02:00
Peter W. J. Staar
48f4d1ba52
fix: Add unit tests (#51)
* add the pytests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* renamed the test folder and added the toplevel test

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the toplevel function test

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* need to start running all tests successfully

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the reference converted documents

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added first test for json and md output

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ran pre-commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* replaced deprecated json function with model_dump_json

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* replaced deprecated json function with model_dump_json

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Fix backend tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* commented out the drawing

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ci: avoid duplicate runs

Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

* commented out json verification for now

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added verification of input cells

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformat code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added test to verify the cells in the pages

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added test to verify the cells in the pages (2)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added test to verify the cells in the pages (3)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* run all examples in CI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* make sure examples return failures

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* raise a failure if examples fail

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix examples

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* run examples after tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Add tests and update top_level_tests using only datamodels

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove unnecessary code

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Validate conversion status on e2e test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* package verify utils and add more tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* reduce docs in example, since they are already in the tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* skip batch_convert

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin docling-parse 1.1.2

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* updated the error messages

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* commented out the json verification for now

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* bumped GLM version

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Fix lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pin new docling-parse v1.1.3

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-30 14:08:20 +02:00
Panos Vagenas
e46a66a176
fix: refine conversion result (#52)
- fields `output` & `assembled` need not be optional
- introduced "synonym" `ConversionResult` for `ConvertedDocument` & deprecated the latter

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-08-27 11:50:43 +02:00
Michele Dolfi
8cc147bc56
fix: align output formats (#49)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-26 13:30:26 +02:00
Christoph Auer
a294b7e64a
feat: Page-level error reporting from PDF backend, introduce PARTIAL_SUCCESS status (#47)
* Put safety-checks for failed parse of pages

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Introduce page-level error checks

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Bump to docling-parse 1.1.1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Introduce page-level error checks

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-23 16:18:41 +02:00
Christoph Auer
a8c6b29a67
feat: Upgrade docling-parse PDF backend and interface to use page-by-page parsing (#44)
* Use docling-parse page-by-page

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Propagate document_hash to PDF backends, use docling-parse 1.0.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Upgrade lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* repin after more packages on pypi

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-22 13:49:37 +02:00
Michele Dolfi
78347bf679
feat: allow computing page images on-demand with scale and cache them (#36)
* feat: allow computing page images on-demand and cache them

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* feat: expose scale for export of page images and document elements

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix comment

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-20 13:27:19 +02:00
Christoph Auer
c253dd743a
Add redbooks to test data, small additions (#35)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-20 12:36:00 +02:00
Michele Dolfi
63d80edca2
feat: output page images and extracted bbox (#31)
* Add assemble options and example saving pages and figures

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add options for different page elements, improve example and flip name of assemble_options

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-12 18:25:45 +02:00