* Experimental code for repetition detection, VLLM Streaming
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update VLLM Streaming
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update VLLM inference code, CLI and VLM specs
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix generation and decoder args for HF model
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix vllm device args
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Cleanup
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Bugfixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Remove streaming VLLM for the moment
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add repetition StoppingCriteria for GraniteDocling/SmolDocling
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Make GenerationStopper base class and port for MLX
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add streaming support and custom GenerationStopper support for ApiVlmModel
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes for ApiVlmModel
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes for ApiVlmModel
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix api_image_request_streaming when GenerationStopper triggers.
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Move DocTagsRepetitionStopper to utility unit, update examples
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Rich tables support for HTML backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Decoupling JATS backend from HTML backend, ways of creating tables changed significantly
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* updated and added tests
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Refactored parse_table_data in html_backend into few smaller functions
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Changing scope of few functions in html_backend.py, making them static, when possible
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fix for HTML tables that have tbody and/or thead, now these tables are also properly supported
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
* feat: add a backend parser for WebVTT files
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* docs: update README with VTT support
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* docs: add description to supported formats
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore: upgrade docling-core to unescape WebVTT in markdown
Pin the new release of docling-core 2.48.2.
Do not escape HTML reserved characters when exporting WebVTT documents to markdown.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* test: add missing copyright notice
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* docs: add an example of RAG with OpeanSearch
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore: pin latest docling-core and update uv.lock
Pin latest version release of docling-core in pyproject.toml
Update the dependencies in uv.lock file
Run the notebook rag_opensearch.ipynb to pick up changes from docling-core
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* updated the backend and pyproject.toml
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the version and test files
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* forgot to add 1 updated test-file
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* feat: Switch default layout model to DOCLING_LAYOUT_HERON. Update the unit test data.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Use default layout model in model_downloader default args
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Use default layout model in model_downloader default args
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update docling-models tag for TableFormer
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test GT (from linux CPU)
Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
* fix: Ensure that the visualisations happen on copies of the page image
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Pinpoint docling-ibm-models to the fix branch for the ReadingOrderPredictor
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Update uv.lock
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Update tests GT to match the Heron layout model and the improved reading order model in Linux
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Introduce the verify_doctags optional parameter in conversion tests to control if a doctags
comparison should take place. Skip doctags comparisons for certain tests.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Generate tests GT on Mac
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Remove the pinning of the docling-ibm-models and use the release 3.9.1
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
* Fix OCR bounding box misalignment caused by rotation metadata
Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>
* Add rotation-mismatch scanned pdf test case
Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>
* add ground truth for ocr_test_rotation_mismatch.pdf
Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>
* add ground truth for ocr_test_rotation_mismatch.pdf
Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>
* Updated test GT and merged from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix OCR test by excluding mismatched rotation example
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
* Add DocumentConverter.extract and full extraction pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add DocumentConverter.extract template arg
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add NuExtract model
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add Extraction pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add proper test, support pydantic class types
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add qr bill example
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add base_extraction_pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add types
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update typing of ExtractionResult and inner fields
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Factor out extract to DocumentExtractor
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Address mypy issues
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add DocumentExtractor
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Resolve circular import issue
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Clean up imports, remove Optional for template arg
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Move new type definitions into datamodel
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update comments
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Respect page-range, disable test_extraction for CI
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* chore(html): refactor parser to leverage context managers
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(html): parse inline code snippets, also from list items
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(html): remove hidden tags
Remove tags that are not meant to be displayed.
Add regression tests for code blocks, inline code, and hidden tags.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: exploring new version
* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 5815c8f81b0e5ce400332597b6795e5a97ecf775
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
* chore: autoformat
DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 5815c8f81b0e5ce400332597b6795e5a97ecf775
* feat: enable configurable runtime for rapidocr and handle new result better;
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
* chore: fix linter
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
* chore: use new server model
* chore: change default engine type to onnx
* chore: tests update for new rapidocr
* fix: rebase from main and fix clashes
* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 5815c8f81b0e5ce400332597b6795e5a97ecf775
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 02f9db85f562e5cdfda40c52fee55cfd4030d70a
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a7bcb205faedb881f94a89b3bbd29cb31ccd54f0
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a39482a98cbcff7a825c8321134732af0c65930a
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 63e9d717fa26951566b02761f3fdfc752c31f805
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: ef12a6ec1ea2846a8a8e2e776eeaa59c2a0c4dfe
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 2222d2340387f8d9d66f3ca9d8e21a0945a44e7a
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: bc6a1dc507d7f146ec4797a2d3840414f46ac64d
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 56e0d67da7c57d4b5caf8eaef8dff7056c3efd32
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 871ca21271412006c76acf3c19426140efed3d50
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 7b1b77159da729d483a581a86c7309acba1712a7
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a792a714a43e19a91b2b782f54621c1c5efda632
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: d1fed26323ff829b716bc667fe69532839363e45
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 346ec1cad943765f886e5d17fb0a54221124689c
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 4d0bbe5bd6e9f7261b97362ff8823af244267089
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 34a5ad53892a7064a6bf35f890d344d464c78b2f
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 9151959db3ad53535011d1cfdcf9181fdf936bb1
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 8ef5536f2c098826c6c0a05190f8a80614c3f3cb
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 7e18637a35
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 63fb8ff599
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 0cb9444fb8
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 38940d9978
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: b6d461ac42
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: ee55eb3408
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
---------
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>