* add raw table cells when no table structure model was used
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Add RichTableCell instance for tables with missing structure.
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
* Experimental code for repetition detection, VLLM Streaming
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update VLLM Streaming
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update VLLM inference code, CLI and VLM specs
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix generation and decoder args for HF model
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix vllm device args
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Cleanup
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Bugfixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Remove streaming VLLM for the moment
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add repetition StoppingCriteria for GraniteDocling/SmolDocling
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Make GenerationStopper base class and port for MLX
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add streaming support and custom GenerationStopper support for ApiVlmModel
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes for ApiVlmModel
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix api_image_request_streaming when GenerationStopper triggers.
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Move DocTagsRepetitionStopper to utility unit, update examples
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
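For illustration only, the idea behind a repetition-based stopping check can be sketched as follows. This is a minimal sketch and not the code from these commits: `has_repeated_tail`, `generate_until_repetition`, `window`, and `min_repeats` are hypothetical names, and the sketch works on a plain list of string tokens rather than on any model or inference API.

```python
from typing import Iterable, List, Sequence


def has_repeated_tail(tokens: Sequence[str], window: int = 4, min_repeats: int = 3) -> bool:
    """Return True if the last `window` tokens occur `min_repeats` times back to back."""
    needed = window * min_repeats
    if window <= 0 or min_repeats < 2 or len(tokens) < needed:
        return False
    tail = list(tokens[-needed:])
    pattern = tail[-window:]
    # Every consecutive chunk of the tail must equal the final chunk.
    return all(tail[i:i + window] == pattern for i in range(0, needed, window))


def generate_until_repetition(stream: Iterable[str], window: int = 4, min_repeats: int = 3) -> List[str]:
    """Collect tokens from an iterable and stop early once the tail starts looping."""
    out: List[str] = []
    for tok in stream:
        out.append(tok)
        if has_repeated_tail(out, window, min_repeats):
            break
    return out
```

Checking only the tail keeps the cost per new token bounded by `window * min_repeats`, which matters when the check runs on every generation step.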
* Rich tables support for HTML backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Decouple JATS backend from HTML backend; the way tables are created changed significantly
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* updated and added tests
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Refactored parse_table_data in html_backend into a few smaller functions
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Change scope of a few functions in html_backend.py, making them static where possible
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fix for HTML tables that have tbody and/or thead; these tables are now properly supported
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
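As background for the tbody/thead fix above: in HTML, `<tr>` rows may be direct children of `<table>` or be wrapped in `<thead>`, `<tbody>`, or `<tfoot>`, so a parser that only looks at direct children misses rows. The sketch below is not the html_backend code; it uses only the Python standard library, and `TableRowCollector` and `table_rows` are hypothetical names. It assumes a single, non-nested table.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableRowCollector(HTMLParser):
    """Collect cell text per <tr>, whether or not rows sit inside <thead>/<tbody>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside an open cell is kept.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(html: str) -> List[List[str]]:
    parser = TableRowCollector()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Because the collector reacts to `<tr>` wherever it appears, the wrapper elements need no special handling.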
* feat: add a backend parser for WebVTT files
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* docs: update README with VTT support
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* docs: add description to supported formats
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore: upgrade docling-core to unescape WebVTT in markdown
Pin the new release of docling-core 2.48.2.
Do not escape HTML reserved characters when exporting WebVTT documents to markdown.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* test: add missing copyright notice
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* updated the backend and pyproject.toml
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the version and test files
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* forgot to add 1 updated test-file
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* feat: Switch default layout model to DOCLING_LAYOUT_HERON. Update the unit test data.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Use default layout model in model_downloader default args
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update docling-models tag for TableFormer
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test GT (from linux CPU)
Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
* fix: Ensure that the visualisations happen on copies of the page image
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Pin docling-ibm-models to the fix branch for the ReadingOrderPredictor
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Update uv.lock
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Update tests GT to match the Heron layout model and the improved reading order model in Linux
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Introduce the optional verify_doctags parameter in conversion tests to control whether a doctags
comparison takes place. Skip doctags comparisons for certain tests.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Generate tests GT on Mac
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Remove the pinning of the docling-ibm-models and use the release 3.9.1
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
* Fix OCR bounding box misalignment caused by rotation metadata
Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>
* Add rotation-mismatch scanned pdf test case
Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>
* add ground truth for ocr_test_rotation_mismatch.pdf
Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>
* Updated test GT and merged from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix OCR test by excluding mismatched rotation example
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
* Add DocumentConverter.extract and full extraction pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add DocumentConverter.extract template arg
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add NuExtract model
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add Extraction pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add proper test, support pydantic class types
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add qr bill example
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add base_extraction_pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add types
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update typing of ExtractionResult and inner fields
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Factor out extract to DocumentExtractor
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Address mypy issues
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add DocumentExtractor
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Resolve circular import issue
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Clean up imports, remove Optional for template arg
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Move new type definitions into datamodel
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update comments
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Respect page-range, disable test_extraction for CI
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* chore(html): refactor parser to leverage context managers
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(html): parse inline code snippets, also from list items
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(html): remove hidden tags
Remove tags that are not meant to be displayed.
Add regression tests for code blocks, inline code, and hidden tags.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: exploring new version
* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 5815c8f81b0e5ce400332597b6795e5a97ecf775
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
* chore: autoformat
DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 5815c8f81b0e5ce400332597b6795e5a97ecf775
* feat: enable configurable runtime for rapidocr and handle new result better
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
* chore: fix linter
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
* chore: use new server model
* chore: change default engine type to onnx
* chore: tests update for new rapidocr
* fix: rebase from main and fix clashes
* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 5815c8f81b0e5ce400332597b6795e5a97ecf775
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 02f9db85f562e5cdfda40c52fee55cfd4030d70a
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a7bcb205faedb881f94a89b3bbd29cb31ccd54f0
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a39482a98cbcff7a825c8321134732af0c65930a
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 63e9d717fa26951566b02761f3fdfc752c31f805
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: ef12a6ec1ea2846a8a8e2e776eeaa59c2a0c4dfe
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 2222d2340387f8d9d66f3ca9d8e21a0945a44e7a
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: bc6a1dc507d7f146ec4797a2d3840414f46ac64d
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 56e0d67da7c57d4b5caf8eaef8dff7056c3efd32
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 871ca21271412006c76acf3c19426140efed3d50
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 7b1b77159da729d483a581a86c7309acba1712a7
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: a792a714a43e19a91b2b782f54621c1c5efda632
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: d1fed26323ff829b716bc667fe69532839363e45
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 346ec1cad943765f886e5d17fb0a54221124689c
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 4d0bbe5bd6e9f7261b97362ff8823af244267089
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 34a5ad53892a7064a6bf35f890d344d464c78b2f
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 9151959db3ad53535011d1cfdcf9181fdf936bb1
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 8ef5536f2c098826c6c0a05190f8a80614c3f3cb
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
* DCO Remediation Commit for Georg Heiler <georg.kf.heiler@gmail.com>
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 7e18637a35
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 63fb8ff599
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 0cb9444fb8
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: 38940d9978
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: b6d461ac42
I, Georg Heiler <georg.kf.heiler@gmail.com>, hereby add my Signed-off-by to this commit: ee55eb3408
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
---------
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
* fix(HTML): parse footer tag as a section in furniture
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(HTML): add test for body vs furniture in HTML parser.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* re-implement links for html backend.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
* fix inline groups in list items. write specific test for find_parent_annotation of _extract_text_and_hyperlink_recursively.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
* implement hack for images.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
---------
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
* feat: add convert_string to document-converter
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fix TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added tests for convert_string
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
fix(HTML): ensure correct concatenation of child strings in table cells and list items
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Fix a bug in parsing HTML tables in HTML backend.
Fix a bug in the test file that prevented JATS backend tests from running.
Ensure that the JATS backend creates headings with the right level.
Remove unnecessary data files for testing JATS backend.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* Update tests to use default PDF backend (DPv4)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* OCR tests use DPv1 until rotation bugs are fixed
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Establish layout_model spec and example instantiations
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updated naming
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Back to uppercase constants
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix deps issue with openai-whisper>numba>llvmlite
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pull v1 changed test GT from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Integrate ListItemMarkerProcessor into document assembly
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update to final version
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update all test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Upgrade deps
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
* test: add test file and case for fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
* test: added groundtruth test files for fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
* fix: extraneous empty paragraphs for test files
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
---------
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
Co-authored-by: Michael Krissgau <michael.krissgau@ibm.com>
When the page_range parameter is used for formula conversion,
the system throws a "list index out of range" error.
Tests are included to validate that the fix works.
Signed-off-by: Masum <masumsofts@yahoo.com>
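The commit message above does not state the cause of the error. One common source of this class of bug, shown here purely as an assumption, is indexing a range-filtered page list by absolute page number. In the sketch below, `pages_by_number` is a hypothetical helper and pages are plain strings; nothing here reflects the actual fix.

```python
from typing import Dict, List, Tuple


def pages_by_number(pages: List[str], page_range: Tuple[int, int]) -> Dict[int, str]:
    """Map 1-based page numbers inside page_range (inclusive) to their pages."""
    start, end = page_range
    first = max(start, 1)
    last = min(end, len(pages))
    return {no: pages[no - 1] for no in range(first, last + 1)}


all_pages = ["p1", "p2", "p3", "p4", "p5"]
selected = pages_by_number(all_pages, (3, 4))

# A plain filtered list ["p3", "p4"] has length 2, so filtered[3 - 1] would raise
# IndexError. The dict lookup stays valid because its keys are page numbers.
page_three = selected[3]
```

Keying by page number rather than by list position keeps lookups correct regardless of where the range starts.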