* fix: Ensure proper image_scale is used for generated page images in layout+vlm pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Ensure proper image_scale output in default VLM pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Fix handling of p elements that contain block-level elements anywhere inside, matching what browsers do.
Fix wrong type annotations.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* Example of how to apply OCR to a Docling Document as a post-processing step with "nanonets-ocr2-3b" via LM Studio
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added support for elements with multiple provenances
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* cleaning up
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* improved prompt for nanonets-ocr2-3b
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* cleaning up
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* excluded example from CI
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* updated class name
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Improved usability of the example, added simple cli, and some helper functions
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fix api_image_request usage
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix pydantic errors
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Improvements and corrections
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added string sanitization: remove line breaks from remote OCR output and preserve the original text from JSON
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added quick and reliable detection of empty image crops (elements, table cells, form items); these are not sent to OCR
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Example respects ocr_documents.txt, tuned empty crop detection
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* cleaning api_image_request
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
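The empty-crop check mentioned above could look roughly like the sketch below. It is an illustration only: the function name `is_empty_crop`, the flat grayscale-list input, and the `max_spread` threshold are assumptions, not the code used in the example script.

```python
def is_empty_crop(pixels, max_spread=8):
    """Return True when a grayscale crop shows almost no variation.

    `pixels` is a flat list of 0-255 intensities (hypothetical input format).
    A crop with no pixels, or whose brightest and darkest pixels differ by at
    most `max_spread`, is treated as blank and can be skipped before OCR.
    """
    if not pixels:
        return True
    return max(pixels) - min(pixels) <= max_spread


blank_cell = [250, 252, 251, 253] * 16  # near-uniform background
text_cell = [250, 30, 245, 12] * 16     # dark strokes on a light background

# Only crops that are not blank would be sent to the remote OCR service.
crops_to_ocr = [c for c in (blank_cell, text_cell) if not is_empty_crop(c)]
```

A min/max spread is cheap to compute, which suits a pre-filter that runs on every element, table cell, and form item; a real implementation may need a more robust statistic for noisy scans.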
* feat: Scaffolding for layout and table model plugin factory
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add missing files
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add base options classes for layout and table
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* feat(experimental): Add experimental TableCropsLayoutModel
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add example
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix #2250: list items after numbered headers
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add test for new case
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* chore(docx): remove unnecessary check
Remove 'current_parent is None' check in '_add_list_item' function since it
will always be None.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: Scaffolding for layout and table model plugin factory
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add missing files
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add base options classes for layout and table
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Respect document_timeout in new threaded StandardPdfPipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* add test case to test_options
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Make sure unprocessed pages are not getting into assemble_document
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
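A minimal sketch of the timeout behaviour described above, assuming a per-document deadline: pages not finished before the deadline are left out of assembly. Only the `document_timeout` name comes from the commits; `process_with_timeout`, `process_page`, and the return shape are hypothetical, and the sketch is single-threaded for brevity.

```python
import time


def process_with_timeout(pages, process_page, document_timeout=None):
    """Process pages in order until the deadline; return (processed, timed_out).

    Only pages that were actually processed are returned, so a later
    assembly step never sees unprocessed pages.
    """
    start = time.monotonic()
    processed = []
    for page in pages:
        elapsed = time.monotonic() - start
        if document_timeout is not None and elapsed >= document_timeout:
            return processed, True
        processed.append(process_page(page))
    return processed, False


# No timeout: every page is processed.
done, timed_out = process_with_timeout([1, 2, 3], lambda p: p * 10)

# A zero-second budget expires before the first page is touched.
expired, expired_flag = process_with_timeout(
    [1, 2, 3], lambda p: p * 10, document_timeout=0
)
```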
* docs(examples): update the set up of Milvus Lite
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore: remove references to deprecated save_as_document_tokens
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: added save_as_json and load_from_json to ConversionResult
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added a test
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the save and load for ConversionResult
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the signature
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* refactored load/save into ConversionAssets
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the DoclingVersion class
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* renamed time_stamp to timestamp
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fix: In DocumentConverter.convert_string() make nullable name parameter actually optional
* DCO Remediation Commit for Cristi Burcă <mail@scribu.net>
I, Cristi Burcă <mail@scribu.net>, hereby add my Signed-off-by to this commit: 2b256e3528
Signed-off-by: Cristi Burcă <mail@scribu.net>
---------
Signed-off-by: Cristi Burcă <mail@scribu.net>
* add example processing parquet file of images
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* vlm using vllm api
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* use openvino and add more docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add default input file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* change default to standard for running in CI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* use simple rapidocr without openvino in the CI example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* feat: add the Image backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the pre-commit
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Fixed single- versus multi-frame image formats
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fix: Proper usage of ImageDocumentBackend in the pipeline, deprecate old code.
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Adapt tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: correct mets_gbs backend test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Make ImagePageBackend.get_bitmap_rects() yield
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
* adding granite-docling preview
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the model specs
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Add Layout+VLM pipeline with prompt injection, ApiVlmModel updates
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update layout injection, move to experimental
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Adjust defaults
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Map Layout+VLM pipeline to GraniteDocling
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Remove base_prompt from layout injection prompt
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Reinstate custom prompt
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* add demo_layout file that produces results with vs. without layout injection
Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
* feat: wrap vlm_inference around process_images
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
* feat: carry input prompt + number of input tokens
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
* fix: adapt example to run on local test file
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
* fix: example now expects single document
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
* feat: add layout example to EXAMPLES_TO_SKIP
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
* feat: address comments on git
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
* feat: add inference wrapper for hf_transformers + carry input prompt
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
* Feat: add track_input_prompt to ApiVlmOptions, and track input prompt as part of api vlm
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
* fix: Ensure backward-compatible build_prompt by adding _internal_page ag
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Ensure backward-compatible build_prompt by adding _internal_page ag
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes for demo
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Typing fixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Restoring lost changes in vllm_model
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Restoring vlm_pipeline_api_model example
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: ElHachem02 <peterelhachem02@gmail.com>
* fix(docx): parse page headers and footers
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): rename _add_header to _add_heading
To avoid confusion, rename the _add_header function to _add_heading
since the function is about adding section headings.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): extend the page header and footer parsing to any content type
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): fix _add_header_footer function
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): remove unnecessary import
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): simplify parsing of simple tables
Simplify the parsing of tables with just text (no rich cells).
Move nested function group_cell_elements out of _handle_tables for readability.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): reuse method for finding inline pictures
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): format strikethrough text
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(docx): use fixtures to avoid converting same file multiple times
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(docx): remove unnecessary argument docx_obj in functions
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(docx): add test for rich table cells
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): small improvements in backend and its unit tests
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(docx): parse superscript and subscript formatted text
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(html): simplify parsing of simple table cells
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(html): add test for rich table cells
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(html): ensure table cells with formatted text are parsed as RichTableCell
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(html): simplify process_rich_table_cells since only rich cells are processed
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(html): formatted cell runs should be parsed as text items respecting the order
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore: pin latest docling-core and update uv.lock
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore: upgrade dependencies on uv.lock
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: add enum StopReason and use it in VlmPrediction
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
* add vlm_inference time for api calls and track stop reason
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
* fix: rename enum to VlmStopReason
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
* Propagate partial success status if page reaches max tokens
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
* feat: a page whose generation is stopped by the loop detector creates a partial success status
Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
* Add hint for future improvement
Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
* fix: remove vlm_stop_reason from extracted page data, add UNSPECIFIED state as VlmStopReason to avoid null value
Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
---------
Signed-off-by: ElHachem02 <peterelhachem02@gmail.com>
Signed-off-by: Peter El Hachem <peter.el.hachem@ibm.com>
Co-authored-by: Peter El Hachem <peter.el.hachem@ibm.com>
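One way to picture the stop-reason tracking described above is sketched below. `VlmStopReason` and its `UNSPECIFIED` state are named in the commits; the other members, the string values, and the `page_status` helper are assumptions for illustration only.

```python
from enum import Enum


class VlmStopReason(str, Enum):
    LENGTH = "length"                    # assumed: generation hit the max-token limit
    END_OF_SEQUENCE = "end_of_sequence"  # assumed: generation ended normally
    UNSPECIFIED = "unspecified"          # default state, so the field is never null


def page_status(stop_reason=VlmStopReason.UNSPECIFIED):
    """Map a stop reason to a coarse status string (hypothetical helper)."""
    if stop_reason is VlmStopReason.LENGTH:
        # A page cut off at the token limit is only partially converted.
        return "partial_success"
    return "success"
```

Defaulting to `UNSPECIFIED` rather than `None` keeps the field's type uniform, so callers can compare against enum members without a null check.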
* fix(ocr): use PSM integer values directly instead of constructor
- Use integer psm value directly instead of calling tesserocr.PSM()
- Fixed in both main_psm and script_readers initialization
- tesserocr.PSM is a class with integer constants, not an enum
Fixes #2576
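The difference between an enum and a class of integer constants can be sketched without tesserocr installed. `FakePSM` and `resolve_psm` below are hypothetical stand-ins, not tesserocr's API; they only show why calling such a class as a constructor fails while passing the integer through works.

```python
class FakePSM:
    """Hypothetical stand-in: a plain class that only holds integer constants."""

    AUTO = 3
    SINGLE_BLOCK = 6


def resolve_psm(psm):
    """Return the page segmentation mode as the plain integer it already is."""
    if not isinstance(psm, int):
        raise TypeError(f"psm must be an int, got {type(psm).__name__}")
    return psm


# Calling the class like an enum lookup fails, because it takes no arguments.
try:
    FakePSM(6)
    constructor_ok = True
except TypeError:
    constructor_ok = False

# Using the integer value directly is the approach the fix describes.
main_psm = resolve_psm(FakePSM.SINGLE_BLOCK)
```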
* DCO Remediation Commit for mulgyeol <mulgyeoljung@gmail.com>
I, mulgyeol <mulgyeoljung@gmail.com>, hereby add my Signed-off-by to this commit: da63a17a3c
Signed-off-by: mulgyeol <mulgyeoljung@gmail.com>
---------
Signed-off-by: mulgyeol <mulgyeoljung@gmail.com>