Ayraf
df140227c3
feat: support xlsm files ( #1520 )
...
* code for xlsm support
* updated support for xlsm
* updated code for xlsm support
* Update docling_parse_v4_backend.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update docling_parse_v4_backend.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update test_backend_msexcel_xlsm.py
updated the tests/test_backend_msexcel_xlsm.py:
have a function starting with test
removed all print statements
** To add an explicit assert {test}=={pred}
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update base_models.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update test_backend_msexcel.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update test_backend_msexcel_xlsm.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update document_converter.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Delete tests/test_backend_msexcel_xlsm.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* xlsm file
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* run tests
* ran tests
* Fix tests, upgrade XSLM example to a valid file
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-10 16:55:59 +02:00
Peter W. J. Staar
6613b9e98b
fix: prov for merged-elems ( #1728 )
...
* fix: prov for merged-elems
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* Reset pyproject.toml
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fix tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-10 11:22:42 +02:00
Maras Ioannis
e979750ce9
fix(tesseract): initialize df_osd to avoid uninitialized variable error ( #1718 )
...
* fix: initialize df_osd to avoid uninitialized variable error
Signed-off-by: IoannisMaras <maras2002@gmail.com >
* Fix formatting
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
* Satisfy mypy, regenerate OCR tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: IoannisMaras <maras2002@gmail.com >
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-10 10:57:45 +02:00
Michele Dolfi
f7f31137f1
fix: allow custom torch_dtype in vlm models ( #1735 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-06-10 10:52:15 +02:00
AndrewTsai0406
9dbcb3d7d4
fix: Improve extraction from textboxes in Word docs ( #1701 )
...
* fix/docx_text_box_extraction
Signed-off-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local >
* fix/docx_text_box_extraction
Signed-off-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local >
---------
Signed-off-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local >
Co-authored-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local >
2025-06-06 11:37:46 +02:00
Eugene
a2b83fe4ae
fix: Add WEBP to the list of image file extensions ( #1711 )
...
feat: Add WEBP to the list of image file extensions
Signed-off-by: Eugene <fogaprod@gmail.com >
2025-06-05 09:09:27 +02:00
Peter W. J. Staar
cfdf4cea25
feat: new vlm-models support ( #1570 )
...
* feat: adding new vlm-models support
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the transformers
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* got microsoft/Phi-4-multimodal-instruct to work
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* working on vlm's
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* refactoring the VLM part
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* all working, now serious refacgtoring necessary
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* refactoring the download_model
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added the formulate_prompt
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* pixtral 12b runs via MLX and native transformers
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added the VlmPredictionToken
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* refactoring minimal_vlm_pipeline
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the MyPy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added pipeline_model_specializations file
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* need to get Phi4 working again ...
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* finalising last points for vlms support
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the pipeline for Phi4
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* streamlining all code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixing the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added the html backend to the VLM pipeline
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the static load_from_doctags
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* restore stable imports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* use AutoModelForVision2Seq for Pixtral and review example (including rename)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* remove unused value
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* refactor instances of VLM models
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* skip compare example in CI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* use lowercase and uppercase only
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add new minimal_vlm example and refactor pipeline_options_vlm_model for cleaner import
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* rename pipeline_vlm_model_spec
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* move more argument to options and simplify model init
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add supported_devices
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* remove not-needed function
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* exclude minimal_vlm
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* missing file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add message for transformers version
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* rename to specs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* use module import and remove MLX from non-darwin
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* remove hf_vlm_model and add extra_generation_args
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* use single HF VLM model class
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* remove torch type
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add docs for vision models
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-06-02 17:01:06 +02:00
Cesar Berrospi Ramis
984cb137f6
fix: guess HTML content starting with script tag ( #1673 )
...
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-06-02 08:43:24 +02:00
Cesar Berrospi Ramis
3942923125
chore: fix or ignore runtime and deprecation warnings ( #1660 )
...
* chore: fix or catch deprecation warnings
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* chore: update poetry lock with latest docling-core
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-05-28 17:55:31 +02:00
Peter W. J. Staar
b356b33059
feat: Add visualization of bbox on page with html export. ( #1663 )
...
* feat: Add visualization of bbox on page with html export.
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the cli
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the cli argument to show_layout
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
2025-05-28 13:10:38 +02:00
DavidLee
51d3450915
fix: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte ( #1665 )
...
Update document.py
fix: when mime not "application/xml" or "text/plain" raise
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte
Signed-off-by: DavidLee <yongsheng_li@foxmail.com >
2025-05-27 14:06:05 +02:00
Said Gürbüz
c2f595d283
fix: fix ZeroDivisionError for cell_bbox.area() ( #1636 )
...
fix ZeroDivisionError for cell_bbox.area()
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
2025-05-22 13:43:33 +02:00
Clément Doumouro
45265bf8b1
feat(ocr): auto-detect rotated pages in Tesseract ( #1167 )
...
* fix(ocr): tesseract support mis-oriented documents
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): update missing test data
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): rotate image to the natural orientation before layout prediction
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): move bounding bow rotation util to orientation.py
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): refactor rotation utilities
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* chore(ocr): revert layout updates
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* chore(ocr): update e2e OCR test data
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): avoid to swallow tesseract errors causing orientation detection failures
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* chore(ocr): revert layout updates
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* chore(ocr): update e2e OCR test data
* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`
* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`
* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`
* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation
* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`
---------
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
2025-05-21 18:12:33 +02:00
Christoph Auer
90875247e5
feat: Establish confidence estimation for document and pages ( #1313 )
...
* Establish confidence field, propagate layout confidence through
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add OCR confidence and parse confidence (stub)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add parse quality rules, use 5% percentile for overall and parse scores
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Heuristic updates
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fix garbage regex
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Move grade to page
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Introduce mean_score and low_score, consistent aggregate computations
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add confidence test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-05-21 12:32:49 +02:00
MoheyElDin Badr
f4d9d4111b
fix: Fix issue with detecting docx files, and files with upper case extensions ( #1609 )
...
fix detecting files with uppercase extensions
Signed-off-by: MoheyElDin Badr <moheyeldin.badr@gmail.com >
2025-05-20 19:42:37 +02:00
Said Gürbüz
0e00a263fa
fix: load_from_doctags static usage ( #1617 )
...
* fix load_from_doctags usage
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
* update dependencies
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
* fix lock file
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
* revert lock file
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
* update lock file
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
---------
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
2025-05-20 15:06:12 +02:00
Krishnan
f2e9c0784c
fix: incorrect force_backend_text behaviour for VLM DocTag pipelines ( #1371 )
...
* Fix force_backend_text
Signed-off-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local >
* empty commit to retrigger CI
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
2025-05-20 09:59:38 +02:00
Pedro Ribeiro
98b5eeb844
fix(pypdfium): resolve overlapping text when merging bounding boxes ( #1549 )
...
get merged_text from boundingbox instead of merging it to prevent overlaps
Signed-off-by: Pedro Ribeiro <pedro_ribeiro_93@hotmail.com >
2025-05-19 15:26:00 +02:00
AndrewTsai0406
12a0e64892
feat: add textbox content extraction in msword_backend ( #1538 )
...
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com >
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com >
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com >
---------
Signed-off-by: Andrew <tsai247365@gmail.com >
2025-05-19 15:01:36 +02:00
Vinay R Damodaran
3a04f2a367
feat: Improve parallelization for remote services API calls ( #1548 )
...
* Provide the option to make remote services call concurrent
Signed-off-by: Vinay Damodaran <vrdn@hey.com >
* Use yield from correctly?
Signed-off-by: Vinay Damodaran <vrdn@hey.com >
* not do amateur hour stuff
Signed-off-by: Vinay Damodaran <vrdn@hey.com >
---------
Signed-off-by: Vinay Damodaran <vrdn@hey.com >
2025-05-14 15:47:55 +02:00
jimkarag02
9f8b479f17
fix(ocr): orig field in TesseractOcrCliModel as str ( #1553 )
...
fix: ensure orig and text are both strings in TesseractOcrCliModel
Signed-off-by: Dimitris Karagatslis <dimo9.dk@gmail.com >
2025-05-14 15:05:52 +02:00
Alex Sokolov
2efb7a7c06
fix(settings): fix nested settings load via environment variables ( #1551 )
...
Signed-off-by: Alexander Sokolov <alsokoloff@gmail.com >
2025-05-14 13:42:10 +02:00
Elwin
12dab0a1e8
feat: support image/webp file type ( #1415 )
...
* support image/webp file type
Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com >
Signed-off-by: Elwin <hzywong@gmail.com >
* docs: add webp image format in supported_formats.md
Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com >
Signed-off-by: Elwin <hzywong@gmail.com >
* test: add a test case for `image/webp` file
Signed-off-by: Elwin <hzywong@gmail.com >
* style: apply styling
Signed-off-by: Elwin <hzywong@gmail.com >
* test: update test case of converting `image/webp` file with more ocr engines
Signed-off-by: Elwin <hzywong@gmail.com >
* style: apply styling
Signed-off-by: Elwin <hzywong@gmail.com >
* rename test file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com >
Signed-off-by: Elwin <hzywong@gmail.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-05-14 09:47:28 +02:00
Marco Fargetta
4046d0b2f3
fix: AsciiDoc header identification ( #1562 ) ( #1563 )
...
Fix regular expression to identify header lines in AsciiDoc avoiding to
match defined blocks.
Signed-off-by: Marco Fargetta <mfargett@redhat.com >
2025-05-13 11:17:26 +02:00
Michele Dolfi
127e38646f
fix: add smoldocling in download utils ( #1577 )
...
add smoldocling in download utils
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-05-12 10:48:07 +02:00
Cesar Berrospi Ramis
776e7ecf9a
fix(HTML): handle row spans in header rows ( #1536 )
...
* chore(HTML): log the stacktrace of errors
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* fix(HTML): handle row headers like in pivot tables
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-05-09 15:14:32 +02:00
DavidLee
f1658edbad
fix: mime error in document streams ( #1523 )
...
Update document.py
edit got file mime error
Signed-off-by: DavidLee <yongsheng_li@foxmail.com >
2025-05-06 09:30:46 +02:00
Michele Dolfi
7c705739f9
fix: usage of hashlib for FIPS ( #1512 )
...
fix usage of hashlib for FIPS
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-05-02 15:03:29 +02:00
Ihar Hrachyshka
b147331f2a
chore: restore typing hint for self.script_readers ( #1500 )
...
With future annotations, typing hints resolution is always deferred.
https://peps.python.org/pep-0563/
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com >
2025-04-30 20:33:27 +02:00
Ben Browning
4ab7e9ddfb
fix: Guard against attribute errors in TesseractOcrModel __del__ ( #1494 )
...
This moves the initialization of the `reader` and `script_readers`
attributes to before we attempt to import tesserocr, so that when later
accessing these attributes in the garbage collection method `__del__`
the attributes exist.
This requires changing the typing of the `script_readers` dict value to
`Any` because we cannot yet reference its actual strong type, since it's
a tesserocr value.
This prevents throwing an exception during garbage collection for
cases where the TesseractOcrModel instance didn't properly initialize,
like when it throws an `ImportError` during its initializer.
Signed-off-by: Ben Browning <bbrownin@redhat.com >
2025-04-30 17:51:33 +02:00
Zach Cox
cc453961a9
fix: enable cuda_use_flash_attention2 for PictureDescriptionVlmModel ( #1496 )
...
fix: enable use_cuda_flash_attention2 for PictureDescriptionVlmModel
Signed-off-by: Zach Cox <zach.s.cox@gmail.com >
2025-04-30 08:02:52 +02:00
Peter W. J. Staar
976e92e289
fix: updated the time-recorder label for reading order ( #1490 )
...
* fix: updated the time-recorder label for reading order
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
2025-04-29 13:02:53 +02:00
nkh0472
a097ccd8d5
chore: typo fix ( #1465 )
...
* typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
---------
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
2025-04-28 08:52:09 +02:00
Maxim Lysak
94d66a0765
fix: Incorrect scaling of TableModel bboxes when do_cell_matching is False ( #1459 )
...
fixing double scaling in case of do_cell_matching is False
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-04-25 12:34:12 +02:00
Cesar Berrospi Ramis
ed20124544
fix(html): handle address, details, and summary tags ( #1436 )
...
* fix(html): handle 'address' tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* fix(html): handle 'details' tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-23 09:30:59 +02:00
Eugene
8012a3e4d6
fix: Treat overflowing -v flags as DEBUG ( #1419 )
...
Signed-off-by: Eugene <fogaprod@gmail.com >
2025-04-19 11:02:41 +02:00
Michele Dolfi
5458a88464
ci: add coverage and ruff ( #1383 )
...
* add coverage calculation and push
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* new codecov version and usage of token
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* enable ruff formatter instead of black and isort
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* apply ruff lint fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* apply ruff unsafe fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add removed imports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* runs 1 on linter issues
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* finalize linter fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* Update pyproject.toml
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-14 18:01:26 +02:00
Peter W. J. Staar
c0ba88edf1
feat(cli): add option for html with split-page mode ( #1355 )
...
* updated the cli to output html in split-page mode
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* add pin for new docling-core with html split argument
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* relock with fixed html export in docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update test results
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update more tests
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update lock with docling-core fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update test results
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add again chunking extras
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-14 08:41:50 +02:00
Tim Kellogg
0de70e7991
fix: auto-recognize .xlsx, .docx and .pptx files ( #1340 )
...
* bug: auto-recognize .xlsx files
Signed-off-by: Tim Kellogg <timothy.kellogg@gmail.com >
* apply styling
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* apply to other ms office zip formats
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Tim Kellogg <timothy.kellogg@gmail.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-14 07:45:13 +02:00
Cesar Berrospi Ramis
415b877984
fix(docx): declare image_data variable when handling pictures ( #1359 )
...
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-11 13:04:00 +02:00
Rowan Skewes
250399948d
fix: Implement PictureDescriptionApiOptions.bitmap_area_threshold ( #1248 )
...
fix: Implement PictureDescriptionApiOptions.picture_area_threshold
Signed-off-by: Rowan Skewes <rowan.skewes@gmail.com >
2025-04-11 11:14:05 +02:00
Cesar Berrospi Ramis
eef2bdea77
feat(xlsx): create a page for each worksheet in XLSX backend ( #1332 )
...
* sytle(xlsx): enforce type hints in XLSX backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* feat(xlsx): create a page for each worksheet in XLSX backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* docs(xlsx): add docstrings to XLSX backend module.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* docling(xlsx): add bounding boxes and page size information in cell units
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-11 10:29:53 +02:00
Gabe Goodhart
c605edd8e9
feat: OllamaVlmModel for Granite Vision 3.2 ( #1337 )
...
* build: Add ollama sdk dependency
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* feat: Add option plumbing for OllamaVlmOptions in pipeline_options
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* feat: Full implementation of OllamaVlmModel
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* feat: Connect "granite_vision_ollama" pipeline option to CLI
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* Revert "build: Add ollama sdk dependency"
After consideration, we're going to use the generic OpenAI API instead
of the Ollama-specific API to avoid duplicate work.
This reverts commit bc6b366468cdd66b52540aac9c7d8b584ab48ad0.
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* refactor: Move OpenAI API call logic into utils.utils
This will allow reuse of this logic in a generic VLM model
NOTE: There is a subtle change here in the ordering of the text prompt and
the image in the call to the OpenAI API. When run against Ollama, this
ordering makes a big difference. If the prompt comes before the image, the
result is terse and not usable whereas the prompt coming after the image
works as expected and matches the non-OpenAI chat API.
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* refactor: Refactor from Ollama SDK to generic OpenAI API
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* fix: Linting, formatting, and bug fixes
The one bug fix was in the timeout arg to openai_image_request. Otherwise,
this is all style changes to get MyPy and black passing cleanly.
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* remove model from download enum
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* generalize input args for other API providers
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* rename and refactor
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* require flag for remote services
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* disable example from CI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add examples to docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-10 18:03:04 +02:00
Joan Fabrégat
6b696b504a
fix: Properly address page in pipeline _assemble_document when page_range is provided ( #1334 )
...
* Fixes #1333
Signed-off-by: Joan Fabrégat <j@fabreg.at >
* fix for the (dumb) MyPy type checker
Signed-off-by: Joan Fabrégat <j@fabreg.at >
---------
Signed-off-by: Joan Fabrégat <j@fabreg.at >
2025-04-10 16:11:28 +02:00
Maxim Lysak
355d8dc7a6
chore: Logo parameter in docling CLI, prints cute ascii logo ( #1294 )
...
logo parameter in docling cli, prints cute ascii logo
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-04-09 05:29:48 +02:00
Rafael Teixeira de Lima
14e9c0ce9a
fix(docx): Adding new latex symbols, simplifying how equations are added to text ( #1295 )
...
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Identify headers through inhenrited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Identify headers through inhenrited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201 )
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix(docx): Improve text parsing (#1268 )
* chore: bump version to 2.28.4 [skip ci]
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Improve text parsing
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201 )
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Flexibilize heading detection
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Fix trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Remove trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* docs: add visual grounding example (#1270 )
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* feat(docx): add text formatting and hyperlink support (#630 )
* feat: Enable markdown text formatting for docx
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix imports
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use Formatting
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle hyperlink
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle formatting properly for DocItemLabel.PARAGRAPH
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline group
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle bullet lists
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run black and mypy
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle header and footer
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline_fmt everywhere
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run precommit
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Address feedback
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix add_list_item
Signed-off-by: SimJeg <sjegou@nvidia.com >
* fix minor bugs, mark helper methods internal
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: SimJeg <sjegou@nvidia.com >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix(pptx): check if picture shape has an image attached (#1316 )
Check if picture shape has an image attached in pptx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* chore: update lock file (#1315 )
chore: update lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* docs: add plugins docs (#1319 )
add plugin docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* feat: handle <code> tags as code blocks (#1320 )
handle <code> tags as code blocks
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Identify headers through inhenrited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Signed-off-by: SimJeg <sjegou@nvidia.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com >
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: Simon Jégou <SimJeg@users.noreply.github.com >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Fernando Santos <121275806+FernandoSSI@users.noreply.github.com >
2025-04-08 17:11:37 +02:00
Fernando Santos
0499cd1c1e
feat: handle <code> tags as code blocks ( #1320 )
...
handle <code> tags as code blocks
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com >
2025-04-08 10:32:06 +02:00
Maxim Lysak
dc3bf9ceac
fix(pptx): check if picture shape has an image attached ( #1316 )
...
Check if picture shape has an image attached in pptx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-04-07 17:36:56 +02:00
Simon Jégou
bfcab3d677
feat(docx): add text formatting and hyperlink support ( #630 )
...
Run Docs CD / build-deploy-docs (push) Failing after 1m27s
Run Docs CI / build-docs (push) Failing after 52s
* feat: Enable markdown text formatting for docx
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix imports
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use Formatting
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle hyperlink
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle formatting properly for DocItemLabel.PARAGRAPH
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline group
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle bullet lists
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run black and mypy
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle header and footer
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline_fmt everywhere
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run precommit
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Address feedback
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix add_list_item
Signed-off-by: SimJeg <sjegou@nvidia.com >
* fix minor bugs, mark helper methods internal
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: SimJeg <sjegou@nvidia.com >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
2025-04-03 15:11:50 +02:00
Rafael Teixeira de Lima
d2d68747f9
fix(docx): Improve text parsing ( #1268 )
...
* chore: bump version to 2.28.4 [skip ci]
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Improve text parsing
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201 )
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Flexibilize heading detection
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Fix trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Remove trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
2025-04-02 12:56:44 +02:00