Ayraf
df140227c3
feat: support xlsm files ( #1520 )
...
* code for xlsm support
* updated support for xlsm
* updated code for xlsm support
* Update docling_parse_v4_backend.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update docling_parse_v4_backend.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update test_backend_msexcel_xlsm.py
updated the tests/test_backend_msexcel_xlsm.py:
have a function starting with test
removed all print statements
** To add an explicit assert {test}=={pred}
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update base_models.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update test_backend_msexcel.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update test_backend_msexcel_xlsm.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update document_converter.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Delete tests/test_backend_msexcel_xlsm.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* xlsm file
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* run tests
* ran tests
* Fix tests, upgrade XLSM example to a valid file
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-10 16:55:59 +02:00
Peter W. J. Staar
6613b9e98b
fix: prov for merged-elems ( #1728 )
...
* fix: prov for merged-elems
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* Reset pyproject.toml
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fix tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-10 11:22:42 +02:00
Maras Ioannis
e979750ce9
fix(tesseract): initialize df_osd to avoid uninitialized variable error ( #1718 )
...
* fix: initialize df_osd to avoid uninitialized variable error
Signed-off-by: IoannisMaras <maras2002@gmail.com >
* Fix formatting
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
* Satisfy mypy, regenerate OCR tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: IoannisMaras <maras2002@gmail.com >
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-10 10:57:45 +02:00
Michele Dolfi
f7f31137f1
fix: allow custom torch_dtype in vlm models ( #1735 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-06-10 10:52:15 +02:00
Michele Dolfi
49b10e7419
docs: add open webui ( #1734 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-06-10 09:35:20 +02:00
AndrewTsai0406
9dbcb3d7d4
fix: Improve extraction from textboxes in Word docs ( #1701 )
...
* fix/docx_text_box_extraction
Signed-off-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local >
* fix/docx_text_box_extraction
Signed-off-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local >
---------
Signed-off-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local >
Co-authored-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local >
2025-06-06 11:37:46 +02:00
Eugene
a2b83fe4ae
fix: Add WEBP to the list of image file extensions ( #1711 )
...
feat: Add WEBP to the list of image file extensions
Signed-off-by: Eugene <fogaprod@gmail.com >
2025-06-05 09:09:27 +02:00
github-actions[bot]
40df0d74ad
chore: bump version to 2.36.1 [skip ci]
v2.36.1
2025-06-04 11:43:13 +00:00
Michele Dolfi
8846f1a393
fix: remove typer and click constraints ( #1707 )
...
release typer and click constraints
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-06-04 13:06:23 +02:00
Michele Dolfi
be42b03f9b
docs: flash-attn usage and install ( #1706 )
...
* docs: flash-attn usage and install
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* fix link
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-06-04 11:09:54 +02:00
github-actions[bot]
96c54dba91
chore: bump version to 2.36.0 [skip ci]
v2.36.0
2025-06-03 13:54:25 +00:00
Michele Dolfi
cdd401847a
feat: simplify dependencies, switch to uv ( #1700 )
...
* refactor with uv
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* constraints for onnxruntime
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* more constraints
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-06-03 15:18:54 +02:00
Panos Vagenas
61d0d6c755
test: mark flaky test ( #1698 )
...
* test: cleanse Word test file
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* mark textbox file test as flaky
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* fix path usage
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-06-03 13:13:44 +02:00
Peter W. J. Staar
cfdf4cea25
feat: new vlm-models support ( #1570 )
...
* feat: adding new vlm-models support
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the transformers
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* got microsoft/Phi-4-multimodal-instruct to work
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* working on vlm's
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* refactoring the VLM part
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* all working, now serious refactoring necessary
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* refactoring the download_model
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added the formulate_prompt
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* pixtral 12b runs via MLX and native transformers
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added the VlmPredictionToken
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* refactoring minimal_vlm_pipeline
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the MyPy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added pipeline_model_specializations file
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* need to get Phi4 working again ...
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* finalising last points for vlms support
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the pipeline for Phi4
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* streamlining all code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixing the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added the html backend to the VLM pipeline
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the static load_from_doctags
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* restore stable imports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* use AutoModelForVision2Seq for Pixtral and review example (including rename)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* remove unused value
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* refactor instances of VLM models
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* skip compare example in CI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* use lowercase and uppercase only
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add new minimal_vlm example and refactor pipeline_options_vlm_model for cleaner import
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* rename pipeline_vlm_model_spec
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* move more arguments to options and simplify model init
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add supported_devices
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* remove not-needed function
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* exclude minimal_vlm
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* missing file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add message for transformers version
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* rename to specs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* use module import and remove MLX from non-darwin
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* remove hf_vlm_model and add extra_generation_args
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* use single HF VLM model class
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* remove torch type
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add docs for vision models
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-06-02 17:01:06 +02:00
github-actions[bot]
08dcacc5cb
chore: bump version to 2.35.0 [skip ci]
v2.35.0
2025-06-02 12:30:26 +00:00
Edgar Hipp
11ca4f7a7b
docs: fix typo in index.md ( #1676 )
...
Signed-off-by: Edgar Hipp <hipp.edg@gmail.com >
2025-06-02 12:35:59 +02:00
Panos Vagenas
1c8a1283c4
test: ensure utf-8 in test data utils ( #1691 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-06-02 12:13:19 +02:00
Cesar Berrospi Ramis
984cb137f6
fix: guess HTML content starting with script tag ( #1673 )
...
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-06-02 08:43:24 +02:00
Cesar Berrospi Ramis
3942923125
chore: fix or ignore runtime and deprecation warnings ( #1660 )
...
* chore: fix or catch deprecation warnings
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* chore: update poetry lock with latest docling-core
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-05-28 17:55:31 +02:00
Panos Vagenas
b3e0042813
chore: exclude data from GH Linguist ( #1671 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-05-28 15:42:34 +02:00
Cesar Berrospi Ramis
106951e71e
test: add missing ground truth files ( #1667 )
...
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-05-28 13:26:49 +02:00
Peter W. J. Staar
b356b33059
feat: Add visualization of bbox on page with html export. ( #1663 )
...
* feat: Add visualization of bbox on page with html export.
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the cli
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the cli argument to show_layout
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
2025-05-28 13:10:38 +02:00
DavidLee
51d3450915
fix: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte ( #1665 )
...
Update document.py
fix: when the MIME type is not "application/xml" or "text/plain", this raised
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte
Signed-off-by: DavidLee <yongsheng_li@foxmail.com >
2025-05-27 14:06:05 +02:00
github-actions[bot]
2579d89510
chore: bump version to 2.34.0 [skip ci]
v2.34.0
2025-05-22 18:44:45 +00:00
Said Gürbüz
c2f595d283
fix: fix ZeroDivisionError for cell_bbox.area() ( #1636 )
...
fix ZeroDivisionError for cell_bbox.area()
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
2025-05-22 13:43:33 +02:00
Clément Doumouro
45265bf8b1
feat(ocr): auto-detect rotated pages in Tesseract ( #1167 )
...
* fix(ocr): tesseract support mis-oriented documents
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): update missing test data
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): rotate image to the natural orientation before layout prediction
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): move bounding box rotation util to orientation.py
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): refactor rotation utilities
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* chore(ocr): revert layout updates
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* chore(ocr): update e2e OCR test data
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): avoid swallowing tesseract errors causing orientation detection failures
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* chore(ocr): revert layout updates
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* chore(ocr): update e2e OCR test data
* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`
* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`
* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`
* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation
* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`
---------
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
2025-05-21 18:12:33 +02:00
Christoph Auer
90875247e5
feat: Establish confidence estimation for document and pages ( #1313 )
...
* Establish confidence field, propagate layout confidence through
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add OCR confidence and parse confidence (stub)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add parse quality rules, use 5% percentile for overall and parse scores
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Heuristic updates
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fix garbage regex
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Move grade to page
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Introduce mean_score and low_score, consistent aggregate computations
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add confidence test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-05-21 12:32:49 +02:00
Václav Vančura
14d4f5b109
fix(integration): update the Apify Actor integration ( #1619 )
...
* fix(actor): remove references to missing docling_processor.py
Signed-off-by: Václav Vančura <commit@vancura.dev >
* chore(actor): update Actor README.md with recent repo URL changes
Signed-off-by: Václav Vančura <commit@vancura.dev >
* chore(actor): improve the Actor README.md local header link
Signed-off-by: Václav Vančura <commit@vancura.dev >
* chore(actor): bump the Actor version number
Signed-off-by: Václav Vančura <commit@vancura.dev >
* Update .actor/actor.json
Co-authored-by: Marek Trunkát <marek@trunkat.eu >
Signed-off-by: Jan Čurn <jan.curn@gmail.com >
---------
Signed-off-by: Václav Vančura <commit@vancura.dev >
Signed-off-by: Jan Čurn <jan.curn@gmail.com >
Co-authored-by: Jan Čurn <jan.curn@gmail.com >
Co-authored-by: Marek Trunkát <marek@trunkat.eu >
2025-05-21 02:47:55 +02:00
github-actions[bot]
84d0889829
chore: bump version to 2.33.0 [skip ci]
v2.33.0
2025-05-20 19:54:51 +00:00
MoheyElDin Badr
f4d9d4111b
fix: Fix issue with detecting docx files, and files with upper case extensions ( #1609 )
...
fix detecting files with uppercase extensions
Signed-off-by: MoheyElDin Badr <moheyeldin.badr@gmail.com >
2025-05-20 19:42:37 +02:00
Said Gürbüz
0e00a263fa
fix: load_from_doctags static usage ( #1617 )
...
* fix load_from_doctags usage
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
* update dependencies
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
* fix lock file
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
* revert lock file
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
* update lock file
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
---------
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
2025-05-20 15:06:12 +02:00
Krishnan
f2e9c0784c
fix: incorrect force_backend_text behaviour for VLM DocTag pipelines ( #1371 )
...
* Fix force_backend_text
Signed-off-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local >
* empty commit to retrigger CI
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
2025-05-20 09:59:38 +02:00
Pedro Ribeiro
98b5eeb844
fix(pypdfium): resolve overlapping text when merging bounding boxes ( #1549 )
...
get merged_text from the bounding box instead of merging it, to prevent overlaps
Signed-off-by: Pedro Ribeiro <pedro_ribeiro_93@hotmail.com >
2025-05-19 15:26:00 +02:00
AndrewTsai0406
12a0e64892
feat: add textbox content extraction in msword_backend ( #1538 )
...
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com >
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com >
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com >
---------
Signed-off-by: Andrew <tsai247365@gmail.com >
2025-05-19 15:01:36 +02:00
Panos Vagenas
7c4c356e76
chore: fix chunking example data link ( #1596 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-05-16 08:44:47 +02:00
github-actions[bot]
aeb0716bbb
chore: bump version to 2.32.0 [skip ci]
v2.32.0
2025-05-14 14:28:21 +00:00
Vinay R Damodaran
3a04f2a367
feat: Improve parallelization for remote services API calls ( #1548 )
...
* Provide the option to make remote services call concurrent
Signed-off-by: Vinay Damodaran <vrdn@hey.com >
* Use yield from correctly?
Signed-off-by: Vinay Damodaran <vrdn@hey.com >
* not do amateur hour stuff
Signed-off-by: Vinay Damodaran <vrdn@hey.com >
---------
Signed-off-by: Vinay Damodaran <vrdn@hey.com >
2025-05-14 15:47:55 +02:00
jimkarag02
9f8b479f17
fix(ocr): orig field in TesseractOcrCliModel as str ( #1553 )
...
fix: ensure orig and text are both strings in TesseractOcrCliModel
Signed-off-by: Dimitris Karagatslis <dimo9.dk@gmail.com >
2025-05-14 15:05:52 +02:00
Panos Vagenas
9f28abf061
docs: add advanced chunking & serialization example ( #1589 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-05-14 14:35:07 +02:00
Alex Sokolov
2efb7a7c06
fix(settings): fix nested settings load via environment variables ( #1551 )
...
Signed-off-by: Alexander Sokolov <alsokoloff@gmail.com >
2025-05-14 13:42:10 +02:00
Elwin
12dab0a1e8
feat: support image/webp file type ( #1415 )
...
* support image/webp file type
Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com >
Signed-off-by: Elwin <hzywong@gmail.com >
* docs: add webp image format in supported_formats.md
Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com >
Signed-off-by: Elwin <hzywong@gmail.com >
* test: add a test case for `image/webp` file
Signed-off-by: Elwin <hzywong@gmail.com >
* style: apply styling
Signed-off-by: Elwin <hzywong@gmail.com >
* test: update test case of converting `image/webp` file with more ocr engines
Signed-off-by: Elwin <hzywong@gmail.com >
* style: apply styling
Signed-off-by: Elwin <hzywong@gmail.com >
* rename test file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com >
Signed-off-by: Elwin <hzywong@gmail.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-05-14 09:47:28 +02:00
github-actions[bot]
23238c241f
chore: bump version to 2.31.2 [skip ci]
v2.31.2
2025-05-13 10:09:19 +00:00
Marco Fargetta
4046d0b2f3
fix: AsciiDoc header identification ( #1562 ) ( #1563 )
...
Fix the regular expression that identifies header lines in AsciiDoc so that
it no longer matches defined blocks.
Signed-off-by: Marco Fargetta <mfargett@redhat.com >
2025-05-13 11:17:26 +02:00
Michele Dolfi
8baa85a49d
fix: restrict click version and update lock file ( #1582 )
...
* fix click dependency and update lock file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-05-13 10:40:08 +02:00
github-actions[bot]
0d0fa6cbe3
chore: bump version to 2.31.1 [skip ci]
v2.31.1
2025-05-12 09:44:26 +00:00
Michele Dolfi
127e38646f
fix: add smoldocling in download utils ( #1577 )
...
add smoldocling in download utils
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-05-12 10:48:07 +02:00
Oleg Lavrovsky
844babb390
docs: update links in data_prep_kit ( #1559 )
...
Update data_prep_kit.md
The links were broken, since the repository was renamed. I also noticed that PDF2Parquet is now referred to as Docling2Parquet.
Signed-off-by: Oleg Lavrovsky <31819+loleg@users.noreply.github.com >
2025-05-11 20:38:25 +02:00
Cesar Berrospi Ramis
776e7ecf9a
fix(HTML): handle row spans in header rows ( #1536 )
...
* chore(HTML): log the stacktrace of errors
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* fix(HTML): handle row headers like in pivot tables
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-05-09 15:14:32 +02:00
Panos Vagenas
3220a592e7
docs: add serialization docs, update chunking docs ( #1556 )
...
* docs: add serializers docs, update chunking docs
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* update notebook to improve MD table rendering
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-05-08 21:43:01 +02:00
DavidLee
f1658edbad
fix: mime error in document streams ( #1523 )
...
Update document.py
fix error when getting the file MIME type
Signed-off-by: DavidLee <yongsheng_li@foxmail.com >
2025-05-06 09:30:46 +02:00