Richard (Huangrui) Chu
b66624bfff
fix(xlsx): speed up by detecting the true last non-empty row/column (#2404)
* Update msexcel_backend.py
Fix #2307, following the instructions in https://github.com/docling-project/docling/issues/2307#issuecomment-3327248503
Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>
* Update msexcel_backend.py
Fix error
Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>
* Fix linting issues
Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>
* Add test files and data (Signed-off-by: Huangrui Chu <huangrui.chu.1999@gmail.com>)
Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>
* resolve conflict with test_backend_msexcel; update the boundary
Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>
* chore(xlsx): use a dataclass to represent a bounding rectangle in worksheets
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(xlsx): increase parsing speed by iterating on 'sheet._cells'
Increase the parsing speed of the spreadsheet backend by iterating on 'sheet._cells',
since its size is proportional to the number of created cells.
Rename the test file to align it with the other test files.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-10-21 08:08:20 +02:00
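A minimal sketch of the idea behind this change, not the actual backend code: instead of scanning a worksheet's full declared dimensions, iterate only over the cells that were actually created and keep the extent of the non-empty ones. The plain dictionary below is an assumed stand-in for a sparse cell store keyed by (row, column), and `BoundingRect` and `find_true_bounds` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class BoundingRect:
    """Hypothetical bounding rectangle (1-based, inclusive on all sides)."""

    min_row: int
    min_col: int
    max_row: int
    max_col: int


def find_true_bounds(cells: Dict[Tuple[int, int], Any]) -> Optional[BoundingRect]:
    """Return the smallest rectangle enclosing all non-empty cells.

    The cost is proportional to the number of stored cells, not to the
    declared sheet dimensions. Returns None if every cell is empty.
    """
    rows = []
    cols = []
    for (row, col), value in cells.items():
        # Skip cells that exist (for example, only styled) but hold no content.
        if value is None or value == "":
            continue
        rows.append(row)
        cols.append(col)
    if not rows:
        return None
    return BoundingRect(min(rows), min(cols), max(rows), max(cols))


# Assumed sample data: a 2x2 table plus one empty cell far away, which would
# inflate the declared dimensions without adding any content.
cells = {
    (1, 1): "name",
    (1, 2): "qty",
    (2, 1): "apple",
    (2, 2): 3,
    (1000, 50): None,
}
bounds = find_true_bounds(cells)
print(bounds)
```

With the sample data, the empty cell at row 1000 is ignored, so the rectangle covers only the 2x2 table rather than 1000 rows by 50 columns.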
Cesar Berrospi Ramis
cce18b2ff7
fix: deal with chartsheets in workbooks (#2433)
* fix(xlsx): deal with chartsheets in workbooks
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(xlsx): align test file names
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-10-10 15:06:38 +02:00
Qiefan Jiang
a283ccff25
feat(msexcel): set ContentLayer.INVISIBLE for invisible sheet (#1876)
* feat(msexcel): ignore invisible sheet
* DCO Remediation Commit for Qiefan Jiang <jiangqiefan@bytedance.com>
I, Qiefan Jiang <jiangqiefan@bytedance.com>, hereby add my Signed-off-by to this commit: ca391f4908f44f301de54a97057f0b809f5ce66c
Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>
* retain invisible sheet with ContentLayer.INVISIBLE
Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>
* update UT
Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>
* fix: use Optional for python3.9
Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>
* DCO Remediation Commit for Qiefan Jiang <jiangqiefan@bytedance.com>
I, Qiefan Jiang <jiangqiefan@bytedance.com>, hereby add my Signed-off-by to this commit: a34371a90e
Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>
---------
Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com>
2025-09-01 13:53:45 +02:00
Ayraf
df140227c3
feat: support xlsm files (#1520)
* code for xlsm support
* updated support for xlsm
* updated code for xlsm support
* Update docling_parse_v4_backend.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>
* Update docling_parse_v4_backend.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>
* Update test_backend_msexcel_xlsm.py
updated the tests/test_backend_msexcel_xlsm.py:
have a function starting with test
removed all print statements
** To add an explicit assert {test}=={pred}
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>
* Update base_models.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>
* Update test_backend_msexcel.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>
* Update test_backend_msexcel_xlsm.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>
* Update document_converter.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>
* Delete tests/test_backend_msexcel_xlsm.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>
* xlsm file
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>
* run tests
* ran tests
* Fix tests, upgrade XLSM example to a valid file
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-06-10 16:55:59 +02:00
Peter W. J. Staar
a458e298ca
fix: added extraction of byte-images in excel (#804)
* fix(msexcel): ignore Mypy checking for _find_images_in_sheet function
Signed-off-by: Jiun An Tsai <andrew@247365-Macbook.local>
* fixed some issues
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* pinned pillow in pyproject
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Jiun An Tsai <andrew@247365-Macbook.local>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Jiun An Tsai <andrew@247365-Macbook.local>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-01-24 18:48:02 +01:00
Peter W. J. Staar
926dfd29d5
feat: added excel backend (#334)
* feat: added excel backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* first msexcel backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added tooling for the cli
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* first working version for excel parsing of tables
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added proper typing for mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added proper typing for mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* refactor EXCEL to XLSX
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the unit tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* ran poetry lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* adding images to output [WIP]
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the msexcel
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the msexcel (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added tests for merged cells in excel
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-19 12:21:17 +01:00