Simon Jégou
bfcab3d677
feat(docx): add text formatting and hyperlink support ( #630 )
...
Run Docs CD / build-deploy-docs (push) Failing after 1m27s
Run Docs CI / build-docs (push) Failing after 52s
* feat: Enable markdown text formatting for docx
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix imports
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use Formatting
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle hyperlink
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle formatting properly for DocItemLabel.PARAGRAPH
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline group
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle bullet lists
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run black and mypy
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle header and footer
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline_fmt everywhere
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run precommit
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Address feedback
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix add_list_item
Signed-off-by: SimJeg <sjegou@nvidia.com >
* fix minor bugs, mark helper methods internal
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: SimJeg <sjegou@nvidia.com >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
2025-04-03 15:11:50 +02:00
Rafael Teixeira de Lima
6eb718f849
feat: equations to latex in MSWord backend (with inline groups) ( #1114 )
...
* Equation groups
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Proper handling of orphan IDs in layout postprocessing (#1118 )
* Fix the handling of orphan IDs in layout postprocessing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* chore: bump version to 2.25.2 [skip ci]
* docs: add description of DOCLING_ARTIFACTS_PATH env var (#1124 )
add env var in docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix(CLI): fix help message for abort options (#1130 )
fix help message
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* perf: New revision code formula model and document picture classifier (#1140 )
* new version code formula model
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
* new version document picture classifier
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
* new code formula model
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
* restored original code formula test pdf
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
---------
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* feat: Use new TableFormer model weights and default to accurate model version (#1100 )
* feat: New tableformer model weights [WIP]
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
* Updated TF version
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated tests, after merging with Main, Switched to Accurate TF model by default
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* chore: bump version to 2.26.0 [skip ci]
* fix: Pass tests, update docling-core to 2.22.0 (#1150 )
fix: update docling-core to 2.22.0
Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* Updating content hash
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com >
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-03-13 15:12:22 +01:00
Cesar Berrospi Ramis
0cd81a8122
fix(docx): merged table cells not properly converted ( #857 )
...
* fix(docx): merged cells not properly converted
Fix conversion issue of merged cells in Word tables leading to repeated text.
Simplify Word table conversion code.
Add docx file with several table formats for regression tests.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* chore: add type hinting to docx backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-02-03 10:20:03 +01:00
Maxim Lysak
2c037ae62e
fix: Fixed docx import with headers that are also lists ( #842 )
...
* Fix for docx when headers are also lists, now recorded as appropriate headers and subheaders, unit test included
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Update docling/backend/msword_backend.py
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com >
* Update docling/backend/msword_backend.py
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-01-31 10:51:21 +01:00
Maxim Lysak
d0a1180478
fix: Fixes for wordx ( #432 )
...
* fixes for referencing drawing blip in wordx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated lxml dependency version
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-26 14:44:43 +01:00
Maxim Lysak
fb8ba861e2
fix: Handling of single-cell tables in DOCX backend ( #314 )
...
* Handling of single-cell tables in DOCX backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* returned try-catch on tables handling
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* cleaned
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* proceed processing the content of single cell table as if its just part of the body
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added example of trickly 1 cell table docx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-12 15:20:55 +01:00
Peter W. J. Staar
f542460af3
fix: fix duplicate title and heading + add e2e tests for html and docx ( #186 )
...
* add real e2e tests for html and docx
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the output of itxt
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the text
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the tests (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the examples (1)
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the output of the test
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the tests, moved the ground-truth
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* moved the ground-truth data
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the html tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* restructure title fix (#187 )
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-10-30 13:14:56 +01:00