Updated label parsing to use `str_to_int` with a default value to prevent potential conversion errors.
Signed-off-by: Vladimir Gurevich <vladimir@beaconcure.com>
Co-authored-by: Vladimir Gurevich <vladimir@beaconcure.com>
* docs: Introduce example with custom models for RapidOCR
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Exclude the example with custom RapidOCR models from the examples to run in github actions
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix(docx): merged cells not properly converted
Fix conversion issue of merged cells in Word tables leading to repeated text.
Simplify Word table conversion code.
Add docx file with several table formats for regression tests.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore: add type hinting to docx backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Processing of placeholder shapes in pptx that have text but no bbox
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Support of table of content containers in docx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
* Fix for docx when headers are also lists, now recorded as appropriate headers and subheaders, unit test included
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Update docling/backend/msword_backend.py
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>
* Update docling/backend/msword_backend.py
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* test: update results with new docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* update all deps in the lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix table in test results
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix version for python3.13
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* latest poetry version in CI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* activate py3.13 in CI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* update docs about python 3.13
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* test with rapidocr only on python <3.13
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Fix for the crash when encountering WMF images in pptx and docx backends on non Windows platforms
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Updated faq
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
* updated the readme with upcoming features
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the docs-index
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* chore: Add example for inspection of picture content
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Test case re-generation
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Test case re-generation only on CPU
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Add missing GT files
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Filetype library may not identify some files as PDF. Leverage the file extension
as a simple solution.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* fix: parse HTML files without body tag
Parse HTML files without 'body' tag, since it is optional in HTML5 specification.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* test: ensure docling converts HTML without body tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>