Commit Graph

158 Commits

Author SHA1 Message Date
Christoph Auer
92b461b9ab Merge branch 'main' of github.com:DS4SD/docling into dev/test_dt_cleanup 2025-02-17 11:36:09 +01:00
Ahmed Nassar
77eb77bdc2
feat: Support cuda:n GPU device allocation (#694)
* Adding multi-gpu support, and cuda device allocation

Signed-off-by: ahn <ahn@zurich.ibm.com>

* Fixes pydantic exception with cuda:n
Signed-off-by: ahn <ahn@zurich.ibm.com>

* Pydantic field validator and comment restored.

Signed-off-by: ahn <ahn@zurich.ibm.com>

* chore: Accept AcceleratorDevice enum type

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Resetted some options to default, removed EasyOCR model wrap.
Signed-off-by: ahn <ahn@zurich.ibm.com>

* Fixed rebased issues
Signed-off-by: ahn <ahn@zurich.ibm.com>

* Revert accelerator test options
Signed-off-by: ahn <ahn@zurich.ibm.com>

---------

Signed-off-by: ahn <ahn@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: ahn <ahn@sonny.zuvela.ibm.com>
Co-authored-by: ahn <ahn@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-17 11:31:13 +01:00
Cesar Berrospi Ramis
428b656793
feat(xml-jats): parse XML JATS documents (#967)
* chore(xml-jats): separate authors and affiliations

In XML PubMed (JATS) backend, convert authors and affiliations as they
are typically rendered on PDFs.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* fix(xml-jats): replace new line character by a space

Instead of removing new line character from text, replace it by a space character.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* feat(xml-jats): improve existing parser and extend features

Partially support lists, respect reading order, parse more sections, support equations, better text formatting.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore(xml-jats): rename PubMed objects to JATS

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-02-17 10:43:31 +01:00
Christoph Auer
3486fbe9a9 fix: Fix code-formula model for new docling-core
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-14 15:17:59 +01:00
Christoph Auer
f46c3990b0 fix: Fix code_formula test unit, update test-cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-14 14:54:21 +01:00
Tobias Strebitzer
00d9405b0a
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)
* feat: Implement csv backend and format detection

Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>

* test: Implement csv parsing and format tests

Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>

* docs: Add example and CSV format documentation

Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>

* feat: Add support for various CSV dialects and update documentation

Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>

* feat: Add validation for delimiters and tests for inconsistent csv files

Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>

---------

Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>
2025-02-14 08:55:09 +01:00
Michele Dolfi
2716c7d4ff
feat: Introduce the enable_remote_services option to allow remote connections while processing (#941)
* feat: Introduce the allow_remote_services option to allow remote connections while processing

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add option in the example

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* enhance docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename to enable_remote_services

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-12 15:18:01 +01:00
Michele Dolfi
5101e2519e
feat: allow artifacts_path to be defined as ENV (#940)
* allow the artifacts_path to be defined as ENV

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add check if artifacts_path exists and is dir

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-12 13:08:37 +01:00
Nikos Livathinos
c47ae700ec
fix: Fix the initialization of the TesseractOcrModel (#935)
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2025-02-11 12:27:12 +01:00
Christoph Auer
cf78d5b7b9
feat: Add content_layer property to items to address body, furniture and other roles (#735)
* feat: Pass predicted page-headers and page-footers through to DoclingDocument furniture

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: Update all test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: update all test cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: update all test cases again

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lock

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lock to final docling-core

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-10 12:07:49 +01:00
Michele Dolfi
c18f47c5c0
fix: remove unused httpx (#919)
* remove unused httpx

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use requests instead of httpx

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove more usage of httpx

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-07 17:51:31 +01:00
Michele Dolfi
4cc6e3ea5e
feat: Describe pictures using vision models (#259)
* draft for picture description models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* vlm description using AutoModelForVision2Seq

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add generation options

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update vlm API

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* allow only localhost traffic

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename model

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* do not run with vlm api

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* more renaming

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix examples path

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* apply CLI download login

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix name of cli argument

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use with_smolvlm in models download

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-07 16:30:42 +01:00
Michele Dolfi
02faf5376b
refactor: use org--name in artifacts-path (#912)
use org--name in artifacts-path

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-07 13:58:05 +01:00
Panos Vagenas
90b766e2ae
fix(markdown): handle nested lists (#910)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-02-07 12:55:12 +01:00
Michele Dolfi
9114ada7bc
fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903)
fix: Support for RTL programmatic documents
fix(parser): detect and handle rotated pages
fix(parser): fix bug causing duplicated text
fix(formula): improve stopping criteria
chore: update lock file
fix: temporary constrain beautifulsoup


* switch to code formula model v1.0.1 and new test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* switch to code formula model v1.0.1 and new test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* cleaned up the data folder in the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* switch to code formula model v1.0.1 and new test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added three test-files for right-to-left

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fix black

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added new gt for test_e2e_conversion

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added new gt for test_e2e_conversion

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* Add code to expose text direction of cell

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* new test file

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix mypy reports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix example filepaths

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add test data results

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin wheel of latest docling-parse release

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use latest docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove debugging code

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix path to files in example

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Revert unwanted RTL additions

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix test data paths in examples

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-07 08:43:31 +01:00
Michele Dolfi
ed74fe2ec0
feat: new artifacts path and CLI utility (#876)
* fix artifacts path

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add docling-models utility

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* missing formatting

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename utility to docling-tools

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename download methods and deprecation warnings

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* propagate artifacts path usage for ocr models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* move function to utils

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove unused file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* simplify downloading specific model(s)

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* minor refactor

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-02-06 15:46:32 +01:00
Vladimir Gurevich
722a6eb7b9
fix(msword_backend): handle conversion error in label parsing (#896)
Updated label parsing to use `str_to_int` with a default value to prevent potential conversion errors.

Signed-off-by: Vladimir Gurevich <vladimir@beaconcure.com>
Co-authored-by: Vladimir Gurevich <vladimir@beaconcure.com>
2025-02-06 12:30:51 +01:00
Michele Dolfi
5ad6de0560
fix: enrichment models batch size and expose picture classifier (#878)
* expose picture classifier in CLI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use different batch size in each model

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove batch size from CLI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* cleanup imports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-05 11:46:01 +01:00
Panos Vagenas
5ac2887e4a
fix(markdown): fix parsing if doc ending with table (#873)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-02-03 14:38:38 +01:00
Panos Vagenas
94751a78f4
fix(markdown): add support for HTML content (#855)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-02-03 12:21:05 +01:00
Michele Dolfi
6a76b49a47
feat: Expose equation exports (#869)
* pin new docling-core and exploit it via assembler changes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update test results

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update with docling-core release

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-03 10:31:19 +01:00
Cesar Berrospi Ramis
0cd81a8122
fix(docx): merged table cells not properly converted (#857)
* fix(docx): merged cells not properly converted

Fix conversion issue of merged cells in Word tables leading to repeated text.
Simplify Word table conversion code.
Add docx file with several table formats for regression tests.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: add type hinting to docx backend

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-02-03 10:20:03 +01:00
Maxim Lysak
eff16b62cc
fix: Processing of placeholder shapes in pptx that have text but no bbox (#868)
Processing of placeholder shapes in pptx that have text but no bbox

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-03 09:33:33 +01:00
Maxim Lysak
b1cf796730
fix: KeyError in tableformer prediction (#854)
* fix for KeyError in tableformer prediction

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* chore: rewrite cumbersome dictionary checking

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-01-31 17:00:14 +01:00
Christoph Auer
70d68b6164
feat: Add option to define page range (#852)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-01-31 15:23:00 +01:00
Maxim Lysak
d727b04ad0
feat(docx): Support of SDTs in docx backend (#853)
Support of table of content containers in docx backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-01-31 14:52:24 +01:00
Maxim Lysak
2c037ae62e
fix: Fixed docx import with headers that are also lists (#842)
* Fix for docx when headers are also lists, now recorded as appropriate headers and subheaders, unit test included

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Update docling/backend/msword_backend.py

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>

* Update docling/backend/msword_backend.py

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-01-31 10:51:21 +01:00
Michele Dolfi
2a1f8afe7e
fix: use new add_code in html backend and add more typing hints (#850)
fix add_code in html backend and add more typing hints

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-01-31 09:54:17 +01:00
Panos Vagenas
bccb022fc8
fix(markdown): fix empty block handling (#843)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-01-30 16:22:29 +01:00
Maxim Lysak
fea0a99a95
fix: Fix for the crash when encountering WMF images in pptx and docx (#837)
* Fix for the crash when encountering WMF images in pptx and docx backends on non Windows platforms

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated faq

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-01-30 14:58:27 +01:00
Panos Vagenas
5aed9f8aeb
fix: fix single newline handling in MD backend (#824)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-01-28 19:05:55 +01:00
Cesar Berrospi Ramis
adf6353483
fix: use file extension if filetype fails with PDF (#827)
Filetype library may not identify some files as PDF. Leverage the file extension
as a simple solution.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-01-28 19:03:54 +01:00
Michele Dolfi
6882e6c38d
feat(CLI): Expose code and formula models in the CLI (#820)
feat: expose code and formula models in the CLI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-01-28 06:26:03 +01:00
Cesar Berrospi Ramis
a112d7a035
fix: parse html with omitted body tag (#818)
* fix: parse HTML files without body tag

Parse HTML files without 'body' tag, since it is optional in HTML5 specification.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* test: ensure docling converts HTML without body tag

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-01-27 16:59:00 +01:00
Panos Vagenas
95b293a723
feat: add platform info to CLI version printout (#816)
* feat: add platform info to CLI version printout

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* Update main.py

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* add Python implementation & language versions

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-01-27 16:04:57 +01:00
Yorick Terweijden
53327552e8
feat(ocr): expose rec_keys_path in RapidOcrOptions to support custom dictionaries (#786)
* Expose `rec_keys_path` in RapidOcrOptions to support custom dictionaries

- Added `rec_keys_path` to `RapidOcrOptions` to align with RapidOCR's capability to use custom character dictionaries.
- Passed `rec_keys_path` to `RapidOcrModel` initialization, ensuring the recognition model can load the correct dictionary (e.g., for Latin characters).

Signed-off-by: Yorick Terweijden <yorick@spread.ai>

* style(rapidocr-options): fix alignment of `rec_keys_path` comment

Adjusted the alignment of the comment for `rec_keys_path` to maintain consistent formatting. No functional changes were made.

Signed-off-by: Yorick Terweijden <yorick@spread.ai>

---------

Signed-off-by: Yorick Terweijden <yorick@spread.ai>
2025-01-27 13:38:15 +01:00
Cesar Berrospi Ramis
c2ae1cc4ca
docs: description of supported formats and backends (#788)
* chore: remove type-ignore marks for attaching text to non GroupItems

After commit b74208 of docling-core, text items can be attached to any NodeItem
and therefore the ignore[arg-type] type marks can be removed.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* test: remove unnecessary imports

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* docs: add documentation on supported formats and backends

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* docs: add notebook example with XML backends

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-01-26 08:10:33 +01:00
Nikos Livathinos
3be2fb581f
feat: Introduce automatic language detection in TesseractOcrCliModel (#800)
* feat: Introduce automatic language detection in tesseract_ocr_cli model. Extend unit tests.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* docs: Add example how to use "auto" language with tesseract OCR engines

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Refactor the TesseractOcrModel and TesseractOcrCliModel to validate if the auto-detected
language is installed in the system and if not fall back to a default option without language.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2025-01-26 08:07:56 +01:00
Peter W. J. Staar
a458e298ca
fix: added extraction of byte-images in excel (#804)
* fix(msexcel): ignore Mypy checking for _find_images_in_sheet function

Signed-off-by: Jiun An Tsai <andrew@247365-Macbook.local>

* fixed some issues

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* pinned pillow in pyproject

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Jiun An Tsai <andrew@247365-Macbook.local>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Jiun An Tsai <andrew@247365-Macbook.local>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-01-24 18:48:02 +01:00
Matteo
16a218d871
feat: New document picture classifier (#805)
* figure classifier

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* gt for e2e tests

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* tests

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

---------

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
2025-01-24 18:05:51 +01:00
Panos Vagenas
88a0e66adc
feat: add Docling JSON ingestion (#783)
* feat: add Docling JSON ingestion

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* update conversion as per review comments, add tests, revert Docling JSON disambiguation, document intricacies

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* Update docling/backend/json/docling_json_backend.py

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-01-24 18:05:23 +01:00
Yusik Kim
e9768ae6a5
chore: expose draw_clusters function (#803)
feat: expose draw_clusters function



add type annotations to function signature

Signed-off-by: Yusik Kim <kmyusk@gmail.com>
2025-01-24 17:35:29 +01:00
Matteo
3213b247ad
feat: Code and equation model for PDF and code blocks in markdown (#752)
* propagated changes for new CodeItem class

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* Rebased branch on latest main. changes for CodeItem

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* removed unused files

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* chore: update lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* pin latest docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update docling-core pinning

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use new add_code in backends and update typing in MD backend

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* added if statement for backend

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* removed unused import

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* removed print statements

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* gt for new pdf

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* Update docling/pipeline/standard_pdf_pipeline.py

Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com>

* fixed doc comment of __call__ function of code_formula_model

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* fix artifacts_path type

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* move imports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* move expansion_factor to base class

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2025-01-24 16:54:22 +01:00
Pavel Denisov
8543c22687
feat: add "auto" language for TesseractOcr (#759)
* Add "auto" language for TesseractOcr

Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>

* Add tesseract-ocr-script-latn installation for the "auto" language

Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>

* Modify "auto" language in TesseractOcr to initialize the script readers lazily

Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>

* Finalize script readers

Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>

* Fix script models prefix for Linux

Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>

---------

Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>
2025-01-23 12:40:50 +01:00
Michele Dolfi
57fc28d3d8
refactor: allow the usage of backends in the enrich models and generalize the interface (#742)
* fix get image with cropbox

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* allow the usage of backends in the enrich models and generalize the interface

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* move logic in BaseTextImageEnrichmentModel

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* renaming

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-01-15 09:52:38 +01:00
Christoph Auer
5a060f237d
fix: Improve OCR results, stricten criteria before dropping bitmap areas (#719)
fix: Properly care for all bitmap elements in OCR

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-01-10 10:38:49 +01:00
Christoph Auer
5cb4cf6f19
fix: Correct scaling of debug visualizations, tune OCR (#700)
* fix: Correct scaling of debug visualizations, tune OCR

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: remove unused imports

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: Update docling-core

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-01-08 12:26:44 +01:00
Christoph Auer
42856fdf79
fix: Let BeautifulSoup detect the HTML encoding (#695)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-01-07 15:49:28 +01:00
Jinfeng Sun
d49650c54f
fix(mspowerpoint): handle invalid images in PowerPoint slides (#650)
- Add error handling for images that cannot be loaded by Pillow
- Improve resilience when encountering corrupted or unsupported image formats
- Maintain processing of other slide elements even if an image fails to load

Signed-off-by: Tendo33 <sjf1998112@gmail.com>
2025-01-07 13:58:10 +01:00
Luke Harrison
0ee849e8bc
feat: added http header support for document converter and cli (#642)
* added http header support for document converter and cli

Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com>

* fixed formatting and typing issues

Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com>

* use pydantic to parse dict

suggested by @dolfim-ibm

Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Luke Harrison <luke.harrison1@ibm.com>

---------

Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com>
Signed-off-by: Luke Harrison <luke.harrison1@ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2025-01-07 10:15:14 +01:00