Commit Graph

389 Commits

Author SHA1 Message Date
Christoph Auer
6cd81e251a Include furniture, update tests with furniture
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-20 14:22:47 +01:00
Michele Dolfi
26dda63555 sanitize text
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-20 13:10:15 +01:00
Christoph Auer
fa6b7eeec4 Push final lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-20 12:00:55 +01:00
Christoph Auer
a89c19105c Update tests with code
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-20 08:56:38 +01:00
Christoph Auer
53ee8ea1d8 Merge from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-19 17:27:40 +01:00
Christoph Auer
857d6c4292 Add normalization, update tests again
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-19 16:55:20 +01:00
Christoph Auer
dfcc30dddb
chore: Update tests and lockfile (#1021)
Update tests and lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-19 16:51:53 +01:00
Christoph Auer
eb67337e51 Fixes, update tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-19 16:28:21 +01:00
Christoph Auer
7d55102605 Merge branch 'cau/fix-tests' of github.com:DS4SD/docling into cau/integrate-reading-order
2025-02-19 15:55:53 +01:00
Christoph Auer
c3ac8b392a Update tests and lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-19 15:54:07 +01:00
Christoph Auer
4e68da99b6 Add children to output after reading-order
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-19 15:36:07 +01:00
Panos Vagenas
27c04007bc
docs: revamp picture description example (#1015)
* docs: revamp picture description example

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* Improvements for visualization example (#1017)

* fix colab install, use granite and improve viz of description

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* switch docs to notebook

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* show results with all models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* show other vlm

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-19 11:28:54 +01:00
Christoph Auer
d788bf2a6e Merge from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-19 10:29:39 +01:00
Cesar Berrospi Ramis
7450050ace
refactor: upgrade BeautifulSoup4 with type hints (#999)
* refactor: upgrade BeautifulSoup4 with type hints

Upgrade dependency library BeautifulSoup4 to 4.13.3 (with type hints).
Refactor backends using BeautifulSoup4 to comply with type hints.
Apply style simplifications and improvements for consistency.
Remove variables and functions that are never used.
Remove code duplication between backends for parsing HTML tables.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* build: allow beautifulsoup4 version 4.12.3

Allow older version of beautifulsoup4 and ensure compatibility.
Update library dependencies.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-02-18 11:30:47 +01:00
Christoph Auer
8606b598dc Merge from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-18 11:24:53 +01:00
github-actions[bot]
75db61127c chore: bump version to 2.23.0 [skip ci]
2025-02-17 14:22:49 +00:00
Maxim Lysak
6e75f0b5d3
fix: Revise DocTags, fix iterate_items to output content_layer in items (#965)
* Testing fix for docling-core dt

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* fix: Fix code_formula test unit, update test-cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Fix code-formula model for new docling-core

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Update fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test cases for office formats

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update deps and lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Clean up imports

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-17 14:11:55 +01:00
Ahmed Nassar
77eb77bdc2
feat: Support cuda:n GPU device allocation (#694)
* Adding multi-gpu support, and cuda device allocation

Signed-off-by: ahn <ahn@zurich.ibm.com>

* Fixes pydantic exception with cuda:n
Signed-off-by: ahn <ahn@zurich.ibm.com>

* Pydantic field validator and comment restored.

Signed-off-by: ahn <ahn@zurich.ibm.com>

* chore: Accept AcceleratorDevice enum type

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reset some options to default, removed EasyOCR model wrap.
Signed-off-by: ahn <ahn@zurich.ibm.com>

* Fixed rebase issues
Signed-off-by: ahn <ahn@zurich.ibm.com>

* Revert accelerator test options
Signed-off-by: ahn <ahn@zurich.ibm.com>

---------

Signed-off-by: ahn <ahn@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: ahn <ahn@sonny.zuvela.ibm.com>
Co-authored-by: ahn <ahn@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-17 11:31:13 +01:00
Cesar Berrospi Ramis
428b656793
feat(xml-jats): parse XML JATS documents (#967)
* chore(xml-jats): separate authors and affiliations

In XML PubMed (JATS) backend, convert authors and affiliations as they
are typically rendered on PDFs.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* fix(xml-jats): replace newline character with a space

Instead of removing the newline character from text, replace it with a space character.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* feat(xml-jats): improve existing parser and extend features

Partially support lists, respect reading order, parse more sections, support equations, better text formatting.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore(xml-jats): rename PubMed objects to JATS

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-02-17 10:43:31 +01:00
Michele Dolfi
e1436a8b05
test: validate actual docitems in tests (#966)
* validate actual docitems in tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove verbose print

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* disable test generation

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-14 17:47:53 +01:00
github-actions[bot]
ffbde1d1b0 chore: bump version to 2.22.0 [skip ci]
2025-02-14 08:53:20 +00:00
Tobias Strebitzer
00d9405b0a
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)
* feat: Implement csv backend and format detection

Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>

* test: Implement csv parsing and format tests

Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>

* docs: Add example and CSV format documentation

Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>

* feat: Add support for various CSV dialects and update documentation

Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>

* feat: Add validation for delimiters and tests for inconsistent csv files

Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>

---------

Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>
2025-02-14 08:55:09 +01:00
Michele Dolfi
7493d5b01f
docs: update example Dockerfile with download CLI (#929)
update example Dockerfile with download CLI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-13 14:19:50 +01:00
Michele Dolfi
af19c03f6e
fix: update Pillow constraints (#958)
update pillow and lock deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-13 14:19:37 +01:00
Michele Dolfi
2d66e99b69
docs: Examples for picture descriptions (#951)
* add more examples for picture descriptions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix merge typo

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-13 08:33:12 +01:00
Michele Dolfi
2716c7d4ff
feat: Introduce the enable_remote_services option to allow remote connections while processing (#941)
* feat: Introduce the allow_remote_services option to allow remote connections while processing

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add option in the example

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* enhance docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename to enable_remote_services

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-12 15:18:01 +01:00
Christoph Auer
48777b17fa Updates for reading-order implementation
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-12 15:14:44 +01:00
Michele Dolfi
5101e2519e
feat: allow artifacts_path to be defined as ENV (#940)
* allow the artifacts_path to be defined as ENV

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add check if artifacts_path exists and is dir

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-12 13:08:37 +01:00
Nikos Livathinos
c47ae700ec
fix: Fix the initialization of the TesseractOcrModel (#935)
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2025-02-11 12:27:12 +01:00
Christoph Auer
27b896b938 Updates for reading-order implementation
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-10 16:59:52 +01:00
Christoph Auer
a6ee5a4326 Add captions, footnotes and merges [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-10 14:08:32 +01:00
Christoph Auer
5aebaf58de Merge branch 'main' of github.com:DS4SD/docling into cau/integrate-reading-order
2025-02-10 12:46:16 +01:00
Christoph Auer
46d7342671 Update lockfile [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-10 12:45:37 +01:00
Christoph Auer
2046ffbbb0 Merge from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-10 12:43:37 +01:00
github-actions[bot]
de462090e7 chore: bump version to 2.21.0 [skip ci]
2025-02-10 11:43:05 +00:00
Christoph Auer
cf78d5b7b9
feat: Add content_layer property to items to address body, furniture and other roles (#735)
* feat: Pass predicted page-headers and page-footers through to DoclingDocument furniture

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: Update all test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: update all test cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: update all test cases again

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lock

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lock to final docling-core

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-10 12:07:49 +01:00
Christoph Auer
f875fbc6cf Update reading-order model branch
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-10 09:51:52 +01:00
github-actions[bot]
3e26597995 chore: bump version to 2.20.0 [skip ci]
2025-02-07 17:46:36 +00:00
Michele Dolfi
c18f47c5c0
fix: remove unused httpx (#919)
* remove unused httpx

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use requests instead of httpx

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove more usage of httpx

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-07 17:51:31 +01:00
Michele Dolfi
4cc6e3ea5e
feat: Describe pictures using vision models (#259)
* draft for picture description models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* vlm description using AutoModelForVision2Seq

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add generation options

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update vlm API

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* allow only localhost traffic

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename model

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* do not run with vlm api

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* more renaming

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix examples path

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* apply CLI download login

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix name of cli argument

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use with_smolvlm in models download

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-07 16:30:42 +01:00
Christoph Auer
a56dbc5f3f Implement new reading-order model, replacing DS GLM model (WIP)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-07 16:19:16 +01:00
github-actions[bot]
fba3cf9be7 chore: bump version to 2.19.0 [skip ci]
2025-02-07 13:36:54 +00:00
Michele Dolfi
02faf5376b
refactor: use org--name in artifacts-path (#912)
use org--name in artifacts-path

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-07 13:58:05 +01:00
Panos Vagenas
90b766e2ae
fix(markdown): handle nested lists (#910)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-02-07 12:55:12 +01:00
Michele Dolfi
9114ada7bc
fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903)
fix: Support for RTL programmatic documents
fix(parser): detect and handle rotated pages
fix(parser): fix bug causing duplicated text
fix(formula): improve stopping criteria
chore: update lock file
fix: temporarily constrain beautifulsoup


* switch to code formula model v1.0.1 and new test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* switch to code formula model v1.0.1 and new test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* cleaned up the data folder in the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* switch to code formula model v1.0.1 and new test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added three test-files for right-to-left

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fix black

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added new gt for test_e2e_conversion

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added new gt for test_e2e_conversion

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* Add code to expose text direction of cell

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* new test file

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix mypy reports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix example filepaths

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add test data results

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin wheel of latest docling-parse release

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use latest docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove debugging code

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix path to files in example

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Revert unwanted RTL additions

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix test data paths in examples

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-07 08:43:31 +01:00
Michele Dolfi
ed74fe2ec0
feat: new artifacts path and CLI utility (#876)
* fix artifacts path

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add docling-models utility

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* missing formatting

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename utility to docling-tools

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename download methods and deprecation warnings

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* propagate artifacts path usage for ocr models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* move function to utils

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove unused file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* simplify downloading specific model(s)

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* minor refactor

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-02-06 15:46:32 +01:00
Vladimir Gurevich
722a6eb7b9
fix(msword_backend): handle conversion error in label parsing (#896)
Updated label parsing to use `str_to_int` with a default value to prevent potential conversion errors.

Signed-off-by: Vladimir Gurevich <vladimir@beaconcure.com>
Co-authored-by: Vladimir Gurevich <vladimir@beaconcure.com>
2025-02-06 12:30:51 +01:00
Michele Dolfi
5ad6de0560
fix: enrichment models batch size and expose picture classifier (#878)
* expose picture classifier in CLI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use different batch size in each model

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove batch size from CLI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* cleanup imports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-05 11:46:01 +01:00
Panos Vagenas
17448163e7
chore: fix docs search (#880)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-02-04 11:35:34 +01:00
Nikos Livathinos
6d3fea0196
docs: Introduce example with custom models for RapidOCR (#874)
* docs: Introduce example with custom models for RapidOCR

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Exclude the example with custom RapidOCR models from the examples to run in github actions

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2025-02-04 10:07:00 +01:00