Maksym Lysak
66532eadb6
More elegant solution for removing the input prompt
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:20:10 +01:00
Maksym Lysak
e486eb1720
Cleaned up unnecessary logging
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:20:10 +01:00
Christoph Auer
55fa4eb4e3
Fix repo id
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-24 13:20:05 +01:00
Christoph Auer
6f9f4f4aee
Update minimal smoldocling example
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-24 13:18:25 +01:00
Maksym Lysak
b1df461ca8
Added captions for the images for SmolDocling assembly code, improved provenance definition for all elements
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:15:19 +01:00
Maksym Lysak
d7abe1b1cd
Updated example of Smol Docling usage
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:15:19 +01:00
Maksym Lysak
479ee239aa
New assembly code for latest model revision, updated prompt and parsing of doctags, updated logging
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:15:19 +01:00
Maksym Lysak
7c4ab5c716
Moved artifacts_path for SmolDocling into vlm_options instead of global pipeline option
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:15:19 +01:00
Maksym Lysak
f2751e11f9
Introduced SmolDoclingOptions to configure model parameters (such as query and artifacts path) via client code; see the example in minimal_smol_docling. Provision for other potential all-in-one VLM models.
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:15:15 +01:00
Maksym Lysak
88b9ac6706
Fixed the doctags starting tag that broke elements on the first line during assembly
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:12:55 +01:00
Maksym Lysak
0fe12d819a
Updated vlm pipeline assembly and smol docling model code to support updated doctags
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:12:55 +01:00
Maksym Lysak
f6d123a01c
Flipped keep_backend to True for vlm_pipeline assembly to work
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:12:55 +01:00
Maksym Lysak
9901729d8c
Exposed "force_backend_text" as pipeline parameter
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:12:51 +01:00
Maksym Lysak
0dc3ac43b1
Added capability for vlm_pipeline to grab text from preconfigured backend
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:57 +01:00
Maksym Lysak
e0929781f4
Added tokens/sec measurement, improved example
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:57 +01:00
Maksym Lysak
437053572d
Replaced hardcoded otsl tokens with the ones from docling-core tokens.py enum
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:57 +01:00
Maksym Lysak
2a43c199d5
Cleaned up logs, added pages to vlm_pipeline, basic timing per page measurement in smol_docling models
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:57 +01:00
Maksym Lysak
61bb9dbba2
Properly propagate image data per page, together with predicted tags, in the VLM pipeline. This enables correct figure extraction and page numbers in provenances.
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:56 +01:00
Maksym Lysak
01c46e24b1
Fix for table span compute in vlm_pipeline
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:56 +01:00
Maksym Lysak
ef079e4e78
Enabled figure support in vlm_pipeline
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:56 +01:00
Maksym Lysak
1b968e4984
Fixes to preserve page image and demo export to html
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:56 +01:00
Maksym Lysak
3c4c647615
WIP, first working code for inference of SmolDocling, and vlm pipeline assembly code, example included.
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:56 +01:00
Maksym Lysak
03c8d45790
wip smolDocling inference and vlm pipeline
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:52 +01:00
Christoph Auer
dc3a388aa2
Skeleton for SmolDocling model and VLM Pipeline
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 11:46:04 +01:00
Suehtam
1d17e7397a
test: avoid testing exact JSON in CSV backend ( #1038 )
...
* feat: updated verify_export
Moved verify_export to verify_utils
Reuse verify_export in tests
Signed-off-by: Matheus Abdias <matheusfabdias@gmail.com>
* feat: replace verify_export with verify_document in CSV conversion tests
Signed-off-by: Matheus Abdias <matheusfabdias@gmail.com>
---------
Signed-off-by: Matheus Abdias <matheusfabdias@gmail.com>
2025-02-24 08:10:40 +01:00
github-actions[bot]
d8a81c3168
chore: bump version to 2.24.0 [skip ci]
2025-02-20 18:31:20 +00:00
Christoph Auer
c93e36988f
feat: Implement new reading-order model ( #916 )
...
* Implement new reading-order model, replacing DS GLM model (WIP)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update reading-order model branch
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lockfile [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add captions, footnotes and merges [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updates for reading-order implementation
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updates for reading-order implementation
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update tests and lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes, update tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add normalization, update tests again
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update tests with code
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Push final lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* sanitize text
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Include furniture, update tests with furniture
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix content_layer assignment
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* chore: Delete empty file docling/models/ds_glm_model.py
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
2025-02-20 17:51:17 +01:00
github-actions[bot]
c031a7ae47
chore: bump version to 2.23.1 [skip ci]
2025-02-20 16:26:41 +00:00
Cesar Berrospi Ramis
1ac010354f
test: avoid testing exact JSON ( #1027 )
...
* test: avoid testing exact JSON
Avoid testing exact JSON output in html and xml backends.
Reuse the JSON verify helper function among backend test files.
Improve type annotations in html backend.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* Update tests/test_backend_patent_uspto.py
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2025-02-20 16:20:07 +01:00
fanszoro
6796f0a132
fix: Runtime error when Pandas Series is not always of string type ( #1024 )
...
Signed-off-by: fan <fansluck@qq.com>
2025-02-20 15:41:41 +01:00
Christoph Auer
dfcc30dddb
chore: Update tests and lockfile ( #1021 )
...
Update tests and lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-19 16:51:53 +01:00
Panos Vagenas
27c04007bc
docs: revamp picture description example ( #1015 )
...
* docs: revamp picture description example
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* Improvements for visualization example (#1017)
* fix colab install, use granite and improve viz of description
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* switch docs to notebook
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* show results with all models
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* show other vlm
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-19 11:28:54 +01:00
Cesar Berrospi Ramis
7450050ace
refactor: upgrade BeautifulSoup4 with type hints ( #999 )
...
* refactor: upgrade BeautifulSoup4 with type hints
Upgrade dependency library BeautifulSoup4 to 4.13.3 (with type hints).
Refactor backends using BeautifulSoup4 to comply with type hints.
Apply style simplifications and improvements for consistency.
Remove variables and functions that are never used.
Remove code duplication between backends for parsing HTML tables.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* build: allow beautifulsoup4 version 4.12.3
Allow older version of beautifulsoup4 and ensure compatibility.
Update library dependencies.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-02-18 11:30:47 +01:00
github-actions[bot]
75db61127c
chore: bump version to 2.23.0 [skip ci]
2025-02-17 14:22:49 +00:00
Maxim Lysak
6e75f0b5d3
fix: Revise DocTags, fix iterate_items to output content_layer in items ( #965 )
...
* Testing fix for docling-core dt
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* fix: Fix code_formula test unit, update test-cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Fix code-formula model for new docling-core
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Update fixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test cases for office formats
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update deps and lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Clean up imports
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-17 14:11:55 +01:00
Ahmed Nassar
77eb77bdc2
feat: Support cuda:n GPU device allocation ( #694 )
...
* Adding multi-gpu support, and cuda device allocation
Signed-off-by: ahn <ahn@zurich.ibm.com>
* Fixes pydantic exception with cuda:n
Signed-off-by: ahn <ahn@zurich.ibm.com>
* Pydantic field validator and comment restored.
Signed-off-by: ahn <ahn@zurich.ibm.com>
* chore: Accept AcceleratorDevice enum type
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Reset some options to default, removed EasyOCR model wrap.
Signed-off-by: ahn <ahn@zurich.ibm.com>
* Fixed rebase issues
Signed-off-by: ahn <ahn@zurich.ibm.com>
* Revert accelerator test options
Signed-off-by: ahn <ahn@zurich.ibm.com>
---------
Signed-off-by: ahn <ahn@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: ahn <ahn@sonny.zuvela.ibm.com>
Co-authored-by: ahn <ahn@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-17 11:31:13 +01:00
Cesar Berrospi Ramis
428b656793
feat(xml-jats): parse XML JATS documents ( #967 )
...
* chore(xml-jats): separate authors and affiliations
In XML PubMed (JATS) backend, convert authors and affiliations as they
are typically rendered on PDFs.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* fix(xml-jats): replace newline character with a space
Instead of removing the newline character from text, replace it with a space character.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* feat(xml-jats): improve existing parser and extend features
Partially support lists, respect reading order, parse more sections, support equations, better text formatting.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore(xml-jats): rename PubMed objects to JATS
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-02-17 10:43:31 +01:00
Michele Dolfi
e1436a8b05
test: validate actual docitems in tests ( #966 )
...
* validate actual docitems in tests
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* remove verbose print
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* disable test generation
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-14 17:47:53 +01:00
github-actions[bot]
ffbde1d1b0
chore: bump version to 2.22.0 [skip ci]
2025-02-14 08:53:20 +00:00
Tobias Strebitzer
00d9405b0a
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument ( #945 )
...
* feat: Implement csv backend and format detection
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>
* test: Implement csv parsing and format tests
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>
* docs: Add example and CSV format documentation
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>
* feat: Add support for various CSV dialects and update documentation
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>
* feat: Add validation for delimiters and tests for inconsistent csv files
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>
---------
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>
2025-02-14 08:55:09 +01:00
Michele Dolfi
7493d5b01f
docs: update example Dockerfile with download CLI ( #929 )
...
update example Dockerfile with download CLI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-13 14:19:50 +01:00
Michele Dolfi
af19c03f6e
fix: update Pillow constraints ( #958 )
...
update pillow and lock deps
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-13 14:19:37 +01:00
Michele Dolfi
2d66e99b69
docs: Examples for picture descriptions ( #951 )
...
* add more examples for picture descriptions
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix merge typo
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-13 08:33:12 +01:00
Michele Dolfi
2716c7d4ff
feat: Introduce the enable_remote_services option to allow remote connections while processing ( #941 )
...
* feat: Introduce the allow_remote_services option to allow remote connections while processing
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add option in the example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* enhance docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* rename to enable_remote_services
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-12 15:18:01 +01:00
Michele Dolfi
5101e2519e
feat: allow artifacts_path to be defined as ENV ( #940 )
...
* allow the artifacts_path to be defined as ENV
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add check if artifacts_path exists and is dir
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-12 13:08:37 +01:00
Nikos Livathinos
c47ae700ec
fix: Fix the initialization of the TesseractOcrModel ( #935 )
...
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2025-02-11 12:27:12 +01:00
github-actions[bot]
de462090e7
chore: bump version to 2.21.0 [skip ci]
2025-02-10 11:43:05 +00:00
Christoph Auer
cf78d5b7b9
feat: Add content_layer property to items to address body, furniture and other roles ( #735 )
...
* feat: Pass predicted page-headers and page-footers through to DoclingDocument furniture
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* chore: Update all test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: update all test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: update all test cases again
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lock
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lock to final docling-core
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-10 12:07:49 +01:00
github-actions[bot]
3e26597995
chore: bump version to 2.20.0 [skip ci]
2025-02-07 17:46:36 +00:00
Michele Dolfi
c18f47c5c0
fix: remove unused httpx ( #919 )
...
* remove unused httpx
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* use requests instead of httpx
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* remove more usage of httpx
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-07 17:51:31 +01:00