Maksym Lysak
9ecec1d330
Updated poetry.lock
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 17:27:50 +01:00
Maksym Lysak
923f766ada
Replaced remaining strings to appropriate enums
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 16:54:59 +01:00
Maksym Lysak
a095a7c5b7
removing changes from base_pipeline
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 15:13:59 +01:00
Maksym Lysak
a7a1f32b10
Added example on how to get original predicted doctags in minimal_smol_docling
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 14:39:18 +01:00
Maksym Lysak
1dbedcbb4e
removed pipeline_options.generate_table_images from vlm_pipeline (deprecated in the pipelines)
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 14:17:06 +01:00
Maksym Lysak
0c60ef199a
Moved keep_backend = True to vlm pipeline
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:50:40 +01:00
Maksym Lysak
853544ba11
Addressed PR comments: added an enabled property to SmolDocling and a related VLM pipeline option, plus a few other minor things
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:46:47 +01:00
Maksym Lysak
b0935daec4
Removed special html code wrapping when exporting to docling document, cleaned up comments
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:22:50 +01:00
Maksym Lysak
b12f5ba80f
removed minimal_smol_docling example from CI checks
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:20:10 +01:00
Maksym Lysak
66532eadb6
More elegant solution in removing the input prompt
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:20:10 +01:00
Maksym Lysak
e486eb1720
Cleaned up unnecessary logging
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:20:10 +01:00
Christoph Auer
55fa4eb4e3
Fix repo id
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-24 13:20:05 +01:00
Christoph Auer
6f9f4f4aee
Update minimal smoldocling example
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-24 13:18:25 +01:00
Maksym Lysak
b1df461ca8
Added captions for the images for SmolDocling assembly code, improved provenance definition for all elements
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:15:19 +01:00
Maksym Lysak
d7abe1b1cd
Updated example of Smol Docling usage
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:15:19 +01:00
Maksym Lysak
479ee239aa
New assembly code for latest model revision, updated prompt and parsing of doctags, updated logging
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:15:19 +01:00
Maksym Lysak
7c4ab5c716
Moved artifacts_path for SmolDocling into vlm_options instead of global pipeline option
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:15:19 +01:00
Maksym Lysak
f2751e11f9
Introduced SmolDoclingOptions to configure model parameters (such as query and artifacts path) via client code, see example in minimal_smol_docling. Provisioning for other potential vlm all-in-one models.
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:15:15 +01:00
Maksym Lysak
88b9ac6706
Fixed the doctags starting tag, which broke elements on the first line during assembly
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:12:55 +01:00
Maksym Lysak
0fe12d819a
Updated vlm pipeline assembly and smol docling model code to support updated doctags
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:12:55 +01:00
Maksym Lysak
f6d123a01c
Flipped keep_backend to True for vlm_pipeline assembly to work
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:12:55 +01:00
Maksym Lysak
9901729d8c
Exposed "force_backend_text" as pipeline parameter
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 13:12:51 +01:00
Maksym Lysak
0dc3ac43b1
Added capability for vlm_pipeline to grab text from preconfigured backend
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:57 +01:00
Maksym Lysak
e0929781f4
Added tokens/sec measurement, improved example
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:57 +01:00
Maksym Lysak
437053572d
Replaced hardcoded otsl tokens with the ones from docling-core tokens.py enum
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:57 +01:00
Maksym Lysak
2a43c199d5
Cleaned up logs, added pages to vlm_pipeline, basic timing per page measurement in smol_docling models
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:57 +01:00
Maksym Lysak
61bb9dbba2
Properly propagating image data per page, together with predicted tags in VLM pipeline. This enables correct figure extraction and page numbers in provenances
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:56 +01:00
Maksym Lysak
01c46e24b1
Fix for table span compute in vlm_pipeline
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:56 +01:00
Maksym Lysak
ef079e4e78
Enabled figure support in vlm_pipeline
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:56 +01:00
Maksym Lysak
1b968e4984
Fixes to preserve page image and demo export to html
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:56 +01:00
Maksym Lysak
3c4c647615
WIP, first working code for inference of SmolDocling, and vlm pipeline assembly code, example included.
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:56 +01:00
Maksym Lysak
03c8d45790
wip smolDocling inference and vlm pipeline
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 12:56:52 +01:00
Christoph Auer
dc3a388aa2
Skeleton for SmolDocling model and VLM Pipeline
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-24 11:46:04 +01:00
Suehtam
1d17e7397a
test: avoid testing exact JSON in CSV backend (#1038)
...
* feat: updated verify_export
Moved verify_export to verify_utils
Reuse verify_export in tests
Signed-off-by: Matheus Abdias <matheusfabdias@gmail.com>
* feat: replace verify_export with verify_document in CSV conversion tests
Signed-off-by: Matheus Abdias <matheusfabdias@gmail.com>
---------
Signed-off-by: Matheus Abdias <matheusfabdias@gmail.com>
2025-02-24 08:10:40 +01:00
github-actions[bot]
d8a81c3168
chore: bump version to 2.24.0 [skip ci]
2025-02-20 18:31:20 +00:00
Christoph Auer
c93e36988f
feat: Implement new reading-order model (#916)
...
* Implement new reading-order model, replacing DS GLM model (WIP)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update reading-order model branch
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lockfile [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add captions, footnotes and merges [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updates for reading-order implementation
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updates for reading-order implementation
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update tests and lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes, update tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add normalization, update tests again
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update tests with code
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Push final lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* sanitize text
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Include furniture, update tests with furniture
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix content_layer assignment
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* chore: Delete empty file docling/models/ds_glm_model.py
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
2025-02-20 17:51:17 +01:00
github-actions[bot]
c031a7ae47
chore: bump version to 2.23.1 [skip ci]
2025-02-20 16:26:41 +00:00
Cesar Berrospi Ramis
1ac010354f
test: avoid testing exact JSON (#1027)
...
* test: avoid testing exact JSON
Avoid testing exact JSON output in html and xml backends.
Reuse the JSON verify helper function among backend test files.
Improve type annotations in html backend.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* Update tests/test_backend_patent_uspto.py
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2025-02-20 16:20:07 +01:00
fanszoro
6796f0a132
fix: Runtime error when Pandas Series is not always of string type (#1024)
...
Signed-off-by: fan <fansluck@qq.com>
2025-02-20 15:41:41 +01:00
Christoph Auer
dfcc30dddb
chore: Update tests and lockfile (#1021)
...
Update tests and lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-19 16:51:53 +01:00
Panos Vagenas
27c04007bc
docs: revamp picture description example (#1015)
...
* docs: revamp picture description example
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* Improvements for visualization example (#1017)
* fix colab install, use granite and improve viz of description
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* switch docs to notebook
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* show results with all models
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* show other vlm
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-19 11:28:54 +01:00
Cesar Berrospi Ramis
7450050ace
refactor: upgrade BeautifulSoup4 with type hints (#999)
...
* refactor: upgrade BeautifulSoup4 with type hints
Upgrade dependency library BeautifulSoup4 to 4.13.3 (with type hints).
Refactor backends using BeautifulSoup4 to comply with type hints.
Apply style simplifications and improvements for consistency.
Remove variables and functions that are never used.
Remove code duplication between backends for parsing HTML tables.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* build: allow beautifulsoup4 version 4.12.3
Allow older version of beautifulsoup4 and ensure compatibility.
Update library dependencies.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-02-18 11:30:47 +01:00
github-actions[bot]
75db61127c
chore: bump version to 2.23.0 [skip ci]
2025-02-17 14:22:49 +00:00
Maxim Lysak
6e75f0b5d3
fix: Revise DocTags, fix iterate_items to output content_layer in items (#965)
...
* Testing fix for docling-core dt
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* fix: Fix code_formula test unit, update test-cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Fix code-formula model for new docling-core
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Update fixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test cases for office formats
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update deps and lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Clean up imports
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-17 14:11:55 +01:00
Ahmed Nassar
77eb77bdc2
feat: Support cuda:n GPU device allocation (#694)
...
* Adding multi-gpu support, and cuda device allocation
Signed-off-by: ahn <ahn@zurich.ibm.com>
* Fixes pydantic exception with cuda:n
Signed-off-by: ahn <ahn@zurich.ibm.com>
* Pydantic field validator and comment restored.
Signed-off-by: ahn <ahn@zurich.ibm.com>
* chore: Accept AcceleratorDevice enum type
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Reset some options to default, removed EasyOCR model wrap.
Signed-off-by: ahn <ahn@zurich.ibm.com>
* Fixed rebase issues
Signed-off-by: ahn <ahn@zurich.ibm.com>
* Revert accelerator test options
Signed-off-by: ahn <ahn@zurich.ibm.com>
---------
Signed-off-by: ahn <ahn@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: ahn <ahn@sonny.zuvela.ibm.com>
Co-authored-by: ahn <ahn@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-17 11:31:13 +01:00
Cesar Berrospi Ramis
428b656793
feat(xml-jats): parse XML JATS documents (#967)
...
* chore(xml-jats): separate authors and affiliations
In XML PubMed (JATS) backend, convert authors and affiliations as they
are typically rendered on PDFs.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* fix(xml-jats): replace new line character by a space
Instead of removing the newline character from text, replace it with a space character.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* feat(xml-jats): improve existing parser and extend features
Partially support lists, respect reading order, parse more sections, support equations, better text formatting.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore(xml-jats): rename PubMed objects to JATS
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-02-17 10:43:31 +01:00
Michele Dolfi
e1436a8b05
test: validate actual docitems in tests (#966)
...
* validate actual docitems in tests
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* remove verbose print
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* disable test generation
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-14 17:47:53 +01:00
github-actions[bot]
ffbde1d1b0
chore: bump version to 2.22.0 [skip ci]
2025-02-14 08:53:20 +00:00
Tobias Strebitzer
00d9405b0a
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)
...
* feat: Implement csv backend and format detection
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>
* test: Implement csv parsing and format tests
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>
* docs: Add example and CSV format documentation
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>
* feat: Add support for various CSV dialects and update documentation
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>
* feat: Add validation for delimiters and tests for inconsistent csv files
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>
---------
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>
2025-02-14 08:55:09 +01:00
Michele Dolfi
7493d5b01f
docs: update example Dockerfile with download CLI (#929)
...
update example Dockerfile with download CLI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-13 14:19:50 +01:00