* feat: add backend options support to document backends
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: enhance document backends with generic backend options and improve HTML image handling
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* Refactor tests for DeclarativeDocumentBackend
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(HTML): improve image caption handling and ensure backend options are set correctly
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix: enhance HTML backend image handling and add support for local file paths
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore: Add ground truth data for test data
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(HTML): skip loading SVG files in image data handling
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(html): simplify backend options and address gaps
Restrict backend options to DeclarativeDocumentBackend classes, and only when necessary.
Refactor caption parsing in 'img' elements and remove dummy text.
Replace deprecated annotations from the typing library with native types.
Replace typing annotations according to pydantic guidelines.
Add some documentation with pydantic annotations.
Fix diff issue with test files.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(html): add tests and fix bugs
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(html): refactor backend options
Move backend option classes to their own module within the datamodel package.
Rename 'source_location' to 'source_uri' in HTMLBackendOptions.
Rename 'image_fetch' to 'fetch_images' in HTMLBackendOptions.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* refactor(markdown): create a class for the markdown backend options
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* Rich tables support for HTML backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Decouple JATS backend from HTML backend; the way tables are created changed significantly
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* updated and added tests
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Refactored parse_table_data in html_backend into a few smaller functions
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Change scope of a few functions in html_backend.py, making them static where possible
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fix HTML tables that have tbody and/or thead; these tables are now properly supported
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
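A minimal sketch of the idea behind the tbody/thead fix above: collect rows by looking for 'tr' elements wherever they appear, so that optional 'thead', 'tbody' or 'tfoot' wrappers do not change the result. This is not the html_backend code. It uses the standard-library html.parser instead of BeautifulSoup, and the names TableRowCollector and table_rows are hypothetical. Nested tables and spans are not handled.

```python
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Collect cell text per <tr>, ignoring thead/tbody/tfoot wrappers."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # Start a new row no matter which wrapper element contains it.
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a cell contributes to the table content.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def table_rows(html: str) -> list[list[str]]:
    """Hypothetical helper: return the table as a list of rows of cell text."""
    parser = TableRowCollector()
    parser.feed(html)
    parser.close()
    return parser.rows
```

With this approach a table written with 'thead' and 'tbody' yields the same rows as the same table written with bare 'tr' elements.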
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
* fix(HTML): parse footer tag as a section in furniture
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(HTML): add test for body vs furniture in HTML parser.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* re-implement links for html backend.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
* fix inline groups in list items. write specific test for find_parent_annotation of _extract_text_and_hyperlink_recursively.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
* implement hack for images.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
---------
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
Markdown fixes:
- properly propagate section header levels
- improve handling of list subroots without text
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* added the contentlayer to html-backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the handle_image function
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted code of html backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* test(html): add more info if a test case fails
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* refactor(html): put parsed item in body if doc has no header
If an HTML document does not have any header tag, all parsed items are placed in
DoclingDocument's body content layer.
HTML paragraphs ('p' tags) are parsed as text items with paragraph label.
Update test ground truth according to the changes above.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
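A minimal sketch of the rule described in the commit above, under two assumptions that the message does not state: that "header tag" means 'h1' to 'h6', and that when a header exists, parsing starts in the furniture layer. The function name initial_content_layer and the plain strings "body" and "furniture" are hypothetical stand-ins, not the backend's API.

```python
from html.parser import HTMLParser

# Assumption: "header tag" refers to the heading elements h1..h6.
HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class _HeaderDetector(HTMLParser):
    """Record whether the document contains at least one heading element."""

    def __init__(self) -> None:
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self.found = True


def initial_content_layer(html: str) -> str:
    """Hypothetical helper: pick the layer in which parsing starts."""
    detector = _HeaderDetector()
    detector.feed(html)
    detector.close()
    # No header at all: everything goes to the body layer (as stated above).
    # Otherwise start in furniture (an assumption made for this sketch).
    return "furniture" if detector.found else "body"
```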
* chore: set TextItem label to 'text' instead of 'paragraph'
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* refactor: upgrade BeautifulSoup4 with type hints
Upgrade dependency library BeautifulSoup4 to 4.13.3 (with type hints).
Refactor backends using BeautifulSoup4 to comply with type hints.
Apply style simplifications and improvements for consistency.
Remove variables and functions that are never used.
Remove code duplication between backends for parsing HTML tables.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* build: allow beautifulsoup4 version 4.12.3
Allow older version of beautifulsoup4 and ensure compatibility.
Update library dependencies.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* Upgraded Layout Postprocessing, sending old code back to ERZ
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Implement hierarchical cluster layout processing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pass nested cluster processing through full pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pass nested clusters through GLM as payload
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Move to_docling_document from ds-glm to this repo
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Clean up imports again
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI.
- Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run.
- Introduce the accelerator_utils with a function to decide the device and resolve the AUTO setting.
- Refactor how the docling-ibm-models are called to match the new init signature of models.
- Translate the accelerator options to the specific inputs for third-party models.
- Extend the docling CLI with parameters to set the num_threads and device.
- Add new unit tests.
- Write a new example of how to use the accelerator options.
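A minimal sketch of what "resolve the AUTO setting" in the list above could look like. It is not the accelerator_utils implementation. The names Device and resolve_device are hypothetical, hardware availability is passed in as plain booleans instead of being probed, and the fallback to CPU for an unavailable device is an assumption.

```python
from enum import Enum


class Device(str, Enum):
    """Hypothetical stand-in for an accelerator device setting."""

    AUTO = "auto"
    CPU = "cpu"
    CUDA = "cuda"
    MPS = "mps"


def resolve_device(requested: Device, cuda_available: bool, mps_available: bool) -> Device:
    """Turn the requested setting into a concrete device."""
    if requested is Device.AUTO:
        # Prefer an available GPU backend, otherwise fall back to CPU.
        if cuda_available:
            return Device.CUDA
        if mps_available:
            return Device.MPS
        return Device.CPU
    # Assumption: an explicit request for unavailable hardware degrades to CPU.
    if requested is Device.CUDA and not cuda_available:
        return Device.CPU
    if requested is Device.MPS and not mps_available:
        return Device.CPU
    return requested
```

Passing availability as arguments keeps the decision logic testable without any GPU or third-party library present.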
* fix: Improve the pydantic objects in the pipeline_options and imports.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Updated test ground-truth
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updated test ground-truth (again), bugfix for empty layout
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Do proper check to set the device in EasyOCR, RapidOCR.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Correct the way to set GPU for EasyOCR, RapidOCR
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: OCR AcceleratorDevice
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Merge pull request #556 from DS4SD/cau/layout-processing-improvement
feat: layout processing improvements and bugfixes
* Update lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update HF model ref, reset test generate
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Repin to release package versions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Many layout processing improvements, add document index type
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update pinnings to docling-core
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix table box snapping
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes for cluster pre-ordering
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Introduce OCR confidence, propagate to orphan in post-processing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix form and key value area groups
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Adjust confidence in EasyOCR
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Roll back CLI changes from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update docling-core pinning
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Annoying fixes for historical python versions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updated test GT for legacy
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Comment cleanup
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>