docling/tests
Christoph Auer 60dc852f16
feat: Updated Layout processing with forms and key-value areas (#530)
* Upgraded Layout Postprocessing, sending old code back to ERZ

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Implement hierachical cluster layout processing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pass nested cluster processing through full pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pass nested clusters through GLM as payload

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Move to_docling_document from ds-glm to this repo

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Clean up imports again

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI.
- Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run.
- Introduce the accelerator_utils with function to decide the device and resolve the AUTO setting.
- Refactor the way how the docling-ibm-models are called to match the new init signature of models.
- Translate the accelerator options to the specific inputs for third-party models.
- Extend the docling CLI with parameters to set the num_threads and device.
- Add new unit tests.
- Write new example how to use the accelerator options.

* fix: Improve the pydantic objects in the pipeline_options and imports.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Updated test ground-truth

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updated test ground-truth (again), bugfix for empty layout

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Do proper check to set the device in EasyOCR, RapidOCR.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Correct the way to set GPU for EasyOCR, RapidOCR

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Ocr AccleratorDevice

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Merge pull request #556 from DS4SD/cau/layout-processing-improvement

feat: layout processing improvements and bugfixes

* Update lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update HF model ref, reset test generate

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Repin to release package versions

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Many layout processing improvements, add document index type

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update pinnings to docling-core

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix table box snapping

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for cluster pre-ordering

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Introduce OCR confidence, propagate to orphan in post-processing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix form and key value area groups

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Adjust confidence in EasyOcr

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Roll back CLI changes from main

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update docling-core pinning

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Annoying fixes for historical python versions

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updated test GT for legacy

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Comment cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-12-17 17:32:24 +01:00
..
data feat: Updated Layout processing with forms and key-value areas (#530) 2024-12-17 17:32:24 +01:00
data_scanned feat: Updated Layout processing with forms and key-value areas (#530) 2024-12-17 17:32:24 +01:00
__init__.py fix: Add unit tests (#51) 2024-08-30 14:08:20 +02:00
test_backend_asciidoc.py feat: Add pipeline timings and toggle visualization, establish debug settings (#183) 2024-10-30 15:04:19 +01:00
test_backend_docling_parse_v2.py chore: make tests lighter (#228) 2024-11-04 14:02:28 +01:00
test_backend_docling_parse.py chore: make tests lighter (#228) 2024-11-04 14:02:28 +01:00
test_backend_html.py fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
test_backend_msexcel.py feat: added excel backend (#334) 2024-11-19 12:21:17 +01:00
test_backend_msword.py fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
test_backend_patent_uspto.py feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
test_backend_pdfium.py chore: make tests lighter (#228) 2024-11-04 14:02:28 +01:00
test_backend_pptx.py feat: Extracting picture data for raster images found in PPTX (#349) 2024-11-18 15:22:28 +01:00
test_cli.py test: generate file from CLI in a temporary directory (#618) 2024-12-17 16:35:42 +01:00
test_e2e_conversion.py feat: Add pipeline timings and toggle visualization, establish debug settings (#183) 2024-10-30 15:04:19 +01:00
test_e2e_ocr_conversion.py feat(ocr): added support for RapidOCR engine (#415) 2024-11-27 13:57:41 +01:00
test_input_doc.py feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
test_interfaces.py fix: improve handling of disallowed formats (#429) 2024-12-03 12:45:32 +01:00
test_invalid_input.py fix: improve handling of disallowed formats (#429) 2024-12-03 12:45:32 +01:00
test_legacy_format_transform.py fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
test_options.py feat: Introduce support for GPU Accelerators (#593) 2024-12-13 17:45:22 +01:00
verify_utils.py feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning (#290) 2024-11-12 09:46:14 +01:00