- Add installation of `time` and `procps` packages for better resource monitoring.
- Set environment variables `PYTHONUNBUFFERED`, `MALLOC_ARENA_MAX`, and `EASYOCR_DOWNLOAD_CACHE` for improved performance.
- Create a cache directory for EasyOCR to optimize storage usage.
Signed-off-by: Václav Vančura <commit@vancura.dev>
- Initialize log file at `/tmp/docling.log` and redirect all output to it
- Remove exit on error trap, now only logs error line numbers
- Use temporary directory for timestamp file
- Capture Docling exit code and handle errors more gracefully
- Update log file references to use `LOG_FILE` variable
- Remove local log file during cleanup
Signed-off-by: Václav Vančura <commit@vancura.dev>
The shell script has been enhanced with better error handling, input validation, and cleanup procedures. Key improvements include:
- Added proper quoting around variables to prevent word splitting.
- Improved error messages and logging functionality.
- Implemented a cleanup trap to ensure temporary files are removed.
- Enhanced validation of input parameters and output formats.
- Added better handling of the log file and its storage.
- Improved command execution with proper evaluation.
- Added comments for better code readability and maintenance.
- Fixed potential security issues with proper variable expansion.
Signed-off-by: Václav Vančura <commit@vancura.dev>
- Add `ACTOR_PATH_IN_DOCKER_CONTEXT` argument to ignore the Apify-tooling related warning.
- Improve readability with consistent formatting and spacing in RUN commands.
- Enhance security by properly setting up appuser home directory and permissions.
- Streamline directory structure and ownership for runtime operations.
- Remove redundant `.apify` directory creation as it's handled by the CLI.
Signed-off-by: Václav Vančura <commit@vancura.dev>
Add and configure `/home/appuser/.apify` directory with proper permissions for the appuser in the Docker container. This ensures the Apify SDK has a writable home directory for storing its configuration and temporary files.
Signed-off-by: Václav Vančura <commit@vancura.dev>
Upgrade pip and npm to latest versions, pin docling to 2.15.1 and apify-cli to 2.7.1 for better stability and reproducibility. This change helps prevent unexpected behavior from dependency updates and ensures consistent builds across environments.
Signed-off-by: Václav Vančura <commit@vancura.dev>
- Combine RUN commands to reduce image layers and overall size.
- Add non-root user `appuser` for improved security.
- Use `--no-install-recommends` flag to minimize installed packages.
- Install only necessary dependencies in a single RUN command.
- Maintain proper cleanup of package lists and caches.
Signed-off-by: Václav Vančura <commit@vancura.dev>
- Set proper ownership and permissions for runtime directory.
- Switch to non-root user for enhanced security.
- Use `--chown` flag in COPY commands to maintain correct file ownership.
- Ensure all files and directories are owned by `appuser`.
Signed-off-by: Václav Vančura <commit@vancura.dev>
- Add error handling for images that cannot be loaded by Pillow
- Improve resilience when encountering corrupted or unsupported image formats
- Maintain processing of other slide elements even if an image fails to load
Signed-off-by: Tendo33 <sjf1998112@gmail.com>
* added http header support for document converter and cli
Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com>
* fixed formatting and typing issues
Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com>
* use pydantic to parse dict
suggested by @dolfim-ibm
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Luke Harrison <luke.harrison1@ibm.com>
---------
Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com>
Signed-off-by: Luke Harrison <luke.harrison1@ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
* Add OpenContracts as an open source project
OpenContracts now offers Docling as a document ingestion and parsing pipeline
Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com>
* Update mkdocs.yml
Added OpenContracts to the nav configs
Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com>
---------
Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com>
* Upgraded Layout Postprocessing, sending old code back to ERZ
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Implement hierachical cluster layout processing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pass nested cluster processing through full pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pass nested clusters through GLM as payload
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Move to_docling_document from ds-glm to this repo
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Clean up imports again
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI.
- Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run.
- Introduce the accelerator_utils with function to decide the device and resolve the AUTO setting.
- Refactor the way how the docling-ibm-models are called to match the new init signature of models.
- Translate the accelerator options to the specific inputs for third-party models.
- Extend the docling CLI with parameters to set the num_threads and device.
- Add new unit tests.
- Write new example how to use the accelerator options.
* fix: Improve the pydantic objects in the pipeline_options and imports.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Updated test ground-truth
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updated test ground-truth (again), bugfix for empty layout
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Do proper check to set the device in EasyOCR, RapidOCR.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Correct the way to set GPU for EasyOCR, RapidOCR
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Ocr AccleratorDevice
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Merge pull request #556 from DS4SD/cau/layout-processing-improvement
feat: layout processing improvements and bugfixes
* Update lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update HF model ref, reset test generate
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Repin to release package versions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Many layout processing improvements, add document index type
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update pinnings to docling-core
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix table box snapping
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes for cluster pre-ordering
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Introduce OCR confidence, propagate to orphan in post-processing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix form and key value area groups
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Adjust confidence in EasyOcr
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Roll back CLI changes from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update docling-core pinning
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Annoying fixes for historical python versions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updated test GT for legacy
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Comment cleanup
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
* feat: add PATENT_USPTO as input format
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: add USPTO backend parser
Add a backend implementation to parse patent applications and
grants from the United States Patent Office (USPTO).
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* refactor: change the name of the USPTO input format
Change the name of the patent USPTO input format to show the typical format (XML).
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* refactor: address several input formats with same mime type
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* refactor: group XML backend parsers in a subfolder
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore: add safe initialization of PatentUsptoDocumentBackend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* Update easyocr_model.py
Added this line of code to get recog_network of easyocr parameter
recog_network = self.options.recog_network
Signed-off-by: itsainii <aininawawii@gmail.com>
* Update pipeline_options.py
Added this line in EasyOcrOptions function
recog_network: Optional[str] = 'standard'
Signed-off-by: itsainii <aininawawii@gmail.com>
* Add Easyocr recog_network parameter
Signed-off-by: itsainii <aininawawii@gmail.com>
---------
Signed-off-by: itsainii <aininawawii@gmail.com>
* Upgraded Layout Postprocessing, sending old code back to ERZ
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Implement hierachical cluster layout processing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pass nested cluster processing through full pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pass nested clusters through GLM as payload
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Move to_docling_document from ds-glm to this repo
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Clean up imports again
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI.
- Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run.
- Introduce the accelerator_utils with function to decide the device and resolve the AUTO setting.
- Refactor the way how the docling-ibm-models are called to match the new init signature of models.
- Translate the accelerator options to the specific inputs for third-party models.
- Extend the docling CLI with parameters to set the num_threads and device.
- Add new unit tests.
- Write new example how to use the accelerator options.
* fix: Improve the pydantic objects in the pipeline_options and imports.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Updated test ground-truth
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updated test ground-truth (again), bugfix for empty layout
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Do proper check to set the device in EasyOCR, RapidOCR.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Rollback changes from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test gt
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Remove unused debug settings
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Review fixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Nail the accelerator defaults for MPS
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>