Refactor `actor.sh` into separate functions for finding the Apify CLI, setting up a temporary environment, and cleaning it up. Introduce a new function, `get_actor_input()`, to handle input detection more robustly. Replace the inline Python conversion logic with an external script, `docling_processor.py`, which processes documents via the docling-serve API.
Signed-off-by: Václav Vančura <commit@vancura.dev>
This commit completely revamps the Actor implementation with four major improvements:
1) CRITICAL CHANGE: Switch to official docling-serve image
   * Now using quay.io/ds4sd/docling-serve-cpu:latest as base image
   * Eliminates need for custom docling installation
   * Ensures compatibility with latest docling-serve features
   * Provides more reliable and consistent document processing
2) Fix Apify Actor KVS storage issues:
   * Standardize key names to follow Apify conventions:
     - Change "OUTPUT_RESULT" to "OUTPUT"
     - Change "DOCLING_LOG" to "LOG"
   * Add proper multi-stage Docker build:
     - First stage builds dependencies including apify-cli
     - Second stage uses official image and adds only necessary tools
   * Fix permission issues in Docker container:
     - Set up proper user and directory permissions
     - Create writable directories for temporary files and models
     - Configure environment variables for proper execution
3) Solve EACCES permission errors during CLI version checks:
   * Create temporary HOME directory with proper write permissions
   * Set APIFY_DISABLE_VERSION_CHECK=1 environment variable
   * Add NODE_OPTIONS="--no-warnings" to silence Node.js process warnings
   * Support --no-update-notifier CLI flag when available
4) Improve code organization and reliability:
   * Create reusable upload_to_kvs() function for all KVS operations
   * Ensure log files are uploaded before tools directory is removed
   * Set proper MIME types based on output format
   * Add detailed error reporting and proper cleanup
   * Display final output URLs for easy verification
This major refactoring significantly improves reliability and maintainability by leveraging the official docling-serve image while solving persistent permission and storage issues. The Actor now properly follows Apify standards while providing a more robust document processing pipeline.
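For illustration, the temporary HOME setup from item 3 might look roughly like the sketch below. The function names `setup_temp_home` and `cleanup_temp_home` and the `TEMP_HOME` and `ORIGINAL_HOME` variables are hypothetical; only the two environment variables come from this commit, and the real `actor.sh` may differ.

```shell
# Sketch only: helper and variable names are assumptions, not the actual actor.sh code.

# Point HOME at a fresh writable directory so the CLI can write its
# configuration files without hitting EACCES.
setup_temp_home() {
    ORIGINAL_HOME="${HOME:-}"
    TEMP_HOME="$(mktemp -d)" || return 1
    export HOME="$TEMP_HOME"
    # Both variables below are the ones named in this commit message.
    export APIFY_DISABLE_VERSION_CHECK=1
    export NODE_OPTIONS="--no-warnings"
}

# Remove the temporary directory and restore the previous HOME.
cleanup_temp_home() {
    if [ -n "${TEMP_HOME:-}" ] && [ -d "$TEMP_HOME" ]; then
        rm -rf "$TEMP_HOME"
        export HOME="${ORIGINAL_HOME:-}"
    fi
}
```

`setup_temp_home` has to run in the current shell (not in a subshell or command substitution), otherwise the exported variables are lost when it returns.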
Signed-off-by: Václav Vančura <commit@vancura.dev>
This commit transitions the Actor from using the full Docling CLI package to the more lightweight docling-serve API. Key changes include:
- Redesign Dockerfile to use docling-serve as base image
- Update actor.sh to communicate with API instead of running CLI commands
- Improve content type handling for various output formats
- Update input schema to align with API parameters
- Reduce Docker image size from ~6GB to ~600MB
- Update documentation and changelog to reflect architectural changes
The image size reduction will make the Actor more cost-effective for users while maintaining all existing functionality including OCR capabilities.
Issue: No official docling-serve Docker image is currently available, which will be addressed in a future commit.
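A minimal sketch of the content type handling, assuming the output formats are named `md`, `json`, `html`, and `text`. Both the format names and the function name `content_type_for` are assumptions made for this example and are not taken from the actual script.

```shell
# Sketch only: format names and the function name are assumptions.

# Map an output format to the MIME type used when storing the result.
content_type_for() {
    case "$1" in
        md)   echo "text/markdown" ;;
        json) echo "application/json" ;;
        html) echo "text/html" ;;
        text) echo "text/plain" ;;
        # Unknown formats fall back to a generic binary type.
        *)    echo "application/octet-stream" ;;
    esac
}
```

Keeping the mapping in one function means any new output format only needs a single extra `case` branch.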
Signed-off-by: Václav Vančura <commit@vancura.dev>
Added detailed information about the Actor's output storage to the `README.md`. This includes specifying where processed documents, processing logs, and dataset records are stored.
Signed-off-by: Václav Vančura <commit@vancura.dev>
- Add `apify pushData` calls to log errors when the document URL is missing or inaccessible.
- Introduce dataset record creation with processing results, including a success status and output file URL.
- Modify completion message to indicate successful processing and provide a link to the results.
Signed-off-by: Václav Vančura <commit@vancura.dev>
Error codes for failure cases:
- `ERR_INVALID_INPUT` for missing document URL
- `ERR_URL_INACCESSIBLE` for inaccessible URLs
- `ERR_DOCLING_FAILED` for Docling command failures
- `ERR_OUTPUT_MISSING` for missing or empty output files
- `ERR_STORAGE_FAILED` for failures in storing the output document
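One way such codes could be turned into a dataset record is sketched below. The function name `error_record` and the JSON field names (`status`, `error_code`, `message`) are hypothetical; the commit message does not specify the record layout.

```shell
# Sketch only: the function name and JSON field names are assumptions.

# Print a one-line JSON error record for the given code and message.
error_record() {
    # Escape backslashes and double quotes so the message stays valid JSON.
    # Control characters such as newlines are not handled in this sketch.
    escaped="$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')"
    printf '{"status":"error","error_code":"%s","message":"%s"}\n' "$1" "$escaped"
}
```

For example, `error_record ERR_INVALID_INPUT "Missing document URL"` prints a single JSON line that could then be pushed to the dataset.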
Signed-off-by: Václav Vančura <commit@vancura.dev>
- Add installation of `time` and `procps` packages for better resource monitoring.
- Set environment variables `PYTHONUNBUFFERED`, `MALLOC_ARENA_MAX`, and `EASYOCR_DOWNLOAD_CACHE` for improved performance.
- Create a cache directory for EasyOCR to optimize storage usage.
Signed-off-by: Václav Vančura <commit@vancura.dev>
- Initialize log file at `/tmp/docling.log` and redirect all output to it
- Remove the exit-on-error behavior from the trap; it now only logs the line number of the error
- Use temporary directory for timestamp file
- Capture Docling exit code and handle errors more gracefully
- Update log file references to use `LOG_FILE` variable
- Remove local log file during cleanup
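The exit code handling could be sketched as follows. The helper names `log` and `run_logged` are hypothetical, and this sketch redirects output per command rather than for the whole script as the commit does; only the `LOG_FILE` variable and the `/tmp/docling.log` path come from the message above.

```shell
# Sketch only: helper names are assumptions, not the actual actor.sh code.

LOG_FILE="${LOG_FILE:-/tmp/docling.log}"
: > "$LOG_FILE"   # start each run with an empty log file

# Append a timestamped message to the log.
log() {
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$LOG_FILE"
}

# Run a command, send its stdout and stderr to the log, and return its
# exit code to the caller instead of aborting the script.
run_logged() {
    status=0
    "$@" >> "$LOG_FILE" 2>&1 || status=$?
    if [ "$status" -ne 0 ]; then
        log "Command failed with exit code $status: $*"
    fi
    return "$status"
}
```

A caller can then branch on the result, for example `run_logged some_command || rc=$?`, and decide how to report the failure.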
Signed-off-by: Václav Vančura <commit@vancura.dev>
The shell script has been enhanced with better error handling, input validation, and cleanup procedures. Key improvements include:
- Added proper quoting around variables to prevent word splitting.
- Improved error messages and logging functionality.
- Implemented a cleanup trap to ensure temporary files are removed.
- Enhanced validation of input parameters and output formats.
- Added better handling of the log file and its storage.
- Improved command execution with proper evaluation.
- Added comments for better code readability and maintenance.
- Fixed potential security issues with proper variable expansion.
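A small sketch of three of these points: a cleanup trap, input validation, and quoting. The names `WORK_DIR`, `cleanup`, `is_http_url`, and `count_args` are hypothetical and chosen for this example only.

```shell
# Sketch only: all names here are assumptions, not the actual actor.sh code.

# Create a scratch directory and make sure it is removed when the script exits.
WORK_DIR="$(mktemp -d)"
cleanup() {
    rm -rf "$WORK_DIR"
}
trap cleanup EXIT

# Accept only non-empty http:// or https:// URLs.
is_http_url() {
    case "$1" in
        http://?*|https://?*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the number of arguments received; used to show the effect of quoting.
count_args() {
    echo "$#"
}
```

With `value="a b"`, calling `count_args "$value"` passes one argument, while the unquoted `count_args $value` is split into two, which is the word splitting the quoting change guards against.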
Signed-off-by: Václav Vančura <commit@vancura.dev>
- Add the `ACTOR_PATH_IN_DOCKER_CONTEXT` argument to suppress an Apify-tooling-related warning.
- Improve readability with consistent formatting and spacing in RUN commands.
- Enhance security by properly setting up the `appuser` home directory and its permissions.
- Streamline directory structure and ownership for runtime operations.
- Remove redundant `.apify` directory creation as it's handled by the CLI.
Signed-off-by: Václav Vančura <commit@vancura.dev>
Add and configure `/home/appuser/.apify` directory with proper permissions for the appuser in the Docker container. This ensures the Apify SDK has a writable home directory for storing its configuration and temporary files.
Signed-off-by: Václav Vančura <commit@vancura.dev>
Upgrade pip and npm to their latest versions, and pin docling to 2.15.1 and apify-cli to 2.7.1 for better stability and reproducibility. Pinning helps prevent unexpected behavior from dependency updates and ensures consistent builds across environments.
Signed-off-by: Václav Vančura <commit@vancura.dev>
- Combine RUN commands to reduce image layers and overall size.
- Add non-root user `appuser` for improved security.
- Use `--no-install-recommends` flag to minimize installed packages.
- Install only necessary dependencies in a single RUN command.
- Maintain proper cleanup of package lists and caches.
Signed-off-by: Václav Vančura <commit@vancura.dev>
- Set proper ownership and permissions for runtime directory.
- Switch to non-root user for enhanced security.
- Use `--chown` flag in COPY commands to maintain correct file ownership.
- Ensure all files and directories are owned by `appuser`.
Signed-off-by: Václav Vančura <commit@vancura.dev>