13 Commits

Author SHA1 Message Date
Václav Vančura
9f86971fad Actor: Replace Docling CLI with docling-serve API
This commit transitions the Actor from using the full Docling CLI package to the more lightweight docling-serve API. Key changes include:

- Redesign Dockerfile to use docling-serve as base image
- Update actor.sh to communicate with API instead of running CLI commands
- Improve content type handling for various output formats
- Update input schema to align with API parameters
- Reduce Docker image size from ~6GB to ~600MB
- Update documentation and changelog to reflect architectural changes

The smaller image makes the Actor more cost-effective for users while preserving all existing functionality, including OCR capabilities.

Issue: No official docling-serve Docker image is currently available, which will be addressed in a future commit.
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:39:22 +01:00
Václav Vančura
193101e52c Actor: Fix the Apify call syntax and final result URL message
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:39:15 +01:00
Václav Vančura
5ecd4a48aa Actor: Normalize key-value store terminology
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:48 +01:00
Václav Vančura
7c486ce0cc Actor: Enhance error handling and data logging
- Add `apify pushData` calls to log errors when the document URL is missing or inaccessible.
- Introduce dataset record creation with processing results, including a success status and output file URL.
- Modify completion message to indicate successful processing and provide a link to the results.

Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:44 +01:00
Václav Vančura
1ca8251844 Actor: Add specific error codes for better error handling
- `ERR_INVALID_INPUT` for missing document URL
- `ERR_URL_INACCESSIBLE` for inaccessible URLs
- `ERR_DOCLING_FAILED` for Docling command failures
- `ERR_OUTPUT_MISSING` for missing or empty output files
- `ERR_STORAGE_FAILED` for failures in storing the output document

Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:40 +01:00
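
The error codes listed in commit 1ca8251844 could be wired into a shell script roughly as follows. This is a minimal sketch, not the Actor's actual code: the function names `error_message` and `error_record`, the message texts, and the JSON field names are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an error code from the commit message to a
# human-readable description. The message wording is an assumption.
error_message() {
    case "$1" in
        ERR_INVALID_INPUT)    echo "Document URL is missing" ;;
        ERR_URL_INACCESSIBLE) echo "Document URL is not accessible" ;;
        ERR_DOCLING_FAILED)   echo "Docling command failed" ;;
        ERR_OUTPUT_MISSING)   echo "Output file is missing or empty" ;;
        ERR_STORAGE_FAILED)   echo "Failed to store the output document" ;;
        *)                    echo "Unknown error" ;;
    esac
}

# Hypothetical helper: build a one-line JSON record for a given error code.
# The field names ("status", "error", "message") are assumptions.
error_record() {
    printf '{"status": "error", "error": "%s", "message": "%s"}\n' \
        "$1" "$(error_message "$1")"
}
```

A script might print `error_record ERR_OUTPUT_MISSING` to its log or pass the resulting string to whatever dataset call it uses before exiting with a non-zero status.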
Václav Vančura
493d405cab Actor: Fix quoting in DOC_CONVERT_CMD variable
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:33 +01:00
Václav Vančura
39bcc52c1b Actor: Add input document URL validation
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:30 +01:00
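
Commit 39bcc52c1b gives no detail on how the document URL is validated. One plausible shape for such a check, shown here only as a sketch, is a purely syntactic test that the value is non-empty and uses an HTTP(S) scheme. The function name `is_valid_url` is hypothetical, and the real script may also probe whether the URL is reachable.

```shell
#!/usr/bin/env bash
# Hypothetical syntactic check: accept only non-empty http:// or https:// URLs.
# This does not test whether the URL is actually reachable.
is_valid_url() {
    case "$1" in
        http://?*|https://?*) return 0 ;;
        *)                    return 1 ;;
    esac
}
```

A caller could use it as `is_valid_url "$DOCUMENT_URL" || exit 1`, where `DOCUMENT_URL` is likewise an assumed variable name.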
Václav Vančura
1ba5aeefdc Actor: Documentation update
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:26 +01:00
Václav Vančura
b745459a34 Actor: Enhance Dockerfile with additional utilities and env vars
- Add installation of `time` and `procps` packages for better resource monitoring.
- Set environment variables `PYTHONUNBUFFERED`, `MALLOC_ARENA_MAX`, and `EASYOCR_DOWNLOAD_CACHE` for improved performance.
- Create a cache directory for EasyOCR to optimize storage usage.

Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:04 +01:00
Václav Vančura
7d651eb61f Actor: Improve script logging and error handling
- Initialize log file at `/tmp/docling.log` and redirect all output to it
- Remove the exit-on-error trap; the trap now only logs the line number of each error
- Use temporary directory for timestamp file
- Capture Docling exit code and handle errors more gracefully
- Update log file references to use `LOG_FILE` variable
- Remove local log file during cleanup

Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:37:47 +01:00
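
The logging approach described in commit 7d651eb61f (a log file referenced through `LOG_FILE`, with the command's exit code captured instead of aborting the script) might look something like the sketch below. The helper name `run_logged` is an assumption, and the sketch appends per-command output to the log rather than redirecting the whole script's output as the commit describes.

```shell
#!/usr/bin/env bash
# Default log location taken from the commit message; can be overridden.
LOG_FILE="${LOG_FILE:-/tmp/docling.log}"

# Hypothetical helper: run a command, append its stdout and stderr to
# LOG_FILE, and return the command's exit code to the caller.
run_logged() {
    local exit_code=0
    "$@" >>"$LOG_FILE" 2>&1 || exit_code=$?
    if [ "$exit_code" -ne 0 ]; then
        echo "Command failed with exit code $exit_code: $*" >>"$LOG_FILE"
    fi
    return "$exit_code"
}
```

Because the exit code is returned rather than acted on, the caller decides whether a failure is fatal, for example `run_logged some_command || handle_failure $?`, where both names are placeholders.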
Václav Vančura
ff7d64b421 Actor: Improve shell script robustness and error handling
The shell script has been enhanced with better error handling, input validation, and cleanup procedures. Key improvements include:

- Add proper quoting around variables to prevent word splitting.
- Improve error messages and logging.
- Implement a cleanup trap to ensure temporary files are removed.
- Enhance validation of input parameters and output formats.
- Improve handling of the log file and its storage.
- Improve command execution with proper evaluation.
- Add comments for better readability and maintainability.
- Fix potential security issues through proper variable expansion.

Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:37:36 +01:00
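
Two of the points in commit ff7d64b421, quoting variables and removing temporary files through a trap, can be illustrated with a short sketch. It assumes a temporary working directory created with `mktemp -d`; the function name `run_with_cleanup` and the file name are hypothetical and do not come from the Actor's script.

```shell
#!/usr/bin/env bash
# Hypothetical example: do some work in a temporary directory that is
# always removed when the subshell exits, whether or not the work succeeds.
run_with_cleanup() {
    (
        work_dir="$(mktemp -d)"
        # Single quotes defer expansion until the trap runs; the inner
        # double quotes keep paths with spaces intact.
        trap 'rm -rf -- "$work_dir"' EXIT
        # Quoted expansion prevents word splitting on the space in the name.
        touch "$work_dir/output file.md"
        # Report the directory that was used (it is removed on exit).
        printf '%s\n' "$work_dir"
    )
}
```

Running the work inside a subshell keeps the `EXIT` trap local to this function, so it does not replace any trap the calling script has already installed.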
Václav Vančura
67e1129365 Actor: Documentation update
Signed-off-by: Václav Vančura <commit@vancura.dev>
Signed-off-by: Adam Kliment <adam@netmilk.net>
2025-03-13 10:36:46 +01:00
Václav Vančura
4d13bb2650 Actor: Initial implementation
Signed-off-by: Václav Vančura <commit@vancura.dev>
Signed-off-by: Adam Kliment <adam@netmilk.net>
2025-03-13 10:36:01 +01:00