333 Commits

Author SHA1 Message Date
Adam Kliment
1c9d8e29b0 Actor: Always output a zip
Signed-off-by: Adam Kliment <adam@netmilk.net>
2025-03-13 10:40:03 +01:00
Adam Kliment
7cd1f06868 Actor: Fixed input getter
Signed-off-by: Adam Kliment <adam@netmilk.net>
2025-03-13 10:39:50 +01:00
Václav Vančura
1fe80d3c23 Actor: Removing obsolete actor.json keys
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:39:46 +01:00
Václav Vančura
72077c109d Actor: Update CHANGELOG and README for Docker and API changes
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:39:42 +01:00
Václav Vančura
5f5c0a9d50 Actor: Refactor actor.sh and add docling_processor.py
Refactor the `actor.sh` script to modularize functions for finding the Apify CLI, setting up a temporary environment, and cleaning it up. Introduce a new function, `get_actor_input()`, to handle input detection more robustly. Replace inline Python conversion logic with an external script, `docling_processor.py`, for processing documents via the docling-serve API.

Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:39:38 +01:00
Václav Vančura
7a5dc3c438 Actor: Overhaul the implementation using official docling-serve image
This commit completely revamps the Actor implementation with four major improvements:

1) CRITICAL CHANGE: Switch to official docling-serve image
   * Now using quay.io/ds4sd/docling-serve-cpu:latest as base image
   * Eliminates need for custom docling installation
   * Ensures compatibility with latest docling-serve features
   * Provides more reliable and consistent document processing

2) Fix Apify Actor KVS storage issues:
   * Standardize key names to follow Apify conventions:
     - Change "OUTPUT_RESULT" to "OUTPUT"
     - Change "DOCLING_LOG" to "LOG"
   * Add proper multi-stage Docker build:
     - First stage builds dependencies including apify-cli
     - Second stage uses official image and adds only necessary tools
   * Fix permission issues in Docker container:
     - Set up proper user and directory permissions
     - Create writable directories for temporary files and models
     - Configure environment variables for proper execution

3) Solve EACCES permission errors during CLI version checks:
   * Create temporary HOME directory with proper write permissions
   * Set APIFY_DISABLE_VERSION_CHECK=1 environment variable
   * Add NODE_OPTIONS="--no-warnings" to suppress update checks
   * Support --no-update-notifier CLI flag when available

4) Improve code organization and reliability:
   * Create reusable upload_to_kvs() function for all KVS operations
   * Ensure log files are uploaded before tools directory is removed
   * Set proper MIME types based on output format
   * Add detailed error reporting and proper cleanup
   * Display final output URLs for easy verification

This major refactoring significantly improves reliability and maintainability by leveraging the official docling-serve image while solving persistent permission and storage issues. The Actor now properly follows Apify standards while providing a more robust document processing pipeline.

Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:39:30 +01:00
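
The "Set proper MIME types based on output format" item in the commit above could look roughly like the following Python sketch. The commit does not show the actual lookup, so the format names, the `CONTENT_TYPES` table, and the `content_type_for()` helper are all hypothetical; the real logic lives in `actor.sh` and may differ.

```python
# Hypothetical mapping from an output-format name to a MIME type.
# The format names are illustrative assumptions, not taken from actor.sh.
CONTENT_TYPES = {
    "md": "text/markdown",
    "json": "application/json",
    "html": "text/html",
    "text": "text/plain",
}

# Generic fallback used when the format is not in the table.
DEFAULT_CONTENT_TYPE = "application/octet-stream"


def content_type_for(output_format: str) -> str:
    """Return the MIME type for an output format, ignoring case and padding."""
    key = output_format.strip().lower()
    return CONTENT_TYPES.get(key, DEFAULT_CONTENT_TYPE)


if __name__ == "__main__":
    for fmt in ("md", "JSON", "unknown"):
        print(fmt, "->", content_type_for(fmt))
```

Falling back to a generic binary type keeps an upload from failing outright when an unexpected format name appears, at the cost of a less useful content type on the stored record.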
Václav Vančura
9f86971fad Actor: Replace Docling CLI with docling-serve API
This commit transitions the Actor from using the full Docling CLI package to the more lightweight docling-serve API. Key changes include:

- Redesign Dockerfile to use docling-serve as base image
- Update actor.sh to communicate with API instead of running CLI commands
- Improve content type handling for various output formats
- Update input schema to align with API parameters
- Reduce Docker image size from ~6GB to ~600MB
- Update documentation and changelog to reflect architectural changes

The image size reduction will make the Actor more cost-effective for users while maintaining all existing functionality including OCR capabilities.

Issue: No official docling-serve Docker image is currently available, which will be addressed in a future commit.
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:39:22 +01:00
Václav Vančura
11f2960907 Actor: Add section on Actors to README
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:39:19 +01:00
Václav Vančura
193101e52c Actor: Fix the Apify call syntax and final result URL message
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:39:15 +01:00
Václav Vančura
531c135899 Actor: Update README with output URL details
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:39:12 +01:00
Václav Vančura
5bbd1d34eb Actor: Adding dataset schema
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:39:09 +01:00
Václav Vančura
3cdb1b31c7 Actor: Adding CHANGELOG.md
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:39:03 +01:00
Václav Vančura
3245e1b8b7 Actor: Enhance README.md with output details
Added detailed information about the Actor's output storage to the `README.md`. This includes specifying where processed documents, processing logs, and dataset records are stored.

Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:51 +01:00
Václav Vančura
5ecd4a48aa Actor: Normalize key-value store terminology
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:48 +01:00
Václav Vančura
7c486ce0cc Actor: Enhance error handling and data logging
- Add `apify pushData` calls to log errors when the document URL is missing or inaccessible.
- Introduce dataset record creation with processing results, including a success status and output file URL.
- Modify completion message to indicate successful processing and provide a link to the results.

Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:44 +01:00
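
As an illustration of the dataset record described in the commit above (a success status plus the output file URL, or an error entry), here is a minimal Python sketch. The field names and the `build_record()` helper are assumptions made for this example; the Actor itself pushes its records from `actor.sh` via `apify pushData`, and its actual record layout is not shown in the commit message.

```python
import json
from typing import Optional


def build_record(
    document_url: str,
    output_url: Optional[str] = None,
    error: Optional[str] = None,
) -> dict:
    """Build one dataset record; all field names here are hypothetical."""
    if error is not None:
        # Failure: record what was requested and why it failed.
        return {"url": document_url, "status": "error", "error": error}
    # Success: record where the processed output can be found.
    return {"url": document_url, "status": "success", "output_file": output_url}


if __name__ == "__main__":
    ok = build_record("https://example.com/a.pdf", "https://example.com/OUTPUT")
    failed = build_record("https://example.com/b.pdf", error="document URL is inaccessible")
    # Serialized JSON is the kind of payload a pushData-style call would receive.
    print(json.dumps(ok))
    print(json.dumps(failed))
```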
Václav Vančura
1ca8251844 Actor: Add specific error codes for better error handling
- `ERR_INVALID_INPUT` for missing document URL
- `ERR_URL_INACCESSIBLE` for inaccessible URLs
- `ERR_DOCLING_FAILED` for Docling command failures
- `ERR_OUTPUT_MISSING` for missing or empty output files
- `ERR_STORAGE_FAILED` for failures in storing the output document

Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:40 +01:00
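
The five error codes listed in the commit above can be modelled as an enumeration. The sketch below is in Python rather than the shell used by `actor.sh`, and only the code strings come from the commit message. The `ActorError` class and the `check_input()` helper are hypothetical, and treating a malformed URL as `ERR_INVALID_INPUT` is an assumption (the commit mentions only a missing URL).

```python
from enum import Enum
from typing import Optional
from urllib.parse import urlparse


class ActorError(str, Enum):
    """Error codes named in the commit message; the class itself is illustrative."""

    INVALID_INPUT = "ERR_INVALID_INPUT"
    URL_INACCESSIBLE = "ERR_URL_INACCESSIBLE"
    DOCLING_FAILED = "ERR_DOCLING_FAILED"
    OUTPUT_MISSING = "ERR_OUTPUT_MISSING"
    STORAGE_FAILED = "ERR_STORAGE_FAILED"


def check_input(document_url: Optional[str]) -> Optional[ActorError]:
    """Return an error code for a missing or malformed URL, or None if it looks usable."""
    if not document_url or not document_url.strip():
        return ActorError.INVALID_INPUT
    parsed = urlparse(document_url.strip())
    # Assumption: only http(s) URLs with a host are accepted.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return ActorError.INVALID_INPUT
    return None


if __name__ == "__main__":
    for candidate in (None, "not a url", "https://example.com/doc.pdf"):
        result = check_input(candidate)
        print(repr(candidate), "->", result.value if result else "ok")
```

This check is purely syntactic. Deciding whether a URL is reachable (the `ERR_URL_INACCESSIBLE` case) would need a network request, which the sketch leaves out.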
Václav Vančura
cc334bf714 Actor: Documentation update
Remove the leading dollar signs from shell code examples, following the guidance at https://cirosantilli.com/markdown-style-guide/#dollar-signs-in-shell-code

Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:36 +01:00
Václav Vančura
493d405cab Actor: Fix quoting in DOC_CONVERT_CMD variable
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:33 +01:00
Václav Vančura
39bcc52c1b Actor: Adding input document URL validation
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:30 +01:00
Václav Vančura
1ba5aeefdc Actor: Documentation update
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:26 +01:00
Václav Vančura
31056d10e4 Actor: Fixing example PDF document URLs
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:22 +01:00
Václav Vančura
2d953663f4 Actor: Adding the "Run on Apify" button
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:19 +01:00
Václav Vančura
08eaa19b26 Actor: Adding the Apify FirstPromoter integration
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:15 +01:00
Václav Vančura
93db85506b Actor: README update
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:09 +01:00
Václav Vančura
b745459a34 Actor: Enhance Dockerfile with additional utilities and env vars
- Add installation of `time` and `procps` packages for better resource monitoring.
- Set environment variables `PYTHONUNBUFFERED`, `MALLOC_ARENA_MAX`, and `EASYOCR_DOWNLOAD_CACHE` for improved performance.
- Create a cache directory for EasyOCR to optimize storage usage.

Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:04 +01:00
Václav Vančura
1b6d4b5c50 Actor: README update
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:38:00 +01:00
Václav Vančura
e261111daa Actor: Adding README
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:37:56 +01:00
Václav Vančura
f064f762f5 Actor: Updating Docling to 2.17.0
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:37:51 +01:00
Václav Vančura
7d651eb61f Actor: Improve script logging and error handling
- Initialize log file at `/tmp/docling.log` and redirect all output to it
- Remove exit on error trap, now only logs error line numbers
- Use temporary directory for timestamp file
- Capture Docling exit code and handle errors more gracefully
- Update log file references to use `LOG_FILE` variable
- Remove local log file during cleanup

Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:37:47 +01:00
Václav Vančura
ff7d64b421 Actor: Improve shell script robustness and error handling
The shell script has been enhanced with better error handling, input validation, and cleanup procedures. Key improvements include:

- Added proper quoting around variables to prevent word splitting.
- Improved error messages and logging functionality.
- Implemented a cleanup trap to ensure temporary files are removed.
- Enhanced validation of input parameters and output formats.
- Added better handling of the log file and its storage.
- Improved command execution with proper evaluation.
- Added comments for better code readability and maintenance.
- Fixed potential security issues with proper variable expansion.

Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:37:36 +01:00
Václav Vančura
dde401d134 Actor: Update Docker configuration for improved security
- Add `ACTOR_PATH_IN_DOCKER_CONTEXT` argument to ignore the Apify-tooling related warning.
- Improve readability with consistent formatting and spacing in RUN commands.
- Enhance security by properly setting up appuser home directory and permissions.
- Streamline directory structure and ownership for runtime operations.
- Remove redundant `.apify` directory creation as it's handled by the CLI.

Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:37:31 +01:00
Václav Vančura
b2ac6cc218 Actor: Create Apify user home directory in Docker setup
Add and configure `/home/appuser/.apify` directory with proper permissions for the appuser in the Docker container. This ensures the Apify SDK has a writable home directory for storing its configuration and temporary files.

Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:37:26 +01:00
Václav Vančura
784571f9ce Actor: Fix apify-cli version problem
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:37:19 +01:00
Václav Vančura
4dce886b17 Actor: Update dependencies with fixed versions
Upgrade pip and npm to latest versions, pin docling to 2.15.1 and apify-cli to 2.7.1 for better stability and reproducibility. This change helps prevent unexpected behavior from dependency updates and ensures consistent builds across environments.

Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:37:14 +01:00
Václav Vančura
ac7c5053f0 Actor: Add Docker image metadata labels
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:37:10 +01:00
Václav Vančura
e1adc4ee8f Actor: Optimize Dockerfile with security and size improvements
- Combine RUN commands to reduce image layers and overall size.
- Add non-root user `appuser` for improved security.
- Use `--no-install-recommends` flag to minimize installed packages.
- Install only necessary dependencies in a single RUN command.
- Maintain proper cleanup of package lists and caches.

Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:37:04 +01:00
Václav Vančura
19f612c009 Actor: Enhance Docker security with proper user permissions
- Set proper ownership and permissions for runtime directory.
- Switch to non-root user for enhanced security.
- Use `--chown` flag in COPY commands to maintain correct file ownership.
- Ensure all files and directories are owned by `appuser`.

Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:36:57 +01:00
Václav Vančura
ae491b0516 Actor: Switching Docker to python:3.11-slim-bookworm
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:36:54 +01:00
Václav Vančura
67e1129365 Actor: Documentation update
Signed-off-by: Václav Vančura <commit@vancura.dev>
Signed-off-by: Adam Kliment <adam@netmilk.net>
2025-03-13 10:36:46 +01:00
Václav Vančura
66287e45a5 Actor: Moving the badge where it belongs
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:36:40 +01:00
Václav Vančura
cb56159b5d Actor: Adding the Actor badge
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:36:33 +01:00
Václav Vančura
352301b58d Actor: .dockerignore update
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:36:23 +01:00
Václav Vančura
4d13bb2650 Actor: Initial implementation
Signed-off-by: Václav Vančura <commit@vancura.dev>
Signed-off-by: Adam Kliment <adam@netmilk.net>
2025-03-13 10:36:01 +01:00
github-actions[bot]
235ae8765d chore: bump version to 2.15.1 [skip ci]
2025-03-13 10:35:53 +01:00
Christoph Auer
4200fb5632 fix: Improve OCR results, stricten criteria before dropping bitmap areas (#719)
fix: Properly care for all bitmap elements in OCR

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Adam Kliment <adam@netmilk.net>
2025-03-13 10:26:27 +01:00
Panos Vagenas
9a6b5c8c8d docs: add pointers to LangChain-side docs (#718)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-01-09 17:36:46 +01:00
Panos Vagenas
4fa8028bd8 docs: add LangChain docs (#717)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-01-09 14:12:05 +01:00
Michele Dolfi
e64b5a2f62 fix: allow earlier requests versions (#716)
allow earlier requests versions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-01-09 13:30:40 +01:00
github-actions[bot]
9a94b54f6c chore: bump version to 2.15.0 [skip ci]
2025-01-08 12:06:38 +00:00
Christoph Auer
5cb4cf6f19 fix: Correct scaling of debug visualizations, tune OCR (#700)
* fix: Correct scaling of debug visualizations, tune OCR

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: remove unused imports

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: Update docling-core

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-01-08 12:26:44 +01:00