mirror of
https://github.com/DS4SD/docling.git
synced 2025-12-08 20:58:11 +00:00
feat(actor): Docling Actor on Apify infrastructure (#875)
* fix: Improve OCR results, stricten criteria before dropping bitmap areas (#719)

  fix: Properly care for all bitmap elements in OCR

  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
  Signed-off-by: Adam Kliment <adam@netmilk.net>

* chore: bump version to 2.15.1 [skip ci]

* Actor: Initial implementation

  Signed-off-by: Václav Vančura <commit@vancura.dev>
  Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: .dockerignore update

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding the Actor badge

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Moving the badge where it belongs

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Documentation update

  Signed-off-by: Václav Vančura <commit@vancura.dev>
  Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: Switching Docker to python:3.11-slim-bookworm

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance Docker security with proper user permissions

  - Set proper ownership and permissions for runtime directory.
  - Switch to non-root user for enhanced security.
  - Use `--chown` flag in COPY commands to maintain correct file ownership.
  - Ensure all files and directories are owned by `appuser`.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Optimize Dockerfile with security and size improvements

  - Combine RUN commands to reduce image layers and overall size.
  - Add non-root user `appuser` for improved security.
  - Use `--no-install-recommends` flag to minimize installed packages.
  - Install only necessary dependencies in a single RUN command.
  - Maintain proper cleanup of package lists and caches.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Add Docker image metadata labels

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update dependencies with fixed versions

  Upgrade pip and npm to latest versions, pin docling to 2.15.1 and apify-cli to 2.7.1 for better stability and reproducibility. This change helps prevent unexpected behavior from dependency updates and ensures consistent builds across environments.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fix apify-cli version problem

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Create Apify user home directory in Docker setup

  Add and configure `/home/appuser/.apify` directory with proper permissions for the appuser in the Docker container. This ensures the Apify SDK has a writable home directory for storing its configuration and temporary files.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update Docker configuration for improved security

  - Add `ACTOR_PATH_IN_DOCKER_CONTEXT` argument to ignore the Apify-tooling related warning.
  - Improve readability with consistent formatting and spacing in RUN commands.
  - Enhance security by properly setting up appuser home directory and permissions.
  - Streamline directory structure and ownership for runtime operations.
  - Remove redundant `.apify` directory creation as it's handled by the CLI.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Improve shell script robustness and error handling

  The shell script has been enhanced with better error handling, input validation, and cleanup procedures. Key improvements include:

  - Added proper quoting around variables to prevent word splitting.
  - Improved error messages and logging functionality.
  - Implemented a cleanup trap to ensure temporary files are removed.
  - Enhanced validation of input parameters and output formats.
  - Added better handling of the log file and its storage.
  - Improved command execution with proper evaluation.
  - Added comments for better code readability and maintenance.
  - Fixed potential security issues with proper variable expansion.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Improve script logging and error handling

  - Initialize log file at `/tmp/docling.log` and redirect all output to it
  - Remove exit on error trap, now only logs error line numbers
  - Use temporary directory for timestamp file
  - Capture Docling exit code and handle errors more gracefully
  - Update log file references to use `LOG_FILE` variable
  - Remove local log file during cleanup

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Updating Docling to 2.17.0

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding README

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: README update

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance Dockerfile with additional utilities and env vars

  - Add installation of `time` and `procps` packages for better resource monitoring.
  - Set environment variables `PYTHONUNBUFFERED`, `MALLOC_ARENA_MAX`, and `EASYOCR_DOWNLOAD_CACHE` for improved performance.
  - Create a cache directory for EasyOCR to optimize storage usage.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: README update

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding the Apify FirstPromoter integration

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding the "Run on Apify" button

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fixing example PDF document URLs

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Documentation update

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding input document URL validation

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fix quoting in `DOC_CONVERT_CMD` variable

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Documentation update

  Removing the dollar signs due to what we discovered at https://cirosantilli.com/markdown-style-guide/#dollar-signs-in-shell-code

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Add specific error codes for better error handling

  - `ERR_INVALID_INPUT` for missing document URL
  - `ERR_URL_INACCESSIBLE` for inaccessible URLs
  - `ERR_DOCLING_FAILED` for Docling command failures
  - `ERR_OUTPUT_MISSING` for missing or empty output files
  - `ERR_STORAGE_FAILED` for failures in storing the output document

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance error handling and data logging

  - Add `apify pushData` calls to log errors when the document URL is missing or inaccessible.
  - Introduce dataset record creation with processing results, including a success status and output file URL.
  - Modify completion message to indicate successful processing and provide a link to the results.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Normalize key-value store terminology

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance `README.md` with output details

  Added detailed information about the Actor's output storage to the `README.md`. This includes specifying where processed documents, processing logs, and dataset records are stored.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding CHANGELOG.md

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding dataset schema

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update README with output URL details

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fix the Apify call syntax and final result URL message

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Add section on Actors to README

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Replace Docling CLI with docling-serve API

  This commit transitions the Actor from using the full Docling CLI package to the more lightweight docling-serve API. Key changes include:

  - Redesign Dockerfile to use docling-serve as base image
  - Update actor.sh to communicate with API instead of running CLI commands
  - Improve content type handling for various output formats
  - Update input schema to align with API parameters
  - Reduce Docker image size from ~6GB to ~600MB
  - Update documentation and changelog to reflect architectural changes

  The image size reduction will make the Actor more cost-effective for users while maintaining all existing functionality including OCR capabilities.

  Issue: No official docling-serve Docker image is currently available, which will be addressed in a future commit.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Overhaul the implementation using official docling-serve image

  This commit completely revamps the Actor implementation with two major improvements:

  1) CRITICAL CHANGE: Switch to official docling-serve image
     * Now using quay.io/ds4sd/docling-serve-cpu:latest as base image
     * Eliminates need for custom docling installation
     * Ensures compatibility with latest docling-serve features
     * Provides more reliable and consistent document processing

  2) Fix Apify Actor KVS storage issues:
     * Standardize key names to follow Apify conventions:
       - Change "OUTPUT_RESULT" to "OUTPUT"
       - Change "DOCLING_LOG" to "LOG"
     * Add proper multi-stage Docker build:
       - First stage builds dependencies including apify-cli
       - Second stage uses official image and adds only necessary tools
     * Fix permission issues in Docker container:
       - Set up proper user and directory permissions
       - Create writable directories for temporary files and models
       - Configure environment variables for proper execution

  3) Solve EACCES permission errors during CLI version checks:
     * Create temporary HOME directory with proper write permissions
     * Set APIFY_DISABLE_VERSION_CHECK=1 environment variable
     * Add NODE_OPTIONS="--no-warnings" to suppress update checks
     * Support --no-update-notifier CLI flag when available

  4) Improve code organization and reliability:
     * Create reusable upload_to_kvs() function for all KVS operations
     * Ensure log files are uploaded before tools directory is removed
     * Set proper MIME types based on output format
     * Add detailed error reporting and proper cleanup
     * Display final output URLs for easy verification

  This major refactoring significantly improves reliability and maintainability by leveraging the official docling-serve image while solving persistent permission and storage issues. The Actor now properly follows Apify standards while providing a more robust document processing pipeline.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Refactor `actor.sh` and add `docling_processor.py`

  Refactor the `actor.sh` script to modularize functions for finding the Apify CLI, setting up a temporary environment, and cleaning it up. Introduce a new function, `get_actor_input()`, to handle input detection more robustly. Replace inline Python conversion logic with an external script, `docling_processor.py`, for processing documents via the docling-serve API.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update CHANGELOG and README for Docker and API changes

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Removing obsolete actor.json keys

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fixed input getter

  Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: Always output a zip

  Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: Resolving conflicts with main

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Resolving conflicts with main (pass 2)

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Updated main Readme and Actor Readme

  Signed-off-by: Adam Kliment <adam@netmilk.net>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Adam Kliment <adam@netmilk.net>
Signed-off-by: Václav Vančura <commit@vancura.dev>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Adam Kliment <adam@netmilk.net>
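The message above mentions setting MIME types based on the output format. A minimal sketch of such a mapping follows, assuming a hypothetical helper named `content_type_for` and the format names `md`, `json`, `html`, `text`, and `zip`. Neither the helper nor this exact list is taken from the commit; the script in this commit always uploads `application/zip`.

```shell
# Hypothetical helper (not part of the commit): map an output format name
# to the MIME type that would be passed as --contentType.
content_type_for() {
    case "$1" in
        md)   echo "text/markdown" ;;
        json) echo "application/json" ;;
        html) echo "text/html" ;;
        text) echo "text/plain" ;;
        zip)  echo "application/zip" ;;
        *)    echo "application/octet-stream" ;;  # unknown format: generic binary
    esac
}
```

Under these assumptions, `content_type_for zip` prints `application/zip`, which is the value the script passes to `upload_to_kvs` for the output archive.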
419
.actor/actor.sh
Executable file
@@ -0,0 +1,419 @@
#!/bin/bash

export PATH=$PATH:/build-files/node_modules/.bin

# Function to upload content to the key-value store
upload_to_kvs() {
    local content_file="$1"
    local key_name="$2"
    local content_type="$3"
    local description="$4"

    # Find the Apify CLI command
    find_apify_cmd
    local apify_cmd="$FOUND_APIFY_CMD"

    if [ -n "$apify_cmd" ]; then
        echo "Uploading $description to key-value store (key: $key_name)..."

        # Create a temporary home directory with write permissions
        setup_temp_environment

        # Use the --no-update-notifier flag if available
        # ("--" stops grep from parsing the pattern as an option)
        if $apify_cmd --help | grep -q -- "--no-update-notifier"; then
            if $apify_cmd --no-update-notifier actor:set-value "$key_name" --contentType "$content_type" < "$content_file"; then
                echo "Successfully uploaded $description to key-value store"
                local url="https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/$key_name"
                echo "$description available at: $url"
                cleanup_temp_environment
                return 0
            fi
        else
            # Fall back to regular command if flag isn't available
            if $apify_cmd actor:set-value "$key_name" --contentType "$content_type" < "$content_file"; then
                echo "Successfully uploaded $description to key-value store"
                local url="https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/$key_name"
                echo "$description available at: $url"
                cleanup_temp_environment
                return 0
            fi
        fi

        echo "ERROR: Failed to upload $description to key-value store"
        cleanup_temp_environment
        return 1
    else
        echo "ERROR: Apify CLI not found for $description upload"
        return 1
    fi
}

# Function to find Apify CLI command
find_apify_cmd() {
    FOUND_APIFY_CMD=""
    for cmd in "apify" "actor" "/usr/local/bin/apify" "/usr/bin/apify" "/opt/apify/cli/bin/apify"; do
        if command -v "$cmd" &> /dev/null; then
            FOUND_APIFY_CMD="$cmd"
            break
        fi
    done
}

# Function to set up temporary environment for Apify CLI
setup_temp_environment() {
    export TMPDIR="/tmp/apify-home-${RANDOM}"
    mkdir -p "$TMPDIR"
    export APIFY_DISABLE_VERSION_CHECK=1
    export NODE_OPTIONS="--no-warnings"
    export HOME="$TMPDIR"  # Override home directory to writable location
}

# Function to clean up temporary environment
cleanup_temp_environment() {
    rm -rf "$TMPDIR" 2>/dev/null || true
}

# Function to push data to Apify dataset
push_to_dataset() {
    # Example usage: push_to_dataset "$RESULT_URL" "$OUTPUT_SIZE" "zip"

    local result_url="$1"
    local size="$2"
    local format="$3"

    # Find Apify CLI command
    find_apify_cmd
    local apify_cmd="$FOUND_APIFY_CMD"

    if [ -n "$apify_cmd" ]; then
        echo "Adding record to dataset..."
        setup_temp_environment

        # Use the --no-update-notifier flag if available
        # ("--" stops grep from parsing the pattern as an option)
        if $apify_cmd --help | grep -q -- "--no-update-notifier"; then
            if $apify_cmd --no-update-notifier actor:push-data "{\"output_file\": \"${result_url}\", \"format\": \"${format}\", \"size\": \"${size}\", \"status\": \"success\"}"; then
                echo "Successfully added record to dataset"
            else
                echo "Warning: Failed to add record to dataset"
            fi
        else
            # Fall back to regular command
            if $apify_cmd actor:push-data "{\"output_file\": \"${result_url}\", \"format\": \"${format}\", \"size\": \"${size}\", \"status\": \"success\"}"; then
                echo "Successfully added record to dataset"
            else
                echo "Warning: Failed to add record to dataset"
            fi
        fi

        cleanup_temp_environment
    fi
}


# --- Setup logging and error handling ---

LOG_FILE="/tmp/docling.log"
touch "$LOG_FILE" || {
    echo "Fatal: Cannot create log file at $LOG_FILE"
    exit 1
}

# Log to both console and file
exec 1> >(tee -a "$LOG_FILE")
exec 2> >(tee -a "$LOG_FILE" >&2)

# Exit codes
readonly ERR_API_UNAVAILABLE=15
readonly ERR_INVALID_INPUT=16


# --- Debug environment ---

echo "Date: $(date)"
echo "Python version: $(python --version 2>&1)"
echo "Docling-serve path: $(which docling-serve 2>/dev/null || echo 'Not found')"
echo "Working directory: $(pwd)"

# --- Get input ---

echo "Getting Apify Actor Input"
INPUT=$(apify actor get-input 2>/dev/null)

# --- Setup tools ---

echo "Setting up tools..."
TOOLS_DIR="/tmp/docling-tools"
mkdir -p "$TOOLS_DIR"

# Copy tools if available
if [ -d "/build-files" ]; then
    echo "Copying tools from /build-files..."
    cp -r /build-files/* "$TOOLS_DIR/"
    export PATH="$TOOLS_DIR/bin:$PATH"
else
    echo "Warning: No build files directory found. Some tools may be unavailable."
fi

# Copy Python processor script to tools directory
PYTHON_SCRIPT_PATH="$(dirname "$0")/docling_processor.py"
if [ -f "$PYTHON_SCRIPT_PATH" ]; then
    echo "Copying Python processor script to tools directory..."
    cp "$PYTHON_SCRIPT_PATH" "$TOOLS_DIR/"
    chmod +x "$TOOLS_DIR/docling_processor.py"
else
    echo "ERROR: Python processor script not found at $PYTHON_SCRIPT_PATH"
    exit 1
fi

# Check OCR directories and ensure they're writable
echo "Checking OCR directory permissions..."
OCR_DIR="/opt/app-root/src/.EasyOCR"
if [ -d "$OCR_DIR" ]; then
    # Test if we can write to the directory
    if touch "$OCR_DIR/test_write" 2>/dev/null; then
        echo "[✓] OCR directory is writable"
        rm "$OCR_DIR/test_write"
    else
        echo "[✗] OCR directory is not writable, setting up alternative in /tmp"

        # Create alternative in /tmp (which is writable)
        mkdir -p "/tmp/.EasyOCR/user_network"
        export EASYOCR_MODULE_PATH="/tmp/.EasyOCR"
    fi
else
    echo "OCR directory not found, creating in /tmp"
    mkdir -p "/tmp/.EasyOCR/user_network"
    export EASYOCR_MODULE_PATH="/tmp/.EasyOCR"
fi


# --- Starting the API ---

echo "Starting docling-serve API..."

# Create a dedicated working directory in /tmp (writable)
API_DIR="/tmp/docling-api"
mkdir -p "$API_DIR"
cd "$API_DIR"
echo "API working directory: $(pwd)"

# Find docling-serve executable
DOCLING_SERVE_PATH=$(which docling-serve)
echo "Docling-serve executable: $DOCLING_SERVE_PATH"

# Start the API with minimal parameters to avoid any issues
echo "Starting docling-serve API..."
"$DOCLING_SERVE_PATH" run --host 0.0.0.0 --port 5001 > "$API_DIR/docling-serve.log" 2>&1 &
API_PID=$!
echo "Started docling-serve API with PID: $API_PID"

# A more reliable wait for API startup
echo "Waiting for API to initialize..."
MAX_TRIES=30
tries=0
started=false

while [ $tries -lt $MAX_TRIES ]; do
    tries=$((tries + 1))

    # Check if process is still running
    if ! ps -p $API_PID > /dev/null; then
        echo "ERROR: docling-serve API process terminated unexpectedly after $tries seconds"
        break
    fi

    # Check log for startup completion or errors
    if grep -q "Application startup complete" "$API_DIR/docling-serve.log" 2>/dev/null; then
        echo "[✓] API startup completed successfully after $tries seconds"
        started=true
        break
    fi

    if grep -q "Permission denied\|PermissionError" "$API_DIR/docling-serve.log" 2>/dev/null; then
        echo "ERROR: Permission errors detected in API startup"
        break
    fi

    # Sleep and check again
    sleep 1

    # Output a progress indicator every 5 seconds
    if [ $((tries % 5)) -eq 0 ]; then
        echo "Still waiting for API startup... ($tries/$MAX_TRIES seconds)"
    fi
done

# Show log content regardless of outcome
echo "docling-serve log output so far:"
tail -n 20 "$API_DIR/docling-serve.log"

# Verify the API is running
if ! ps -p $API_PID > /dev/null; then
    echo "ERROR: docling-serve API failed to start"
    if [ -f "$API_DIR/docling-serve.log" ]; then
        echo "Full log output:"
        cat "$API_DIR/docling-serve.log"
    fi
    exit $ERR_API_UNAVAILABLE
fi

if [ "$started" != "true" ]; then
    echo "WARNING: API process is running but startup completion was not detected"
    echo "Will attempt to continue anyway..."
fi

# Try to verify API is responding at this point
echo "Verifying API responsiveness..."
(python -c "
import sys, time, socket
for i in range(5):
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(1)
        result = s.connect_ex(('localhost', 5001))
        if result == 0:
            s.close()
            print('Port 5001 is open and accepting connections')
            sys.exit(0)
        s.close()
    except Exception as e:
        pass
    time.sleep(1)
print('Could not connect to API port after 5 attempts')
sys.exit(1)
" && echo "API verification succeeded") || echo "API verification failed, but continuing anyway"

# Define API endpoint
DOCLING_API_ENDPOINT="http://localhost:5001/v1alpha/convert/source"


# --- Processing document ---

echo "Starting document processing..."
echo "Reading input from Apify..."

echo "Input content:" >&2
echo "$INPUT" >&2  # Send the raw input to stderr for debugging
echo "$INPUT"      # Send the clean JSON to stdout for processing

# Create the request JSON
# ($INPUT is quoted so the JSON is not word-split or glob-expanded)

REQUEST_JSON=$(echo "$INPUT" | jq '.options += {"return_as_file": true}')

echo "Creating request JSON:" >&2
echo "$REQUEST_JSON" >&2
echo "$REQUEST_JSON" > "$API_DIR/request.json"


# Send the conversion request using our Python script
#echo "Sending conversion request to docling-serve API..."
#python "$TOOLS_DIR/docling_processor.py" \
#    --api-endpoint "$DOCLING_API_ENDPOINT" \
#    --request-json "$API_DIR/request.json" \
#    --output-dir "$API_DIR" \
#    --output-format "$OUTPUT_FORMAT"

echo "Curl the Docling API"
curl -s -H "content-type: application/json" -X POST --data-binary "@$API_DIR/request.json" -o "$API_DIR/output.zip" "$DOCLING_API_ENDPOINT"

CURL_EXIT_CODE=$?
if [ "$CURL_EXIT_CODE" -ne 0 ]; then
    echo "WARNING: curl exited with code $CURL_EXIT_CODE" >&2
fi

# --- Check for the output file ---

echo "Checking for output files..."
if [ -f "$API_DIR/output.zip" ]; then
    echo "Conversion completed successfully! Output file found."

    # Get the size of the converted file
    OUTPUT_SIZE=$(wc -c < "$API_DIR/output.zip")
    echo "Output file found with size: $OUTPUT_SIZE bytes"

    # Calculate the access URL for result display
    RESULT_URL="https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/OUTPUT"

    echo "=============================="
    echo "PROCESSING COMPLETE!"
    echo "Output size: ${OUTPUT_SIZE} bytes"
    echo "=============================="

    # The output is always a ZIP archive
    CONTENT_TYPE="application/zip"

    # Upload the document content using our function
    upload_to_kvs "$API_DIR/output.zip" "OUTPUT" "$CONTENT_TYPE" "Document content"

    # Only proceed with dataset record if document upload succeeded
    if [ $? -eq 0 ]; then
        echo "Your document is available at: ${RESULT_URL}"
        echo "=============================="

        # Push data to dataset
        push_to_dataset "$RESULT_URL" "$OUTPUT_SIZE" "zip"
    fi
else
    echo "ERROR: No converted output file found at $API_DIR/output.zip"

    # Create error metadata
    # (DOCUMENT_URL is not set anywhere in this script, so the field is empty
    # unless the environment provides it)
    ERROR_METADATA="{\"status\":\"error\",\"error\":\"No converted output file found\",\"documentUrl\":\"$DOCUMENT_URL\"}"
    mkdir -p "/tmp/actor-output"  # the directory is not created elsewhere
    echo "$ERROR_METADATA" > "/tmp/actor-output/OUTPUT"
    chmod 644 "/tmp/actor-output/OUTPUT"

    echo "Error information has been saved to /tmp/actor-output/OUTPUT"
fi


# --- Verify output files for debugging ---

echo "=== Final Output Verification ==="
echo "Files in /tmp/actor-output:"
ls -la /tmp/actor-output/ 2>/dev/null || echo "Cannot list /tmp/actor-output/"

echo "All operations completed. The output should be available in the default key-value store."
echo "Content URL: ${RESULT_URL:-No URL available}"


# --- Cleanup function ---

cleanup() {
    echo "Running cleanup..."

    # Stop the API process
    if [ -n "$API_PID" ]; then
        echo "Stopping docling-serve API (PID: $API_PID)..."
        kill $API_PID 2>/dev/null || true
    fi

    # Export log file to KVS if it exists
    # DO THIS BEFORE REMOVING TOOLS DIRECTORY
    if [ -f "$LOG_FILE" ]; then
        if [ -s "$LOG_FILE" ]; then
            echo "Log file is not empty, pushing to key-value store (key: LOG)..."

            # Upload log using our function
            upload_to_kvs "$LOG_FILE" "LOG" "text/plain" "Log file"
        else
            echo "Warning: log file exists but is empty"
        fi
    else
        echo "Warning: No log file found"
    fi

    # Clean up temporary files AFTER log is uploaded
    echo "Cleaning up temporary files..."
    if [ -d "$API_DIR" ]; then
        echo "Removing API working directory: $API_DIR"
        rm -rf "$API_DIR" 2>/dev/null || echo "Warning: Failed to remove $API_DIR"
    fi

    if [ -d "$TOOLS_DIR" ]; then
        echo "Removing tools directory: $TOOLS_DIR"
        rm -rf "$TOOLS_DIR" 2>/dev/null || echo "Warning: Failed to remove $TOOLS_DIR"
    fi

    # Keep log file until the very end
    echo "Script execution completed at $(date)"
    echo "Actor execution completed"
}

# Register cleanup
# (installed on the last line, so it runs only when the script reaches this point)
trap cleanup EXIT
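
The script installs its `EXIT` trap on the last line, so the earlier `exit` calls end the script before any trap exists. A minimal sketch of the alternative ordering follows, assuming the hypothetical names `cleanup_demo` and `run_with_cleanup`, which do not appear in the commit. The trap is installed before any work, so it still fires on an early exit.

```shell
# Hypothetical stand-in for the script's cleanup() function.
cleanup_demo() {
    echo "cleanup ran"
}

# The function body is a subshell, so the trap and the exit stay local to it.
run_with_cleanup() (
    trap cleanup_demo EXIT   # installed first, before anything can fail
    exit 15                  # simulated early failure (same value as ERR_API_UNAVAILABLE)
)
```

Calling `run_with_cleanup` runs the handler and still returns the original exit status, because the trap itself does not call `exit`.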
|
||||