docling/.actor/README.md
Václav Vančura 9f86971fad Actor: Replace Docling CLI with docling-serve API
This commit transitions the Actor from using the full Docling CLI package to the more lightweight docling-serve API. Key changes include:

- Redesign Dockerfile to use docling-serve as base image
- Update actor.sh to communicate with API instead of running CLI commands
- Improve content type handling for various output formats
- Update input schema to align with API parameters
- Reduce Docker image size from ~6GB to ~600MB
- Update documentation and changelog to reflect architectural changes

The image size reduction will make the Actor more cost-effective for users while maintaining all existing functionality including OCR capabilities.

Issue: No official docling-serve Docker image is currently available, which will be addressed in a future commit.
Signed-off-by: Václav Vančura <commit@vancura.dev>
2025-03-13 10:39:22 +01:00

Docling Actor

This Actor (specification v1) wraps the Docling project to provide serverless document processing in the cloud. It can process complex documents (PDF, DOCX, images) and convert them into structured formats (Markdown, JSON, HTML, Text, or DocTags) with optional OCR support.

What are Actors?

Actors are serverless microservices running on the Apify Platform. They are based on the Actor SDK and can be found in the Apify Store. Learn more about Actors in the Apify Whitepaper.

Table of Contents

  1. Features
  2. Usage
  3. Input Parameters
  4. Output
  5. Performance & Resources
  6. Troubleshooting
  7. Local Development
  8. Architecture
  9. License
  10. Acknowledgments
  11. Security Considerations

Features

  • Leverages the lightweight docling-serve API for efficient document processing
  • Processes multiple document formats:
    • PDF documents (scanned or digital)
    • Microsoft Office files (DOCX, XLSX, PPTX)
    • Images (PNG, JPG, TIFF)
    • Other text-based formats
  • Provides OCR capabilities for scanned documents
  • Exports to multiple formats:
    • Markdown
    • JSON
    • HTML
    • Plain Text
    • DocTags (structured format)
  • No local setup needed: just provide input via a simple JSON config

Usage

Using Apify Console

  1. Go to the Apify Actor page.
  2. Click "Run".
  3. In the input form, fill in:
    • The URL of the document.
    • Output format (md, json, html, text, or doctags).
    • OCR boolean toggle.
  4. The Actor runs and stores its output in the default key-value store under the key OUTPUT_RESULT.

Using Apify API

curl --request POST \
  --url "https://api.apify.com/v2/acts/username~actorname/runs" \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer YOUR_API_TOKEN' \
  --data '{
    "documentUrl": "https://arxiv.org/pdf/2408.09869.pdf",
    "outputFormat": "md",
    "ocr": true
  }'
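
The same request can be prepared from Python. The following is a minimal sketch using only the standard library, assuming the Apify API v2 endpoint for starting a run is `/v2/acts/{actorId}/runs`; the Actor ID and token are the same placeholders used in the curl example, and the helper names `build_run_request` and `start_run` are illustrative, not part of any Apify SDK.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, actor_input: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts an Actor run."""
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{actor_id}/runs",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def start_run(request: urllib.request.Request) -> dict:
    """Send the prepared request and return the decoded JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_run_request(
    "username~actorname",  # placeholder Actor ID, as in the curl example
    "YOUR_API_TOKEN",      # placeholder token
    {
        "documentUrl": "https://arxiv.org/pdf/2408.09869.pdf",
        "outputFormat": "md",
        "ocr": True,
    },
)
# Calling start_run(request) would perform the actual network call.
```

Keeping request construction separate from sending makes it easy to inspect the URL, headers, and body before anything goes over the network.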

Using Apify CLI

apify call username/actorname --input='{
    "documentUrl": "https://arxiv.org/pdf/2408.09869.pdf",
    "outputFormat": "md",
    "ocr": true
}'

Input Parameters

The Actor accepts JSON input that conforms to the schema in .actor/input_schema.json. Below is a summary of the fields:

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| documentUrl | string | Yes | None | URL of the document (PDF, image, DOCX, etc.) to be processed. Must be directly accessible via public URL. |
| outputFormat | string | No | md | Desired output format. One of md, json, html, text, or doctags. |
| ocr | boolean | No | true | If set to true, OCR will be applied to scanned PDFs or images for text recognition. |

Example Input

{
    "documentUrl": "https://arxiv.org/pdf/2408.09869.pdf",
    "outputFormat": "md",
    "ocr": false
}
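
To catch mistakes before starting a run, the input can be checked on the client side. The sketch below mirrors the field summary above (required `documentUrl`, default `md` format, default `ocr` of true). It is an illustration only: the function name `normalize_input` is hypothetical, and the authoritative rules remain those in .actor/input_schema.json.

```python
ALLOWED_FORMATS = ("md", "json", "html", "text", "doctags")


def normalize_input(raw: dict) -> dict:
    """Validate Actor input and fill in the documented defaults."""
    url = raw.get("documentUrl")
    if not isinstance(url, str) or not url:
        raise ValueError("documentUrl is required and must be a non-empty string")

    # outputFormat is optional and defaults to Markdown.
    output_format = raw.get("outputFormat", "md")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {ALLOWED_FORMATS}")

    # ocr is optional and defaults to true.
    ocr = raw.get("ocr", True)
    if not isinstance(ocr, bool):
        raise ValueError("ocr must be a boolean")

    return {"documentUrl": url, "outputFormat": output_format, "ocr": ocr}
```

For example, passing only a `documentUrl` returns a dictionary with `outputFormat` set to `md` and `ocr` set to `True`.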

Output

The Actor provides three types of outputs:

  1. Processed Document - The Actor prints the direct URL of your result in the run log, for example:

    You can find your results at: 'https://api.apify.com/v2/key-value-stores/[YOUR_STORE_ID]/records/OUTPUT_RESULT'
    
  2. Processing Log - Available in the key-value store as DOCLING_LOG

  3. Dataset Record - Contains processing metadata with:

    • Input document URL
    • Direct link to the processed output
    • Processing status

You can access the results in several ways:

  1. Direct URL (shown in Actor run logs):

    https://api.apify.com/v2/key-value-stores/[STORE_ID]/records/OUTPUT_RESULT

  2. Programmatically via Apify CLI:

    apify key-value-stores get-value OUTPUT_RESULT

  3. Dataset - Check the "Dataset" tab in the Actor run details to see processing metadata
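
The direct URL follows a fixed pattern, so it can also be built and fetched from a script. This is a small sketch using the standard library; the helper names `record_url` and `fetch_record` are hypothetical, and the optional bearer token is an assumption for stores that are not publicly readable.

```python
import urllib.request
from typing import Optional
from urllib.parse import quote

API_BASE = "https://api.apify.com/v2"


def record_url(store_id: str, key: str = "OUTPUT_RESULT") -> str:
    """Return the direct URL of a record in a key-value store."""
    return f"{API_BASE}/key-value-stores/{quote(store_id, safe='')}/records/{quote(key, safe='')}"


def fetch_record(store_id: str, key: str = "OUTPUT_RESULT", token: Optional[str] = None) -> bytes:
    """Download a record's raw bytes (performs a network call)."""
    request = urllib.request.Request(record_url(store_id, key))
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return response.read()
```

The same helper works for the processing log by passing `DOCLING_LOG` as the key.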

Example Outputs

Markdown (md)

# Document Title

## Section 1
Content of section 1...

## Section 2
Content of section 2...

JSON

{
    "title": "Document Title",
    "sections": [
        {
            "level": 1,
            "title": "Section 1",
            "content": "Content of section 1..."
        }
    ]
}
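
Once downloaded, JSON output can be processed with any JSON library. The sketch below builds a plain-text outline, assuming the simplified shape shown in the example above (`title`, `sections`, `level`); real output from Docling may be structured differently, so treat the field names as illustrative.

```python
import json

# Sample copied from the simplified JSON example above.
SAMPLE = """
{
    "title": "Document Title",
    "sections": [
        {
            "level": 1,
            "title": "Section 1",
            "content": "Content of section 1..."
        }
    ]
}
"""


def outline(document: dict) -> list:
    """Return the document title followed by section titles indented by level."""
    lines = [document.get("title", "")]
    for section in document.get("sections", []):
        indent = "  " * section.get("level", 1)
        lines.append(f"{indent}{section.get('title', '')}")
    return lines


print("\n".join(outline(json.loads(SAMPLE))))
```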

HTML

<h1>Document Title</h1>
<h2>Section 1</h2>
<p>Content of section 1...</p>

Processing Logs (DOCLING_LOG)

The Actor maintains detailed processing logs including:

  • API request and response details
  • Processing steps and timing
  • Error messages and stack traces
  • Input validation results

Access logs via:

apify key-value-stores get-value DOCLING_LOG

Performance & Resources

  • Docker Image Size: ~600 MB
  • Memory Requirements:
    • Minimum: 2 GB RAM
    • Recommended: 4 GB RAM for large or complex documents
  • Processing Time:
    • Simple documents: 15-30 seconds
    • Complex PDFs with OCR: 1-3 minutes
    • Large documents (100+ pages): 3-10 minutes
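
When starting runs through the Apify API, memory and timeout can be set per run to match the guidance above. The following sketch assumes the run endpoint accepts `memory` (in megabytes, a power of two) and `timeout` (in seconds) as query parameters; the helper name `run_url` is hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def run_url(actor_id: str, memory_mb: int = 2048, timeout_secs: Optional[int] = None) -> str:
    """Build a run URL that requests a specific memory size and timeout."""
    # A power of two has exactly one bit set, so n & (n - 1) is zero.
    if memory_mb < 128 or memory_mb & (memory_mb - 1):
        raise ValueError("memory_mb must be a power of two, at least 128")
    params = {"memory": memory_mb}
    if timeout_secs is not None:
        params["timeout"] = timeout_secs
    return f"{API_BASE}/acts/{actor_id}/runs?{urlencode(params)}"


# 4 GB and a 10-minute limit for a large or complex document.
large_document_url = run_url("username~actorname", memory_mb=4096, timeout_secs=600)
```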

Troubleshooting

Common issues and solutions:

  1. Document URL Not Accessible

    • Ensure the URL is publicly accessible
    • Check if the document requires authentication
    • Verify the URL leads directly to the document
  2. OCR Processing Fails

    • Verify the document is not password-protected
    • Check if the image quality is sufficient
    • Try processing with OCR disabled
  3. API Response Issues

    • Check the logs for detailed error messages
    • Ensure the document format is supported
    • Verify the URL is correctly formatted
  4. Output Format Issues

    • Verify the output format is supported
    • Check if the document structure is compatible
    • Review the DOCLING_LOG for specific errors

Error Handling

The Actor implements comprehensive error handling:

  • Input validation for document URLs and parameters
  • Detailed error messages in DOCLING_LOG
  • Proper exit codes for different failure scenarios
  • Automatic cleanup on failure
  • Dataset records with processing status

Local Development

If you wish to develop or modify this Actor locally:

  1. Clone the repository.

  2. Ensure Docker is installed.

  3. The Actor files are located in the .actor directory:

    • Dockerfile - Defines the container environment
    • actor.json - Actor configuration and metadata
    • actor.sh - Main execution script
    • input_schema.json - Input parameter definitions
  4. Run the Actor locally using:

    apify run
    

Actor Structure

.actor/
├── Dockerfile          # Container definition
├── actor.json          # Actor metadata
├── actor.sh            # Execution script
├── input_schema.json   # Input parameters
└── README.md           # This documentation

Architecture

This Actor uses a lightweight architecture based on the official ds4sd/docling-serve Docker image:

  • Base Image: ds4sd/docling-serve:latest (~600MB)
  • API Communication: Uses the RESTful API provided by docling-serve on port 8080
  • Request Flow:
    1. Actor receives the input parameters
    2. Creates a JSON payload for the docling-serve API
    3. Makes a POST request to the /convert endpoint
    4. Processes the response and stores it in the key-value store
  • Dependencies:
    • Node.js for Apify CLI
    • Essential Linux tools (curl, jq, etc.)
  • Security: Runs as a non-root user for enhanced security
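
The request flow above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Actor's actual code (which lives in actor.sh): the port and `/convert` path come from the description above, while the payload key names (`source_url`, `output_format`, `ocr`) are hypothetical placeholders, since the real docling-serve request schema is not documented here.

```python
import json
import urllib.request

DOCLING_SERVE_URL = "http://localhost:8080"  # port taken from the description above


def build_payload(actor_input: dict) -> dict:
    """Step 2: translate Actor input into a request body.

    The key names are HYPOTHETICAL; the real docling-serve schema may differ.
    """
    return {
        "source_url": actor_input["documentUrl"],
        "output_format": actor_input.get("outputFormat", "md"),
        "ocr": actor_input.get("ocr", True),
    }


def build_convert_request(payload: dict) -> urllib.request.Request:
    """Step 3: prepare the POST request to the /convert endpoint."""
    return urllib.request.Request(
        url=f"{DOCLING_SERVE_URL}/convert",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def convert(actor_input: dict) -> bytes:
    """Steps 1 to 3 end to end: returns the raw response body for storage."""
    request = build_convert_request(build_payload(actor_input))
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Step 4, writing the response to the key-value store, is left out because it depends on the Apify tooling available inside the container.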

License

This wrapper project is released under the MIT License, matching the original Docling license. See LICENSE for details.

Acknowledgments

Security Considerations

  • Actor runs under a non-root user for enhanced security
  • Input URLs are validated before processing
  • Temporary files are securely managed and cleaned up
  • Process isolation through Docker containerization
  • Secure handling of processing artifacts