Actor: Update CHANGELOG and README for Docker and API changes

Signed-off-by: Václav Vančura <commit@vancura.dev>
Václav Vančura 2025-03-09 16:07:17 +01:00 committed by Adam Kliment
parent 5f5c0a9d50
commit 72077c109d
3 changed files with 63 additions and 26 deletions

@ -5,20 +5,38 @@ All notable changes to the Docling Actor will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [1.1.0] - 2025-03-15
## [1.1.0] - 2025-03-09
### Changed
- Switched from full Docling CLI to docling-serve API
- Dramatically reduced Docker image size (from ~6GB to ~600MB)
- Using the official quay.io/ds4sd/docling-serve-cpu Docker image
- Reduced Docker image size (from ~6GB to ~4GB)
- Implemented multi-stage Docker build to handle dependencies
- Improved Docker build process to ensure compatibility with docling-serve-cpu image
- Added new Python processor script for reliable API communication and content extraction
- Enhanced response handling with better content extraction logic
- Fixed ES modules compatibility issue with Apify CLI
- Added explicit tmpfs volume for temporary files
- Fixed environment variables format in actor.json
- Created optimized dependency installation approach
- Improved API compatibility with docling-serve
- Updated endpoint from custom `/convert` to standard `/v1alpha/convert/source`
- Revised JSON payload structure to match docling-serve API format
- Added proper output field parsing based on format
- Enhanced startup process with health checks
- Added configurable API host and port through environment variables
- Better content type handling for different output formats
- Updated error handling to align with API responses
### Fixed
- Fixed actor input file conflict in get_actor_input(): now checks for and removes an existing /tmp/actor-input/INPUT directory if found, ensuring valid JSON input parsing.
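The fix above can be illustrated with a minimal Python sketch. It assumes the conflict arises when a directory occupies the path where the input JSON file is expected; the helper name `clear_stale_input_dir` is hypothetical and is not taken from the Actor's scripts, which may implement this differently (for example in shell).

```python
import shutil
import tempfile
from pathlib import Path


def clear_stale_input_dir(input_path: Path) -> bool:
    """Remove input_path if it is a real directory.

    Returns True when a directory was removed, False otherwise.
    Hypothetical helper: the name and signature are assumptions.
    """
    # A symlink to a directory is left alone; shutil.rmtree refuses symlinks.
    if input_path.is_dir() and not input_path.is_symlink():
        shutil.rmtree(input_path)
        return True
    return False


if __name__ == "__main__":
    # Demonstrate on a throwaway location instead of /tmp/actor-input/INPUT.
    with tempfile.TemporaryDirectory() as tmp:
        stale = Path(tmp) / "INPUT"
        stale.mkdir()
        print(clear_stale_input_dir(stale))  # a directory was in the way
        stale.write_text('{"documentUrl": "https://example.com/document.pdf"}')
        print(clear_stale_input_dir(stale))  # a regular file is kept
```

Once the directory is gone, the input can be written to the same path as a regular file and parsed as JSON.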
### Technical Details
- Actor Specification v1
- Using ds4sd/docling-serve:latest base image
- Using quay.io/ds4sd/docling-serve-cpu:latest base image
- Node.js 20.x for Apify CLI
- Eliminated Python dependencies
- Simplified Docker build process

@ -24,7 +24,7 @@ This Actor (specification v1) wraps the [Docling project](https://ds4sd.github.i
## Features
- Leverages the lightweight docling-serve API for efficient document processing
- Leverages the official docling-serve-cpu Docker image for efficient document processing
- Processes multiple document formats:
- PDF documents (scanned or digital)
- Microsoft Office files (DOCX, XLSX, PPTX)
@ -49,7 +49,7 @@ This Actor (specification v1) wraps the [Docling project](https://ds4sd.github.i
- The URL of the document.
- Output format (`md`, `json`, `html`, `text`, or `doctags`).
- OCR boolean toggle.
4. The Actor will run and produce its outputs in the default key-value store under the key `OUTPUT_RESULT`.
4. The Actor will run and produce its outputs in the default key-value store under the key `OUTPUT`.
### Using Apify API
@ -102,7 +102,7 @@ The Actor provides three types of outputs:
1. **Processed Document** - The Actor will provide the direct URL to your result in the run log, looking like:
```text
You can find your results at: 'https://api.apify.com/v2/key-value-stores/[YOUR_STORE_ID]/records/OUTPUT_RESULT'
You can find your results at: 'https://api.apify.com/v2/key-value-stores/[YOUR_STORE_ID]/records/OUTPUT'
```
2. **Processing Log** - Available in the key-value store as `DOCLING_LOG`
@ -117,13 +117,13 @@ You can access the results in several ways:
1. **Direct URL** (shown in Actor run logs):
```text
https://api.apify.com/v2/key-value-stores/[STORE_ID]/records/OUTPUT_RESULT
https://api.apify.com/v2/key-value-stores/[STORE_ID]/records/OUTPUT
```
2. **Programmatically** via Apify CLI:
```bash
apify key-value-stores get-value OUTPUT_RESULT
apify key-value-stores get-value OUTPUT
```
3. **Dataset** - Check the "Dataset" tab in the Actor run details to see processing metadata
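As a rough illustration of the direct-URL option, the sketch below builds the record URL from the pattern shown above and downloads it with the Python standard library. The helper names `record_url` and `fetch_record` are hypothetical, and the sketch assumes the store is readable without authentication; a private store would need credentials, which are omitted here.

```python
import urllib.request
from urllib.parse import quote

API_ROOT = "https://api.apify.com/v2"


def record_url(store_id: str, key: str = "OUTPUT") -> str:
    """Build the key-value store record URL in the form shown in the run log."""
    return (
        f"{API_ROOT}/key-value-stores/{quote(store_id, safe='')}"
        f"/records/{quote(key, safe='')}"
    )


def fetch_record(store_id: str, key: str = "OUTPUT", timeout: float = 30.0) -> bytes:
    """Download the raw record body (assumes no authentication is required)."""
    with urllib.request.urlopen(record_url(store_id, key), timeout=timeout) as response:
        return response.read()
```

For example, `record_url("abc123", "DOCLING_LOG")` gives the URL of the processing log in the store `abc123`.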
@ -182,7 +182,7 @@ apify key-value-stores get-record DOCLING_LOG
## Performance & Resources
- **Docker Image Size**: ~600 MB
- **Docker Image Size**: ~4GB
- **Memory Requirements**:
- Minimum: 2 GB RAM
- Recommended: 4 GB RAM for large or complex documents
@ -234,8 +234,12 @@ If you wish to develop or modify this Actor locally:
3. The Actor files are located in the `.actor` directory:
- `Dockerfile` - Defines the container environment
- `actor.json` - Actor configuration and metadata
- `actor.sh` - Main execution script
- `actor.sh` - Main execution script that starts the docling-serve API and orchestrates document processing
- `input_schema.json` - Input parameter definitions
- `dataset_schema.json` - Dataset output format definition
- `docling_processor.py` - Python script handling API communication with docling-serve
- `CHANGELOG.md` - Change log documenting all notable changes
- `README.md` - This documentation
4. Run the Actor locally using:
```bash
@ -248,25 +252,40 @@ If you wish to develop or modify this Actor locally:
.actor/
├── Dockerfile # Container definition
├── actor.json # Actor metadata
├── actor.sh # Execution script
├── actor.sh # Execution script (also starts docling-serve API)
├── input_schema.json # Input parameters
├── dataset_schema.json # Dataset output format definition
├── docling_processor.py # Python script for API communication
├── CHANGELOG.md # Version history and changes
└── README.md # This documentation
```
## Architecture
This Actor uses a lightweight architecture based on the official `ds4sd/docling-serve` Docker image:
This Actor uses a lightweight architecture based on the official `quay.io/ds4sd/docling-serve-cpu` Docker image:
- **Base Image**: `ds4sd/docling-serve:latest` (~600MB)
- **API Communication**: Uses the RESTful API provided by docling-serve on port 8080
- **Base Image**: `quay.io/ds4sd/docling-serve-cpu:latest` (~4GB)
- **Multi-Stage Build**: Uses a multi-stage Docker build to include only necessary tools
- **API Communication**: Uses the RESTful API provided by docling-serve
- **Request Flow**:
1. Actor receives the input parameters
2. Creates a JSON payload for the docling-serve API
3. Makes a POST request to the /convert endpoint
4. Processes the response and stores it in the key-value store
1. The actor script starts the docling-serve API on port 5001
2. Performs health checks to ensure the API is running
3. Processes the input parameters
4. Creates a JSON payload for the docling-serve API with proper format:
```json
{
"options": {
"to_formats": ["md"],
"do_ocr": true
},
"http_sources": [{"url": "https://example.com/document.pdf"}]
}
```
5. Makes a POST request to the `/v1alpha/convert/source` endpoint
6. Processes the response and stores it in the key-value store
- **Dependencies**:
- Node.js for Apify CLI
- Essential Linux tools (curl, jq, etc.)
- Essential tools (curl, jq, etc.) copied from build stage
- **Security**: Runs as a non-root user for enhanced security
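The payload and request steps of the flow above can be sketched in Python as follows. This is a minimal illustration, not the contents of `docling_processor.py`: the function names are hypothetical, the host is assumed to be `localhost`, and the response field names in `OUTPUT_FIELDS` (such as `md_content` under a top-level `document` object) are assumptions about the docling-serve response that should be checked against its API documentation.

```python
import json
import urllib.request

API_BASE = "http://localhost:5001"  # port from the request flow; host is assumed
CONVERT_ENDPOINT = "/v1alpha/convert/source"

# Assumed mapping from output format to the response field holding the content.
OUTPUT_FIELDS = {
    "md": "md_content",
    "json": "json_content",
    "html": "html_content",
    "text": "text_content",
    "doctags": "doctags_content",
}


def build_payload(document_url: str, output_format: str = "md", do_ocr: bool = True) -> dict:
    """Create the JSON payload in the structure shown in the request flow."""
    return {
        "options": {"to_formats": [output_format], "do_ocr": do_ocr},
        "http_sources": [{"url": document_url}],
    }


def extract_content(response_body: dict, output_format: str):
    """Pick the converted content for the requested format, or None if absent."""
    document = response_body.get("document", {})
    return document.get(OUTPUT_FIELDS[output_format])


def convert(document_url: str, output_format: str = "md", do_ocr: bool = True,
            timeout: float = 600.0):
    """POST the payload to the convert endpoint and return the extracted content."""
    data = json.dumps(build_payload(document_url, output_format, do_ocr)).encode("utf-8")
    request = urllib.request.Request(
        API_BASE + CONVERT_ENDPOINT,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return extract_content(body, output_format)
```

Keeping payload construction and response parsing in separate functions lets both be checked without a running docling-serve instance; only `convert` touches the network.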
## License
@ -275,7 +294,7 @@ This wrapper project is under the MIT License, matching the original Docling lic
## Acknowledgments
- [Docling](https://ds4sd.github.io/docling/) and [docling-serve](https://github.com/DS4SD/docling-serve) by IBM
- [Docling](https://ds4sd.github.io/docling/) and [docling-serve-cpu](https://quay.io/repository/ds4sd/docling-serve-cpu) by IBM
- [Apify](https://apify.com/?fpr=docling) for the serverless actor environment
## Security Considerations

@ -93,7 +93,7 @@ You can run Docling in the cloud without installation using the [Docling Actor](
```bash
apify call vancura/docling -i '{
"documentUrl": "https://arxiv.org/pdf/2408.09869",
"outputFormat": "markdown",
"outputFormat": "md",
"ocr": true
}'
```