mirror of https://github.com/DS4SD/docling.git (synced 2025-07-30 14:04:27 +00:00)
Actor: Update CHANGELOG and README for Docker and API changes
Signed-off-by: Václav Vančura <commit@vancura.dev>
parent 5f5c0a9d50
commit 72077c109d
@@ -5,20 +5,38 @@ All notable changes to the Docling Actor will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## [1.1.0] - 2025-03-15
+## [1.1.0] - 2025-03-09
 
 ### Changed
 
 - Switched from full Docling CLI to docling-serve API
-- Dramatically reduced Docker image size (from ~6GB to ~600MB)
+- Using the official quay.io/ds4sd/docling-serve-cpu Docker image
+- Reduced Docker image size (from ~6GB to ~4GB)
+- Implemented multi-stage Docker build to handle dependencies
+- Improved Docker build process to ensure compatibility with docling-serve-cpu image
+- Added new Python processor script for reliable API communication and content extraction
+- Enhanced response handling with better content extraction logic
+- Fixed ES modules compatibility issue with Apify CLI
+- Added explicit tmpfs volume for temporary files
+- Fixed environment variables format in actor.json
+- Created optimized dependency installation approach
 - Improved API compatibility with docling-serve
+- Updated endpoint from custom `/convert` to standard `/v1alpha/convert/source`
+- Revised JSON payload structure to match docling-serve API format
+- Added proper output field parsing based on format
+- Enhanced startup process with health checks
+- Added configurable API host and port through environment variables
 - Better content type handling for different output formats
 - Updated error handling to align with API responses
 
+### Fixed
+
+- Fixed actor input file conflict in get_actor_input(): now checks for and removes an existing /tmp/actor-input/INPUT directory if found, ensuring valid JSON input parsing.
+
 ### Technical Details
 
 - Actor Specification v1
-- Using ds4sd/docling-serve:latest base image
+- Using quay.io/ds4sd/docling-serve-cpu:latest base image
 - Node.js 20.x for Apify CLI
 - Eliminated Python dependencies
 - Simplified Docker build process
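
Reviewer note on the `### Fixed` entry in the changelog hunk above: the described behavior (remove a stale directory at the input path so the JSON input can be written and parsed as a file) can be sketched as follows. This is a minimal illustration, not the commit's code; the real `get_actor_input()` may well live in the shell script, and the helper names `prepare_input_path` and `write_and_read_input` are hypothetical.

```python
import json
import shutil
from pathlib import Path


def prepare_input_path(input_path: Path) -> None:
    """Make sure input_path can be written as a regular file.

    If a directory already sits at that location (the conflict described
    in the changelog), remove it; then make sure the parent exists.
    """
    if input_path.is_dir():
        shutil.rmtree(input_path)
    input_path.parent.mkdir(parents=True, exist_ok=True)


def write_and_read_input(input_path: Path, raw_json: str) -> dict:
    """Hypothetical helper: store the raw input and parse it back as JSON."""
    prepare_input_path(input_path)
    input_path.write_text(raw_json, encoding="utf-8")
    return json.loads(input_path.read_text(encoding="utf-8"))
```

In the Actor the path would be `/tmp/actor-input/INPUT`, as named in the changelog; the sketch takes the path as a parameter so it can be exercised against a temporary directory.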

@@ -24,7 +24,7 @@ This Actor (specification v1) wraps the [Docling project](https://ds4sd.github.i
 
 ## Features
 
-- Leverages the lightweight docling-serve API for efficient document processing
+- Leverages the official docling-serve-cpu Docker image for efficient document processing
 - Processes multiple document formats:
 - PDF documents (scanned or digital)
 - Microsoft Office files (DOCX, XLSX, PPTX)
@@ -49,7 +49,7 @@ This Actor (specification v1) wraps the [Docling project](https://ds4sd.github.i
 - The URL of the document.
 - Output format (`md`, `json`, `html`, `text`, or `doctags`).
 - OCR boolean toggle.
-4. The Actor will run and produce its outputs in the default key-value store under the key `OUTPUT_RESULT`.
+4. The Actor will run and produce its outputs in the default key-value store under the key `OUTPUT`.
 
 ### Using Apify API
 
@@ -102,7 +102,7 @@ The Actor provides three types of outputs:
 1. **Processed Document** - The Actor will provide the direct URL to your result in the run log, looking like:
 
 ```text
-You can find your results at: 'https://api.apify.com/v2/key-value-stores/[YOUR_STORE_ID]/records/OUTPUT_RESULT'
+You can find your results at: 'https://api.apify.com/v2/key-value-stores/[YOUR_STORE_ID]/records/OUTPUT'
 ```
 
 2. **Processing Log** - Available in the key-value store as `DOCLING_LOG`
@@ -117,13 +117,13 @@ You can access the results in several ways:
 1. **Direct URL** (shown in Actor run logs):
 
 ```text
-https://api.apify.com/v2/key-value-stores/[STORE_ID]/records/OUTPUT_RESULT
+https://api.apify.com/v2/key-value-stores/[STORE_ID]/records/OUTPUT
 ```
 
 2. **Programmatically** via Apify CLI:
 
 ```bash
-apify key-value-stores get-value OUTPUT_RESULT
+apify key-value-stores get-value OUTPUT
 ```
 
 3. **Dataset** - Check the "Dataset" tab in the Actor run details to see processing metadata
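
Reviewer note: the record key rename from `OUTPUT_RESULT` to `OUTPUT` changes the direct result URL shown in the hunks above. A small Python sketch of how a client could assemble that URL follows; the function name `record_url` and the default key are illustrative assumptions, and only the URL pattern itself comes from the README text.

```python
from urllib.parse import quote

# Base of the URL pattern quoted in the README.
APIFY_API_BASE = "https://api.apify.com/v2"


def record_url(store_id: str, key: str = "OUTPUT") -> str:
    """Build the key-value store record URL for a given store and key.

    Both path segments are percent-encoded so unusual IDs or keys
    cannot break the URL structure.
    """
    return (
        f"{APIFY_API_BASE}/key-value-stores/"
        f"{quote(store_id, safe='')}/records/{quote(key, safe='')}"
    )
```

Passing `key="DOCLING_LOG"` would yield the URL of the processing log record mentioned in the same README section.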
@@ -182,7 +182,7 @@ apify key-value-stores get-record DOCLING_LOG
 
 ## Performance & Resources
 
-- **Docker Image Size**: ~600 MB
+- **Docker Image Size**: ~4GB
 - **Memory Requirements**:
 - Minimum: 2 GB RAM
 - Recommended: 4 GB RAM for large or complex documents
@@ -234,8 +234,12 @@ If you wish to develop or modify this Actor locally:
 3. The Actor files are located in the `.actor` directory:
 - `Dockerfile` - Defines the container environment
 - `actor.json` - Actor configuration and metadata
-- `actor.sh` - Main execution script
+- `actor.sh` - Main execution script that starts the docling-serve API and orchestrates document processing
 - `input_schema.json` - Input parameter definitions
+- `dataset_schema.json` - Dataset output format definition
+- `docling_processor.py` - Python script handling API communication with docling-serve
+- `CHANGELOG.md` - Change log documenting all notable changes
+- `README.md` - This documentation
 4. Run the Actor locally using:
 
 ```bash
@@ -246,27 +250,42 @@ If you wish to develop or modify this Actor locally:
 
 ```text
 .actor/
 ├── Dockerfile # Container definition
 ├── actor.json # Actor metadata
-├── actor.sh # Execution script
+├── actor.sh # Execution script (also starts docling-serve API)
 ├── input_schema.json # Input parameters
-└── README.md # This documentation
+├── dataset_schema.json # Dataset output format definition
+├── docling_processor.py # Python script for API communication
+├── CHANGELOG.md # Version history and changes
+└── README.md # This documentation
 ```
 
 ## Architecture
 
-This Actor uses a lightweight architecture based on the official `ds4sd/docling-serve` Docker image:
+This Actor uses a lightweight architecture based on the official `quay.io/ds4sd/docling-serve-cpu` Docker image:
 
-- **Base Image**: `ds4sd/docling-serve:latest` (~600MB)
-- **API Communication**: Uses the RESTful API provided by docling-serve on port 8080
+- **Base Image**: `quay.io/ds4sd/docling-serve-cpu:latest` (~4GB)
+- **Multi-Stage Build**: Uses a multi-stage Docker build to include only necessary tools
+- **API Communication**: Uses the RESTful API provided by docling-serve
 - **Request Flow**:
-1. Actor receives the input parameters
-2. Creates a JSON payload for the docling-serve API
-3. Makes a POST request to the /convert endpoint
-4. Processes the response and stores it in the key-value store
-- **Dependencies**:
+1. The actor script starts the docling-serve API on port 5001
+2. Performs health checks to ensure the API is running
+3. Processes the input parameters
+4. Creates a JSON payload for the docling-serve API with proper format:
+```json
+{
+"options": {
+"to_formats": ["md"],
+"do_ocr": true
+},
+"http_sources": [{"url": "https://example.com/document.pdf"}]
+}
+```
+5. Makes a POST request to the `/v1alpha/convert/source` endpoint
+6. Processes the response and stores it in the key-value store
+- **Dependencies**:
 - Node.js for Apify CLI
-- Essential Linux tools (curl, jq, etc.)
+- Essential tools (curl, jq, etc.) copied from build stage
 - **Security**: Runs as a non-root user for enhanced security
 
 ## License
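
Reviewer note: steps 4 and 5 of the new request flow (build the JSON payload, then POST it to `/v1alpha/convert/source`) can be sketched in Python roughly as below. This is not the contents of `docling_processor.py`, which this page does not show. The environment variable names `DOCLING_API_HOST` and `DOCLING_API_PORT` are hypothetical stand-ins for the "configurable API host and port" mentioned in the changelog; the payload shape, the endpoint path, and port 5001 are taken from the README text in the hunk above.

```python
import json
import os
import urllib.request
from typing import Optional


def build_payload(document_url: str, output_format: str = "md", do_ocr: bool = True) -> dict:
    """Build the request body in the shape shown in the README hunk."""
    return {
        "options": {"to_formats": [output_format], "do_ocr": do_ocr},
        "http_sources": [{"url": document_url}],
    }


def build_request(
    payload: dict, host: Optional[str] = None, port: Optional[int] = None
) -> urllib.request.Request:
    """Prepare (but do not send) the POST request for the convert endpoint.

    Host and port fall back to hypothetical environment variables, then
    to localhost:5001, the port named in the request flow.
    """
    host = host or os.environ.get("DOCLING_API_HOST", "localhost")
    port = port or int(os.environ.get("DOCLING_API_PORT", "5001"))
    url = f"http://{host}:{port}/v1alpha/convert/source"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request would be a matter of passing the result to `urllib.request.urlopen` once the API has passed its health check; the sketch stops short of that so it can be read and tested without a running docling-serve instance.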
@@ -275,7 +294,7 @@ This wrapper project is under the MIT License, matching the original Docling lic
 
 ## Acknowledgments
 
-- [Docling](https://ds4sd.github.io/docling/) and [docling-serve](https://github.com/DS4SD/docling-serve) by IBM
+- [Docling](https://ds4sd.github.io/docling/) and [docling-serve-cpu](https://quay.io/repository/ds4sd/docling-serve-cpu) by IBM
 - [Apify](https://apify.com/?fpr=docling) for the serverless actor environment
 
 ## Security Considerations

@@ -93,7 +93,7 @@ You can run Docling in the cloud without installation using the [Docling Actor](
 ```bash
 apify call vancura/docling -i '{
 "documentUrl": "https://arxiv.org/pdf/2408.09869",
-"outputFormat": "markdown",
+"outputFormat": "md",
 "ocr": true
 }'
 ```
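
Reviewer note: the last hunk corrects `"outputFormat": "markdown"` to `"md"`, matching the format list (`md`, `json`, `html`, `text`, `doctags`) in the Actor README. A minimal Python sketch of validating such input before use is given below. The function name `normalize_actor_input` and the defaults (`md`, OCR enabled) are assumptions for illustration; the field names come from the `apify call` example above.

```python
# Output formats listed in the Actor README.
ALLOWED_FORMATS = ("md", "json", "html", "text", "doctags")


def normalize_actor_input(raw: dict) -> dict:
    """Validate the three input fields used in the `apify call` example.

    Raises ValueError for a missing URL or an unsupported output format,
    so a value such as "markdown" is rejected early instead of being
    forwarded to the conversion API.
    """
    url = raw.get("documentUrl")
    if not isinstance(url, str) or not url:
        raise ValueError("documentUrl must be a non-empty string")

    fmt = raw.get("outputFormat", "md")  # assumed default
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {ALLOWED_FORMATS}, got {fmt!r}")

    return {"documentUrl": url, "outputFormat": fmt, "ocr": bool(raw.get("ocr", True))}
```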