Actor: Updated main Readme and Actor Readme

Signed-off-by: Adam Kliment <adam@netmilk.net>
Adam Kliment 2025-03-13 14:07:39 +01:00
parent 53837fe30e
commit 6a9d041bfa
2 changed files with 37 additions and 25 deletions


@ -55,23 +55,31 @@ This Actor (specification v1) wraps the [Docling project](https://ds4sd.github.i
```bash ```bash
curl --request POST \ curl --request POST \
--url "https://api.apify.com/v2/acts/username~actorname/run" \ --url "https://api.apify.com/v2/acts/vancura~docling/run" \
--header 'Content-Type: application/json' \ --header 'Content-Type: application/json' \
--header 'Authorization: Bearer YOUR_API_TOKEN' \ --header 'Authorization: Bearer YOUR_API_TOKEN' \
--data '{ --data '{
"documentUrl": "https://arxiv.org/pdf/2408.09869.pdf", "options": {
"outputFormat": "md", "to_formats": ["md", "json", "html", "text", "doctags"]
"ocr": true },
}' "http_sources": [
{"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
{"url": "https://arxiv.org/pdf/2408.09869"}
]
}'
``` ```
### Using Apify CLI ### Using Apify CLI
```bash ```bash
apify call username/actorname --input='{ apify call vancura/docling --input='{
"documentUrl": "https://arxiv.org/pdf/2408.09869.pdf", "options": {
"outputFormat": "md", "to_formats": ["md", "json", "html", "text", "doctags"]
"ocr": true },
"http_sources": [
{"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
{"url": "https://arxiv.org/pdf/2408.09869"}
]
}' }'
``` ```
@ -79,19 +87,22 @@ apify call username/actorname --input='{
The Actor accepts a JSON schema matching the file `.actor/input_schema.json`. Below is a summary of the fields: The Actor accepts a JSON schema matching the file `.actor/input_schema.json`. Below is a summary of the fields:
| Field | Type | Required | Default | Description | | Field | Type | Required | Default | Description |
|----------------|---------|----------|----------|-----------------------------------------------------------------------------------------------------------| |----------------|---------|----------|----------|-------------------------------------------------------------------------------|
| `documentUrl` | string | Yes | None | URL of the document (PDF, image, DOCX, etc.) to be processed. Must be directly accessible via public URL. | | `http_sources` | object | Yes | None | https://github.com/DS4SD/docling-serve?tab=readme-ov-file#url-endpoint |
| `outputFormat` | string | No | `md` | Desired output format. One of `md`, `json`, `html`, `text`, or `doctags`. | | `options` | object | No | None | https://github.com/DS4SD/docling-serve?tab=readme-ov-file#common-parameters |
| `ocr` | boolean | No | `true` | If set to true, OCR will be applied to scanned PDFs or images for text recognition. |
### Example Input ### Example Input
```json ```json
{ {
"documentUrl": "https://arxiv.org/pdf/2408.09869.pdf", "options": {
"outputFormat": "md", "to_formats": ["md", "json", "html", "text", "doctags"]
"ocr": false },
"http_sources": [
{"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
{"url": "https://arxiv.org/pdf/2408.09869"}
]
} }
``` ```
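The new input shape in this hunk (an `options` object plus a list of `http_sources`) can be assembled programmatically. The following is a minimal sketch, not part of the commit: `build_actor_input` and `KNOWN_FORMATS` are hypothetical names, and the set of accepted formats is assumed from the five values shown in the example above.

```python
import json

# Output formats that appear in the example input of this diff (assumed complete).
KNOWN_FORMATS = ("md", "json", "html", "text", "doctags")


def build_actor_input(urls, to_formats=("md",)):
    """Build an input object in the new shape (hypothetical helper)."""
    if not urls:
        raise ValueError("at least one source URL is required")
    # Reject formats that the diff does not list.
    unknown = [fmt for fmt in to_formats if fmt not in KNOWN_FORMATS]
    if unknown:
        raise ValueError(f"unknown output formats: {unknown}")
    return {
        "options": {"to_formats": list(to_formats)},
        "http_sources": [{"url": url} for url in urls],
    }


if __name__ == "__main__":
    payload = build_actor_input(
        ["https://arxiv.org/pdf/2408.09869"],
        to_formats=["md", "json"],
    )
    # The printed JSON can be passed to `apify call ... --input=...`.
    print(json.dumps(payload, indent=2))
```

The helper only checks what the diff itself shows; any further `options` keys would need to be taken from the docling-serve documentation linked in the table.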
@ -99,7 +110,7 @@ The Actor accepts a JSON schema matching the file `.actor/input_schema.json`. Be
The Actor provides three types of outputs: The Actor provides three types of outputs:
1. **Processed Document** - The Actor will provide the direct URL to your result in the run log, looking like: 1. **Processed Documents in a ZIP** - The Actor will provide the direct URL to your result in the run log, looking like:
```text ```text
You can find your results at: 'https://api.apify.com/v2/key-value-stores/[YOUR_STORE_ID]/records/OUTPUT' You can find your results at: 'https://api.apify.com/v2/key-value-stores/[YOUR_STORE_ID]/records/OUTPUT'
@ -108,8 +119,7 @@ The Actor provides three types of outputs:
2. **Processing Log** - Available in the key-value store as `DOCLING_LOG` 2. **Processing Log** - Available in the key-value store as `DOCLING_LOG`
3. **Dataset Record** - Contains processing metadata with: 3. **Dataset Record** - Contains processing metadata with:
- Input document URL - Direct link to the processed output zip file
- Direct link to the processed output
- Processing status - Processing status
You can access the results in several ways: You can access the results in several ways:
@ -219,7 +229,6 @@ Common issues and solutions:
The Actor implements comprehensive error handling: The Actor implements comprehensive error handling:
- Input validation for document URLs and parameters
- Detailed error messages in `DOCLING_LOG` - Detailed error messages in `DOCLING_LOG`
- Proper exit codes for different failure scenarios - Proper exit codes for different failure scenarios
- Automatic cleanup on failure - Automatic cleanup on failure
@ -237,7 +246,6 @@ If you wish to develop or modify this Actor locally:
- `actor.sh` - Main execution script that starts the docling-serve API and orchestrates document processing - `actor.sh` - Main execution script that starts the docling-serve API and orchestrates document processing
- `input_schema.json` - Input parameter definitions - `input_schema.json` - Input parameter definitions
- `dataset_schema.json` - Dataset output format definition - `dataset_schema.json` - Dataset output format definition
- `docling_processor.py` - Python script handling API communication with docling-serve
- `CHANGELOG.md` - Change log documenting all notable changes - `CHANGELOG.md` - Change log documenting all notable changes
- `README.md` - This documentation - `README.md` - This documentation
4. Run the Actor locally using: 4. Run the Actor locally using:


@ -94,9 +94,13 @@ You can run Docling in the cloud without installation using the [Docling Actor](
```bash ```bash
apify call vancura/docling -i '{ apify call vancura/docling -i '{
"documentUrl": "https://arxiv.org/pdf/2408.09869", "options": {
"outputFormat": "md", "to_formats": ["md", "json", "html", "text", "doctags"]
"ocr": true },
"http_sources": [
{"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
{"url": "https://arxiv.org/pdf/2408.09869"}
]
}' }'
``` ```
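Because this commit replaces the flat `documentUrl` / `outputFormat` / `ocr` input with `http_sources` and `options`, existing callers need to reshape their payloads. The sketch below shows one way to do that; `migrate_legacy_input` is a hypothetical helper, not code from the repository. The diff does not show an equivalent for the old `ocr` flag, so the sketch drops it rather than guess at a field name.

```python
def migrate_legacy_input(old):
    """Convert the pre-commit input shape to the new one (hypothetical helper)."""
    if "documentUrl" not in old:
        raise KeyError("legacy input must contain 'documentUrl'")
    return {
        # The old schema defaulted `outputFormat` to "md".
        "options": {"to_formats": [old.get("outputFormat", "md")]},
        # The single document URL becomes a one-element source list.
        "http_sources": [{"url": old["documentUrl"]}],
        # NOTE: `ocr` is intentionally not mapped; its replacement is not in this diff.
    }


if __name__ == "__main__":
    legacy = {
        "documentUrl": "https://arxiv.org/pdf/2408.09869",
        "outputFormat": "md",
        "ocr": True,
    }
    print(migrate_legacy_input(legacy))
```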