mirror of https://github.com/DS4SD/docling.git (synced 2025-07-30 14:04:27 +00:00)
Actor: Enhance README.md with output details

Added detailed information about the Actor's output storage to the `README.md`. This includes specifying where processed documents, processing logs, and dataset records are stored.

Signed-off-by: Václav Vančura <commit@vancura.dev>
parent 5ecd4a48aa
commit 3245e1b8b7
@@ -2,7 +2,7 @@

 [](https://apify.com/vancura/docling)

-This Actor wraps the [Docling project](https://ds4sd.github.io/docling/) to provide serverless document processing in the cloud. It can process complex documents (PDF, DOCX, images) and convert them into structured formats (Markdown, JSON, HTML, Text, or DocTags) with optional OCR support.
+This Actor (specification v1) wraps the [Docling project](https://ds4sd.github.io/docling/) to provide serverless document processing in the cloud. It can process complex documents (PDF, DOCX, images) and convert them into structured formats (Markdown, JSON, HTML, Text, or DocTags) with optional OCR support.

 ## Table of Contents

@@ -16,6 +16,7 @@ This Actor wraps the [Docling project](https://ds4sd.github.io/docling/) to prov
 8. [Requirements & Installation](#requirements--installation)
 9. [License](#license)
 10. [Acknowledgments](#acknowledgments)
+11. [Security Considerations](#security-considerations)

 ## Features

@@ -92,15 +93,31 @@ The Actor accepts a JSON schema matching the file `.actor/input_schema.json`. Be

 ## Output

-After processing, the final document is saved as `OUTPUT_RESULT` in the default Key-Value Store.
-If the Actor logs warnings or debug info, these messages can be pushed to `DOCLING_LOG`.
+The Actor provides three types of outputs:

-You can retrieve the results programmatically by calling:
+1. **Processed Document** - Saved as `OUTPUT_RESULT` in the default key-value store
+2. **Processing Log** - Saved as `DOCLING_LOG` in the default key-value store
+3. **Dataset Record** - Contains processing metadata with:
+- Input document URL
+- Direct link to the processed output
+- Processing status
+
+You can access the results in several ways:
+
+1. **Direct URL** (shown in Actor run logs):
+
+```text
+https://api.apify.com/v2/key-value-stores/[STORE_ID]/records/OUTPUT_RESULT
+```
+
+2. **Programmatically** via Apify CLI:

 ```bash
 apify key-value-stores get-value OUTPUT_RESULT
 ```

+3. **Dataset** - Check the "Dataset" tab in the Actor run details to see processing metadata
+
 ### Example Outputs

 #### Markdown (md)
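
The "Direct URL" pattern added in the hunk above can also be used from a script. The following is a minimal sketch, not part of the commit: the helper names `record_url` and `fetch_record` are hypothetical, only the URL pattern and the record keys come from the README, and a private store may additionally need an API token, which this sketch leaves out.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base URL taken from the "Direct URL" example in the README diff.
API_BASE = "https://api.apify.com/v2"


def record_url(store_id: str, key: str = "OUTPUT_RESULT") -> str:
    """Build the direct URL of one record in a key-value store."""
    # quote() keeps unexpected characters in the ID or key from breaking the path.
    return (
        f"{API_BASE}/key-value-stores/{quote(store_id, safe='')}"
        f"/records/{quote(key, safe='')}"
    )


def fetch_record(store_id: str, key: str = "OUTPUT_RESULT", timeout: float = 30.0) -> bytes:
    """Download the raw record body. Needs network access and a real store ID."""
    with urlopen(record_url(store_id, key), timeout=timeout) as response:
        return response.read()
```

Calling `record_url(store_id, "DOCLING_LOG")` gives the matching URL for the processing log, since both records live in the same default key-value store.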
@@ -138,12 +155,32 @@ Content of section 2...
 <p>Content of section 1...</p>
 ```

+### Processing Logs (`DOCLING_LOG`)
+
+The Actor maintains detailed processing logs including:
+
+- Memory usage statistics
+- Processing steps and timing
+- Error messages and stack traces
+- Input validation results
+- OCR processing details (when enabled)
+
+Access logs via:
+
+```bash
+apify key-value-stores get-record DOCLING_LOG
+```
+
 ## Performance & Resources

 - **Docker Image Size**: ~6 GB (includes OCR libraries and ML models)
 - **Memory Requirements**:
 - Minimum: 4 GB RAM
 - Recommended: 8 GB RAM for large documents
+- **Memory Monitoring**:
+- Real-time memory usage tracking during processing
+- Detailed memory statistics in `DOCLING_LOG`
+- Automatic failure detection for out-of-memory situations
 - **Processing Time**:
 - Simple documents: 30-60 seconds
 - Complex PDFs with OCR: 2-5 minutes
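
The diff does not show how the Actor implements the memory monitoring listed above. As an illustration only, a Unix process could track its own peak memory roughly like this; the function names, the 90% headroom factor, and the use of the 4 GB minimum as the default limit are assumptions made for this sketch.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident set size of the current process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def near_memory_limit(used_mb: float, limit_mb: float = 4096.0, headroom: float = 0.9) -> bool:
    """Return True once usage reaches the given fraction of the limit."""
    # 4096 MiB mirrors the README's 4 GB minimum; the headroom factor is hypothetical.
    return used_mb >= limit_mb * headroom
```

A processing loop could call `near_memory_limit(peak_rss_mb())` between steps and write the reading to the log before deciding whether to abort.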
@@ -173,6 +210,17 @@ Common issues and solutions:
 - Check if the document structure is compatible
 - Review the `DOCLING_LOG` for specific errors

+### Error Handling
+
+The Actor implements comprehensive error handling:
+
+- Input validation for document URLs and parameters
+- Detailed error messages in `DOCLING_LOG`
+- Proper exit codes for different failure scenarios
+- Memory monitoring and out-of-memory detection
+- Automatic cleanup on failure
+- Dataset records with processing status
+
 ## Local Development

 If you wish to develop or modify this Actor locally:
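
The "Error Handling" list in the hunk above names input validation and distinct exit codes without giving specifics. A minimal sketch of the idea, assuming only HTTP(S) document URLs are accepted: the `ExitCode` values and function names below are hypothetical and are not taken from the Actor.

```python
from enum import IntEnum
from urllib.parse import urlparse


class ExitCode(IntEnum):
    # Hypothetical values; the Actor's real exit codes are not listed in the README.
    OK = 0
    INVALID_INPUT = 2
    PROCESSING_FAILED = 3


def validate_document_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that name a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def exit_code_for(url: str) -> ExitCode:
    """Map the validation result to an exit code before any processing starts."""
    return ExitCode.OK if validate_document_url(url) else ExitCode.INVALID_INPUT
```

Rejecting schemes such as `file://` before any download starts also fits the "Input URLs are validated before processing" point in the Security Considerations section.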
@@ -232,3 +280,11 @@ This wrapper project is under the MIT License, matching the original Docling lic

 - [Docling](https://ds4sd.github.io/docling/) codebase by IBM
 - [Apify](https://apify.com/?fpr=docling) for the serverless actor environment
+
+## Security Considerations
+
+- Actor runs under a non-root user (appuser) for enhanced security
+- Input URLs are validated before processing
+- Temporary files are securely managed and cleaned up
+- Process isolation through Docker containerization
+- Secure handling of processing artifacts
@@ -98,6 +98,12 @@ apify call vancura/docling -i '{
 }'
 ```

+The Actor stores results in:
+
+* Processed document in key-value store (`OUTPUT_RESULT`)
+* Processing logs (`DOCLING_LOG`)
+* Dataset record with result URL and status
+
 Read more about the [Docling Actor](.actor/README.md), including how to use it via the Apify API and CLI.

 ## Get help and support
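
The dataset record mentioned in the hunk above is described only by its contents (input document URL, result URL, status). One way to picture such a record is sketched below; the field names `input_url`, `output_url`, and `status`, the `"success"` status string, and the helper `make_record` are all hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass
class DatasetRecord:
    # Field names are assumptions; the README only lists what the record contains.
    input_url: str   # the document that was submitted
    output_url: str  # direct link to the processed output
    status: str      # processing status


def make_record(input_url: str, store_id: str, status: str = "success") -> dict:
    """Assemble a record whose output link follows the README's direct-URL pattern."""
    output_url = (
        f"https://api.apify.com/v2/key-value-stores/{store_id}/records/OUTPUT_RESULT"
    )
    return asdict(DatasetRecord(input_url, output_url, status))
```

For example, `make_record("https://example.com/a.pdf", "abc123")` yields a plain dictionary that could be serialized to JSON as a single dataset item.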