commit f1f7df49e3

Update test-cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
11  .actor/.dockerignore  Normal file

@@ -0,0 +1,11 @@
**/__pycache__
**/*.pyc
**/*.pyo
**/*.pyd
.git
.gitignore
.env
.venv
*.log
.pytest_cache
.coverage
69  .actor/CHANGELOG.md  Normal file

@@ -0,0 +1,69 @@
# Changelog

All notable changes to the Docling Actor will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.1.0] - 2025-03-09

### Changed

- Switched from full Docling CLI to docling-serve API
- Using the official quay.io/ds4sd/docling-serve-cpu Docker image
- Reduced Docker image size (from ~6GB to ~4GB)
- Implemented multi-stage Docker build to handle dependencies
- Improved Docker build process to ensure compatibility with docling-serve-cpu image
- Added new Python processor script for reliable API communication and content extraction
- Enhanced response handling with better content extraction logic
- Fixed ES modules compatibility issue with Apify CLI
- Added explicit tmpfs volume for temporary files
- Fixed environment variables format in actor.json
- Created optimized dependency installation approach
- Improved API compatibility with docling-serve
  - Updated endpoint from custom `/convert` to standard `/v1alpha/convert/source`
  - Revised JSON payload structure to match docling-serve API format
  - Added proper output field parsing based on format
- Enhanced startup process with health checks
- Added configurable API host and port through environment variables
- Better content type handling for different output formats
- Updated error handling to align with API responses

### Fixed

- Fixed actor input file conflict in get_actor_input(): now checks for and removes an existing /tmp/actor-input/INPUT directory if found, ensuring valid JSON input parsing.

### Technical Details

- Actor Specification v1
- Using quay.io/ds4sd/docling-serve-cpu:latest base image
- Node.js 20.x for Apify CLI
- Eliminated Python dependencies
- Simplified Docker build process

## [1.0.0] - 2025-02-07

### Added

- Initial release of Docling Actor
- Support for multiple document formats (PDF, DOCX, images)
- OCR capabilities for scanned documents
- Multiple output formats (md, json, html, text, doctags)
- Comprehensive error handling and logging
- Dataset records with processing status
- Memory monitoring and resource optimization
- Security features including non-root user execution

### Technical Details

- Actor Specification v1
- Docling v2.17.0
- Python 3.11
- Node.js 20.x
- Comprehensive error codes:
  - 10: Invalid input
  - 11: URL inaccessible
  - 12: Docling processing failed
  - 13: Output file missing
  - 14: Storage operation failed
  - 15: OCR processing failed
87  .actor/Dockerfile  Normal file

@@ -0,0 +1,87 @@
# Build stage for installing dependencies
FROM node:20-slim AS builder

# Install necessary tools and prepare dependencies environment in one layer
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/* \
    && mkdir -p /build/bin /build/lib/node_modules \
    && cp /usr/local/bin/node /build/bin/

# Set working directory
WORKDIR /build

# Create package.json and install Apify CLI in one layer
RUN echo '{"name":"docling-actor-dependencies","version":"1.0.0","description":"Dependencies for Docling Actor","private":true,"type":"module","engines":{"node":">=18"}}' > package.json \
    && npm install apify-cli@latest \
    && cp -r node_modules/* lib/node_modules/ \
    && echo '#!/bin/sh\n/tmp/docling-tools/bin/node /tmp/docling-tools/lib/node_modules/apify-cli/bin/run "$@"' > bin/actor \
    && chmod +x bin/actor \
    # Clean up npm cache to reduce image size
    && npm cache clean --force

# Final stage with docling-serve-cpu
FROM quay.io/ds4sd/docling-serve-cpu:latest

LABEL maintainer="Vaclav Vancura <@vancura>" \
      description="Apify Actor for document processing using Docling" \
      version="1.1.0"

# Set only essential environment variables
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    DOCLING_SERVE_HOST=0.0.0.0 \
    DOCLING_SERVE_PORT=5001

# Switch to root temporarily to set up directories and permissions
USER root
WORKDIR /app

# Install required tools and create directories in a single layer
RUN dnf install -y \
    jq \
    && dnf clean all \
    && mkdir -p /build-files \
        /tmp \
        /tmp/actor-input \
        /tmp/actor-output \
        /tmp/actor-storage \
        /tmp/apify_input \
        /apify_input \
        /opt/app-root/src/.EasyOCR/user_network \
        /tmp/easyocr-models \
    && chown 1000:1000 /build-files \
    && chown -R 1000:1000 /opt/app-root/src/.EasyOCR \
    && chmod 1777 /tmp \
    && chmod 1777 /tmp/easyocr-models \
    && chmod 777 /tmp/actor-input /tmp/actor-output /tmp/actor-storage /tmp/apify_input /apify_input \
    # Fix for uv_os_get_passwd error in Node.js
    && echo "docling:x:1000:1000:Docling User:/app:/bin/sh" >> /etc/passwd

# Set environment variable to tell EasyOCR to use a writable location for models
ENV EASYOCR_MODULE_PATH=/tmp/easyocr-models

# Copy only required files
COPY --chown=1000:1000 .actor/actor.sh .actor/actor.sh
COPY --chown=1000:1000 .actor/actor.json .actor/actor.json
COPY --chown=1000:1000 .actor/input_schema.json .actor/input_schema.json
COPY --chown=1000:1000 .actor/docling_processor.py .actor/docling_processor.py
RUN chmod +x .actor/actor.sh

# Copy the build files from builder
COPY --from=builder --chown=1000:1000 /build /build-files

# Switch to non-root user
USER 1000

# Set up TMPFS for temporary files
VOLUME ["/tmp"]

# Create additional volumes for OCR models persistence
VOLUME ["/tmp/easyocr-models"]

# Expose the docling-serve API port
EXPOSE 5001

# Run the actor script
ENTRYPOINT [".actor/actor.sh"]
314  .actor/README.md  Normal file

@@ -0,0 +1,314 @@
# Docling Actor on Apify

[](https://apify.com/vancura/docling)

This Actor (specification v1) wraps the [Docling project](https://ds4sd.github.io/docling/) to provide serverless document processing in the cloud. It can process complex documents (PDF, DOCX, images) and convert them into structured formats (Markdown, JSON, HTML, Text, or DocTags) with optional OCR support.

## What are Actors?

[Actors](https://docs.apify.com/platform/actors?fpr=docling) are serverless microservices running on the [Apify Platform](https://apify.com/?fpr=docling). They are based on the [Actor SDK](https://docs.apify.com/sdk/js?fpr=docling) and can be found in the [Apify Store](https://apify.com/store?fpr=docling). Learn more about Actors in the [Apify Whitepaper](https://whitepaper.actor?fpr=docling).

## Table of Contents

1. [Features](#features)
2. [Usage](#usage)
3. [Input Parameters](#input-parameters)
4. [Output](#output)
5. [Performance & Resources](#performance--resources)
6. [Troubleshooting](#troubleshooting)
7. [Local Development](#local-development)
8. [Architecture](#architecture)
9. [License](#license)
10. [Acknowledgments](#acknowledgments)
11. [Security Considerations](#security-considerations)

## Features

- Leverages the official docling-serve-cpu Docker image for efficient document processing
- Processes multiple document formats:
  - PDF documents (scanned or digital)
  - Microsoft Office files (DOCX, XLSX, PPTX)
  - Images (PNG, JPG, TIFF)
  - Other text-based formats
- Provides OCR capabilities for scanned documents
- Exports to multiple formats:
  - Markdown
  - JSON
  - HTML
  - Plain Text
  - DocTags (structured format)
- No local setup needed—just provide input via a simple JSON config

## Usage

### Using Apify Console

1. Go to the Apify Actor page.
2. Click "Run".
3. In the input form, fill in:
   - The URL of the document.
   - Output format (`md`, `json`, `html`, `text`, or `doctags`).
   - OCR boolean toggle.
4. The Actor will run and produce its outputs in the default key-value store under the key `OUTPUT`.

### Using Apify API

```bash
curl --request POST \
  --url "https://api.apify.com/v2/acts/vancura~docling/run" \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer YOUR_API_TOKEN' \
  --data '{
    "options": {
      "to_formats": ["md", "json", "html", "text", "doctags"]
    },
    "http_sources": [
      {"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
      {"url": "https://arxiv.org/pdf/2408.09869"}
    ]
  }'
```

### Using Apify CLI

```bash
apify call vancura/docling --input='{
  "options": {
    "to_formats": ["md", "json", "html", "text", "doctags"]
  },
  "http_sources": [
    {"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
    {"url": "https://arxiv.org/pdf/2408.09869"}
  ]
}'
```

## Input Parameters

The Actor accepts JSON input matching the schema in `.actor/input_schema.json`. Below is a summary of the fields:

| Field          | Type   | Required | Default | Description                                                                  |
|----------------|--------|----------|---------|------------------------------------------------------------------------------|
| `http_sources` | array  | Yes      | None    | https://github.com/DS4SD/docling-serve?tab=readme-ov-file#url-endpoint       |
| `options`      | object | No       | None    | https://github.com/DS4SD/docling-serve?tab=readme-ov-file#common-parameters  |

### Example Input

```json
{
  "options": {
    "to_formats": ["md", "json", "html", "text", "doctags"]
  },
  "http_sources": [
    {"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
    {"url": "https://arxiv.org/pdf/2408.09869"}
  ]
}
```

## Output

The Actor provides three types of outputs:

1. **Processed Documents in a ZIP** - The Actor will provide the direct URL to your result in the run log, looking like:

   ```text
   You can find your results at: 'https://api.apify.com/v2/key-value-stores/[YOUR_STORE_ID]/records/OUTPUT'
   ```

2. **Processing Log** - Available in the key-value store as `DOCLING_LOG`

3. **Dataset Record** - Contains processing metadata with:
   - Direct link to the processed output zip file
   - Processing status

You can access the results in several ways:

1. **Direct URL** (shown in Actor run logs; a `curl` sketch follows this list):

   ```text
   https://api.apify.com/v2/key-value-stores/[STORE_ID]/records/OUTPUT
   ```

2. **Programmatically** via Apify CLI:

   ```bash
   apify key-value-stores get-value OUTPUT
   ```

3. **Dataset** - Check the "Dataset" tab in the Actor run details to see processing metadata

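If you prefer not to use the Apify CLI, the same `OUTPUT` record can be fetched with plain `curl`. This is a minimal sketch, assuming the `[STORE_ID]` placeholder is replaced with the store ID printed in your run log and that the record is accessible to you:

```bash
# Download the result ZIP straight from the key-value store URL
# printed in the run log. Replace [STORE_ID] with your own store ID.
curl -L -o output.zip \
  "https://api.apify.com/v2/key-value-stores/[STORE_ID]/records/OUTPUT"
```
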
### Example Outputs

#### Markdown (md)

```markdown
# Document Title

## Section 1
Content of section 1...

## Section 2
Content of section 2...
```

#### JSON

```json
{
  "title": "Document Title",
  "sections": [
    {
      "level": 1,
      "title": "Section 1",
      "content": "Content of section 1..."
    }
  ]
}
```

#### HTML

```html
<h1>Document Title</h1>
<h2>Section 1</h2>
<p>Content of section 1...</p>
```

### Processing Logs (`DOCLING_LOG`)

The Actor maintains detailed processing logs including:

- API request and response details
- Processing steps and timing
- Error messages and stack traces
- Input validation results

Access logs via:

```bash
apify key-value-stores get-record DOCLING_LOG
```

## Performance & Resources

- **Docker Image Size**: ~4GB
- **Memory Requirements**:
  - Minimum: 2 GB RAM
  - Recommended: 4 GB RAM for large or complex documents
- **Processing Time**:
  - Simple documents: 15-30 seconds
  - Complex PDFs with OCR: 1-3 minutes
  - Large documents (100+ pages): 3-10 minutes

## Troubleshooting

Common issues and solutions:

1. **Document URL Not Accessible**
   - Ensure the URL is publicly accessible
   - Check if the document requires authentication
   - Verify the URL leads directly to the document

2. **OCR Processing Fails**
   - Verify the document is not password-protected
   - Check if the image quality is sufficient
   - Try processing with OCR disabled

3. **API Response Issues**
   - Check the logs for detailed error messages
   - Ensure the document format is supported
   - Verify the URL is correctly formatted

4. **Output Format Issues**
   - Verify the output format is supported
   - Check if the document structure is compatible
   - Review the `DOCLING_LOG` for specific errors

### Error Handling

The Actor implements comprehensive error handling:

- Detailed error messages in `DOCLING_LOG`
- Proper exit codes for different failure scenarios
- Automatic cleanup on failure
- Dataset records with processing status

## Local Development

If you wish to develop or modify this Actor locally:

1. Clone the repository.
2. Ensure Docker is installed (a local build sketch follows this list).
3. The Actor files are located in the `.actor` directory:
   - `Dockerfile` - Defines the container environment
   - `actor.json` - Actor configuration and metadata
   - `actor.sh` - Main execution script that starts the docling-serve API and orchestrates document processing
   - `input_schema.json` - Input parameter definitions
   - `dataset_schema.json` - Dataset output format definition
   - `CHANGELOG.md` - Change log documenting all notable changes
   - `README.md` - This documentation
4. Run the Actor locally using:

   ```bash
   apify run
   ```

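Beyond `apify run`, a quick way to validate the `Dockerfile` listed above is to build the image directly. This is a minimal sketch, assuming Docker is installed and the command is run from the repository root; the `docling-actor` tag is an arbitrary local name:

```bash
# Build the Actor image locally as a sanity check.
# The tag name is arbitrary; the Apify platform performs its own build
# when the Actor is deployed.
docker build -f .actor/Dockerfile -t docling-actor .
```
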
### Actor Structure

```text
.actor/
├── Dockerfile            # Container definition
├── actor.json            # Actor metadata
├── actor.sh              # Execution script (also starts docling-serve API)
├── input_schema.json     # Input parameters
├── dataset_schema.json   # Dataset output format definition
├── docling_processor.py  # Python script for API communication
├── CHANGELOG.md          # Version history and changes
└── README.md             # This documentation
```

## Architecture

This Actor uses a lightweight architecture based on the official `quay.io/ds4sd/docling-serve-cpu` Docker image:

- **Base Image**: `quay.io/ds4sd/docling-serve-cpu:latest` (~4GB)
- **Multi-Stage Build**: Uses a multi-stage Docker build to include only necessary tools
- **API Communication**: Uses the RESTful API provided by docling-serve
- **Request Flow**:
  1. The actor script starts the docling-serve API on port 5001
  2. Performs health checks to ensure the API is running
  3. Processes the input parameters
  4. Creates a JSON payload for the docling-serve API with proper format:

     ```json
     {
       "options": {
         "to_formats": ["md"],
         "do_ocr": true
       },
       "http_sources": [{"url": "https://example.com/document.pdf"}]
     }
     ```

  5. Makes a POST request to the `/v1alpha/convert/source` endpoint (a standalone `curl` sketch follows this list)
  6. Processes the response and stores it in the key-value store
- **Dependencies**:
  - Node.js for Apify CLI
  - Essential tools (curl, jq, etc.) copied from build stage
- **Security**: Runs as a non-root user for enhanced security
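
For reference, step 5 is a plain HTTP call. The sketch below mirrors what `actor.sh` does internally, assuming docling-serve is already listening on `localhost:5001` and that `request.json` holds a payload like the one above (the script also adds `"return_as_file": true`, which is why the response is saved as a ZIP archive):

```bash
# POST a conversion request to a locally running docling-serve API --
# the same call actor.sh issues -- and save the returned archive.
curl -s -X POST \
  -H "content-type: application/json" \
  --data-binary @request.json \
  -o output.zip \
  "http://localhost:5001/v1alpha/convert/source"
```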

## License

This wrapper project is under the MIT License, matching the original Docling license. See [LICENSE](../LICENSE) for details.

## Acknowledgments

- [Docling](https://ds4sd.github.io/docling/) and [docling-serve-cpu](https://quay.io/repository/ds4sd/docling-serve-cpu) by IBM
- [Apify](https://apify.com/?fpr=docling) for the serverless actor environment

## Security Considerations

- Actor runs under a non-root user for enhanced security
- Input URLs are validated before processing
- Temporary files are securely managed and cleaned up
- Process isolation through Docker containerization
- Secure handling of processing artifacts
11  .actor/actor.json  Normal file

@@ -0,0 +1,11 @@
{
  "actorSpecification": 1,
  "name": "docling",
  "version": "0.0",
  "environmentVariables": {},
  "dockerFile": "./Dockerfile",
  "input": "./input_schema.json",
  "scripts": {
    "run": "./actor.sh"
  }
}
419  .actor/actor.sh  Executable file

@@ -0,0 +1,419 @@
#!/bin/bash

export PATH=$PATH:/build-files/node_modules/.bin

# Function to upload content to the key-value store
upload_to_kvs() {
    local content_file="$1"
    local key_name="$2"
    local content_type="$3"
    local description="$4"

    # Find the Apify CLI command
    find_apify_cmd
    local apify_cmd="$FOUND_APIFY_CMD"

    if [ -n "$apify_cmd" ]; then
        echo "Uploading $description to key-value store (key: $key_name)..."

        # Create a temporary home directory with write permissions
        setup_temp_environment

        # Use the --no-update-notifier flag if available
        if $apify_cmd --help | grep -q "\--no-update-notifier"; then
            if $apify_cmd --no-update-notifier actor:set-value "$key_name" --contentType "$content_type" < "$content_file"; then
                echo "Successfully uploaded $description to key-value store"
                local url="https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/$key_name"
                echo "$description available at: $url"
                cleanup_temp_environment
                return 0
            fi
        else
            # Fall back to regular command if flag isn't available
            if $apify_cmd actor:set-value "$key_name" --contentType "$content_type" < "$content_file"; then
                echo "Successfully uploaded $description to key-value store"
                local url="https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/$key_name"
                echo "$description available at: $url"
                cleanup_temp_environment
                return 0
            fi
        fi

        echo "ERROR: Failed to upload $description to key-value store"
        cleanup_temp_environment
        return 1
    else
        echo "ERROR: Apify CLI not found for $description upload"
        return 1
    fi
}

# Function to find Apify CLI command
find_apify_cmd() {
    FOUND_APIFY_CMD=""
    for cmd in "apify" "actor" "/usr/local/bin/apify" "/usr/bin/apify" "/opt/apify/cli/bin/apify"; do
        if command -v "$cmd" &> /dev/null; then
            FOUND_APIFY_CMD="$cmd"
            break
        fi
    done
}

# Function to set up temporary environment for Apify CLI
setup_temp_environment() {
    export TMPDIR="/tmp/apify-home-${RANDOM}"
    mkdir -p "$TMPDIR"
    export APIFY_DISABLE_VERSION_CHECK=1
    export NODE_OPTIONS="--no-warnings"
    export HOME="$TMPDIR"  # Override home directory to writable location
}

# Function to clean up temporary environment
cleanup_temp_environment() {
    rm -rf "$TMPDIR" 2>/dev/null || true
}

# Function to push data to Apify dataset
push_to_dataset() {
    # Example usage: push_to_dataset "$RESULT_URL" "$OUTPUT_SIZE" "zip"

    local result_url="$1"
    local size="$2"
    local format="$3"

    # Find Apify CLI command
    find_apify_cmd
    local apify_cmd="$FOUND_APIFY_CMD"

    if [ -n "$apify_cmd" ]; then
        echo "Adding record to dataset..."
        setup_temp_environment

        # Use the --no-update-notifier flag if available
        if $apify_cmd --help | grep -q "\--no-update-notifier"; then
            if $apify_cmd --no-update-notifier actor:push-data "{\"output_file\": \"${result_url}\", \"format\": \"${format}\", \"size\": \"${size}\", \"status\": \"success\"}"; then
                echo "Successfully added record to dataset"
            else
                echo "Warning: Failed to add record to dataset"
            fi
        else
            # Fall back to regular command
            if $apify_cmd actor:push-data "{\"output_file\": \"${result_url}\", \"format\": \"${format}\", \"size\": \"${size}\", \"status\": \"success\"}"; then
                echo "Successfully added record to dataset"
            else
                echo "Warning: Failed to add record to dataset"
            fi
        fi

        cleanup_temp_environment
    fi
}

# --- Setup logging and error handling ---

LOG_FILE="/tmp/docling.log"
touch "$LOG_FILE" || {
    echo "Fatal: Cannot create log file at $LOG_FILE"
    exit 1
}

# Log to both console and file
exec 1> >(tee -a "$LOG_FILE")
exec 2> >(tee -a "$LOG_FILE" >&2)

# Exit codes
readonly ERR_API_UNAVAILABLE=15
readonly ERR_INVALID_INPUT=16

# --- Debug environment ---

echo "Date: $(date)"
echo "Python version: $(python --version 2>&1)"
echo "Docling-serve path: $(which docling-serve 2>/dev/null || echo 'Not found')"
echo "Working directory: $(pwd)"

# --- Get input ---

echo "Getting Apify Actor Input"
INPUT=$(apify actor get-input 2>/dev/null)

# --- Setup tools ---

echo "Setting up tools..."
TOOLS_DIR="/tmp/docling-tools"
mkdir -p "$TOOLS_DIR"

# Copy tools if available
if [ -d "/build-files" ]; then
    echo "Copying tools from /build-files..."
    cp -r /build-files/* "$TOOLS_DIR/"
    export PATH="$TOOLS_DIR/bin:$PATH"
else
    echo "Warning: No build files directory found. Some tools may be unavailable."
fi

# Copy Python processor script to tools directory
PYTHON_SCRIPT_PATH="$(dirname "$0")/docling_processor.py"
if [ -f "$PYTHON_SCRIPT_PATH" ]; then
    echo "Copying Python processor script to tools directory..."
    cp "$PYTHON_SCRIPT_PATH" "$TOOLS_DIR/"
    chmod +x "$TOOLS_DIR/docling_processor.py"
else
    echo "ERROR: Python processor script not found at $PYTHON_SCRIPT_PATH"
    exit 1
fi

# Check OCR directories and ensure they're writable
echo "Checking OCR directory permissions..."
OCR_DIR="/opt/app-root/src/.EasyOCR"
if [ -d "$OCR_DIR" ]; then
    # Test if we can write to the directory
    if touch "$OCR_DIR/test_write" 2>/dev/null; then
        echo "[✓] OCR directory is writable"
        rm "$OCR_DIR/test_write"
    else
        echo "[✗] OCR directory is not writable, setting up alternative in /tmp"

        # Create alternative in /tmp (which is writable)
        mkdir -p "/tmp/.EasyOCR/user_network"
        export EASYOCR_MODULE_PATH="/tmp/.EasyOCR"
    fi
else
    echo "OCR directory not found, creating in /tmp"
    mkdir -p "/tmp/.EasyOCR/user_network"
    export EASYOCR_MODULE_PATH="/tmp/.EasyOCR"
fi

# --- Starting the API ---

echo "Starting docling-serve API..."

# Create a dedicated working directory in /tmp (writable)
API_DIR="/tmp/docling-api"
mkdir -p "$API_DIR"
cd "$API_DIR"
echo "API working directory: $(pwd)"

# Find docling-serve executable
DOCLING_SERVE_PATH=$(which docling-serve)
echo "Docling-serve executable: $DOCLING_SERVE_PATH"

# Start the API with minimal parameters to avoid any issues
echo "Starting docling-serve API..."
"$DOCLING_SERVE_PATH" run --host 0.0.0.0 --port 5001 > "$API_DIR/docling-serve.log" 2>&1 &
API_PID=$!
echo "Started docling-serve API with PID: $API_PID"

# A more reliable wait for API startup
echo "Waiting for API to initialize..."
MAX_TRIES=30
tries=0
started=false

while [ $tries -lt $MAX_TRIES ]; do
    tries=$((tries + 1))

    # Check if process is still running
    if ! ps -p $API_PID > /dev/null; then
        echo "ERROR: docling-serve API process terminated unexpectedly after $tries seconds"
        break
    fi

    # Check log for startup completion or errors
    if grep -q "Application startup complete" "$API_DIR/docling-serve.log" 2>/dev/null; then
        echo "[✓] API startup completed successfully after $tries seconds"
        started=true
        break
    fi

    if grep -q "Permission denied\|PermissionError" "$API_DIR/docling-serve.log" 2>/dev/null; then
        echo "ERROR: Permission errors detected in API startup"
        break
    fi

    # Sleep and check again
    sleep 1

    # Output a progress indicator every 5 seconds
    if [ $((tries % 5)) -eq 0 ]; then
        echo "Still waiting for API startup... ($tries/$MAX_TRIES seconds)"
    fi
done

# Show log content regardless of outcome
echo "docling-serve log output so far:"
tail -n 20 "$API_DIR/docling-serve.log"

# Verify the API is running
if ! ps -p $API_PID > /dev/null; then
    echo "ERROR: docling-serve API failed to start"
    if [ -f "$API_DIR/docling-serve.log" ]; then
        echo "Full log output:"
        cat "$API_DIR/docling-serve.log"
    fi
    exit $ERR_API_UNAVAILABLE
fi

if [ "$started" != "true" ]; then
    echo "WARNING: API process is running but startup completion was not detected"
    echo "Will attempt to continue anyway..."
fi

# Try to verify API is responding at this point
echo "Verifying API responsiveness..."
(python -c "
import sys, time, socket
for i in range(5):
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(1)
        result = s.connect_ex(('localhost', 5001))
        if result == 0:
            s.close()
            print('Port 5001 is open and accepting connections')
            sys.exit(0)
        s.close()
    except Exception as e:
        pass
    time.sleep(1)
print('Could not connect to API port after 5 attempts')
sys.exit(1)
" && echo "API verification succeeded") || echo "API verification failed, but continuing anyway"

# Define API endpoint
DOCLING_API_ENDPOINT="http://localhost:5001/v1alpha/convert/source"

# --- Processing document ---

echo "Starting document processing..."
echo "Reading input from Apify..."

echo "Input content:" >&2
echo "$INPUT" >&2   # Send the raw input to stderr for debugging
echo "$INPUT"       # Send the clean JSON to stdout for processing

# Create the request JSON

REQUEST_JSON=$(echo $INPUT | jq '.options += {"return_as_file": true}')

echo "Creating request JSON:" >&2
echo "$REQUEST_JSON" >&2
echo "$REQUEST_JSON" > "$API_DIR/request.json"

# Send the conversion request using our Python script
#echo "Sending conversion request to docling-serve API..."
#python "$TOOLS_DIR/docling_processor.py" \
#    --api-endpoint "$DOCLING_API_ENDPOINT" \
#    --request-json "$API_DIR/request.json" \
#    --output-dir "$API_DIR" \
#    --output-format "$OUTPUT_FORMAT"

echo "Curl the Docling API"
curl -s -H "content-type: application/json" -X POST --data-binary @$API_DIR/request.json -o $API_DIR/output.zip $DOCLING_API_ENDPOINT

CURL_EXIT_CODE=$?

# --- Check for various potential output files ---

echo "Checking for output files..."
if [ -f "$API_DIR/output.zip" ]; then
    echo "Conversion completed successfully! Output file found."

    # Get content from the converted file
    OUTPUT_SIZE=$(wc -c < "$API_DIR/output.zip")
    echo "Output file found with size: $OUTPUT_SIZE bytes"

    # Calculate the access URL for result display
    RESULT_URL="https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/OUTPUT"

    echo "=============================="
    echo "PROCESSING COMPLETE!"
    echo "Output size: ${OUTPUT_SIZE} bytes"
    echo "=============================="

    # Set the output content type based on format
    CONTENT_TYPE="application/zip"

    # Upload the document content using our function
    upload_to_kvs "$API_DIR/output.zip" "OUTPUT" "$CONTENT_TYPE" "Document content"

    # Only proceed with dataset record if document upload succeeded
    if [ $? -eq 0 ]; then
        echo "Your document is available at: ${RESULT_URL}"
        echo "=============================="

        # Push data to dataset
        push_to_dataset "$RESULT_URL" "$OUTPUT_SIZE" "zip"
    fi
else
    echo "ERROR: No converted output file found at $API_DIR/output.zip"

    # Create error metadata
    ERROR_METADATA="{\"status\":\"error\",\"error\":\"No converted output file found\",\"documentUrl\":\"$DOCUMENT_URL\"}"
    echo "$ERROR_METADATA" > "/tmp/actor-output/OUTPUT"
    chmod 644 "/tmp/actor-output/OUTPUT"

    echo "Error information has been saved to /tmp/actor-output/OUTPUT"
fi

# --- Verify output files for debugging ---

echo "=== Final Output Verification ==="
echo "Files in /tmp/actor-output:"
ls -la /tmp/actor-output/ 2>/dev/null || echo "Cannot list /tmp/actor-output/"

echo "All operations completed. The output should be available in the default key-value store."
echo "Content URL: ${RESULT_URL:-No URL available}"

# --- Cleanup function ---

cleanup() {
    echo "Running cleanup..."

    # Stop the API process
    if [ -n "$API_PID" ]; then
        echo "Stopping docling-serve API (PID: $API_PID)..."
        kill $API_PID 2>/dev/null || true
    fi

    # Export log file to KVS if it exists
    # DO THIS BEFORE REMOVING TOOLS DIRECTORY
    if [ -f "$LOG_FILE" ]; then
        if [ -s "$LOG_FILE" ]; then
            echo "Log file is not empty, pushing to key-value store (key: LOG)..."

            # Upload log using our function
            upload_to_kvs "$LOG_FILE" "LOG" "text/plain" "Log file"
        else
            echo "Warning: log file exists but is empty"
        fi
    else
        echo "Warning: No log file found"
    fi

    # Clean up temporary files AFTER log is uploaded
    echo "Cleaning up temporary files..."
    if [ -d "$API_DIR" ]; then
        echo "Removing API working directory: $API_DIR"
        rm -rf "$API_DIR" 2>/dev/null || echo "Warning: Failed to remove $API_DIR"
    fi

    if [ -d "$TOOLS_DIR" ]; then
        echo "Removing tools directory: $TOOLS_DIR"
        rm -rf "$TOOLS_DIR" 2>/dev/null || echo "Warning: Failed to remove $TOOLS_DIR"
    fi

    # Keep log file until the very end
    echo "Script execution completed at $(date)"
    echo "Actor execution completed"
}

# Register cleanup
trap cleanup EXIT
31  .actor/dataset_schema.json  Normal file

@@ -0,0 +1,31 @@
{
  "title": "Docling Actor Dataset",
  "description": "Records of document processing results from the Docling Actor",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    "url": {
      "title": "Document URL",
      "type": "string",
      "description": "URL of the processed document"
    },
    "output_file": {
      "title": "Result URL",
      "type": "string",
      "description": "Direct URL to the processed result in key-value store"
    },
    "status": {
      "title": "Processing Status",
      "type": "string",
      "description": "Status of the document processing",
      "enum": ["success", "error"]
    },
    "error": {
      "title": "Error Details",
      "type": "string",
      "description": "Error message if processing failed",
      "optional": true
    }
  },
  "required": ["url", "output_file", "status"]
}
27  .actor/input_schema.json  Normal file

@@ -0,0 +1,27 @@
{
  "title": "Docling Actor Input",
  "description": "Options for processing documents with Docling via the docling-serve API.",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    "http_sources": {
      "title": "Document URLs",
      "type": "array",
      "description": "URLs of documents to process. Supported formats: PDF, DOCX, PPTX, XLSX, HTML, MD, XML, images, and more.",
      "editor": "json",
      "prefill": [
        { "url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf" }
      ]
    },
    "options": {
      "title": "Processing Options",
      "type": "object",
      "description": "Document processing configuration options",
      "editor": "json",
      "prefill": {
        "to_formats": ["md"]
      }
    }
  },
  "required": ["options", "http_sources"]
}
37  CHANGELOG.md

@@ -1,3 +1,40 @@
## [v2.28.0](https://github.com/docling-project/docling/releases/tag/v2.28.0) - 2025-03-19

### Feature

* **SmolDocling:** Support MLX acceleration in VLM pipeline ([#1199](https://github.com/docling-project/docling/issues/1199)) ([`1c26769`](https://github.com/docling-project/docling/commit/1c26769785bcd17c0b8b621c5182ad81134d3915))
* Add PPTX notes slides ([#474](https://github.com/docling-project/docling/issues/474)) ([`b454aa1`](https://github.com/docling-project/docling/commit/b454aa1551b891644ce4028ed2d7ec8f82c167ab))
* Updated vlm pipeline (with latest changes from docling-core) ([#1158](https://github.com/docling-project/docling/issues/1158)) ([`2f72167`](https://github.com/docling-project/docling/commit/2f72167ff6421424dea4d93018b0d43af16ec153))

### Fix

* Determine correct page size in DoclingParseV4Backend ([#1196](https://github.com/docling-project/docling/issues/1196)) ([`f5adfb9`](https://github.com/docling-project/docling/commit/f5adfb9724aae1207f23e21d74033f331e6e1ffb))
* **msword:** Fixing function return in equations handling ([#1194](https://github.com/docling-project/docling/issues/1194)) ([`0b707d0`](https://github.com/docling-project/docling/commit/0b707d0882f5be42505871799387d0b1882bffbf))

### Documentation

* Linux Foundation AI & Data ([#1183](https://github.com/docling-project/docling/issues/1183)) ([`1d680b0`](https://github.com/docling-project/docling/commit/1d680b0a321d95fc6bd65b7bb4d5e15005a0250a))
* Move apify to docs ([#1182](https://github.com/docling-project/docling/issues/1182)) ([`54a78c3`](https://github.com/docling-project/docling/commit/54a78c307de833b93f9b84cf1f8ed6dace8573cb))

## [v2.27.0](https://github.com/docling-project/docling/releases/tag/v2.27.0) - 2025-03-18

### Feature

* Add factory for ocr engines via plugins ([#1010](https://github.com/docling-project/docling/issues/1010)) ([`6eaae3c`](https://github.com/docling-project/docling/commit/6eaae3cba034599020dc06ebdad3bc3ff0b5a8eb))
* Add DoclingParseV4 backend, using high-level docling-parse API ([#905](https://github.com/docling-project/docling/issues/905)) ([`3960b19`](https://github.com/docling-project/docling/commit/3960b199d63d0e9d660aeb0cbced02b38bb0b593))
* **actor:** Docling Actor on Apify infrastructure ([#875](https://github.com/docling-project/docling/issues/875)) ([`772487f`](https://github.com/docling-project/docling/commit/772487f9c91ad2ee53c591c314c72443f9cbfd23))
* Equations to latex in MSWord backend (with inline groups) ([#1114](https://github.com/docling-project/docling/issues/1114)) ([`6eb718f`](https://github.com/docling-project/docling/commit/6eb718f8493038d1b4b6ae836df5a24aa13cd17e))

### Fix

* **html:** Handle nested empty lists ([#1154](https://github.com/docling-project/docling/issues/1154)) ([`f94da44`](https://github.com/docling-project/docling/commit/f94da44ec5c7a8c92b9dd60e4df5dc945ed6d1ea))
* Use first table row as col headers ([#1156](https://github.com/docling-project/docling/issues/1156)) ([`0945973`](https://github.com/docling-project/docling/commit/0945973b79d67b74281aba5102ee985ac1de74ea))
* Pass tests, update docling-core to 2.22.0 ([#1150](https://github.com/docling-project/docling/issues/1150)) ([`aa92a57`](https://github.com/docling-project/docling/commit/aa92a57fa9e7228e894efb9050a0cdb9f287ebfd))

### Documentation

* Fix spelling of picture in usage ([#1165](https://github.com/docling-project/docling/issues/1165)) ([`7e01798`](https://github.com/docling-project/docling/commit/7e01798417c424c05685e0ff5f6f89f70dc3bfcd))

## [v2.26.0](https://github.com/docling-project/docling/releases/tag/v2.26.0) - 2025-03-11

### Feature
@@ -1,129 +1,3 @@
# Contributor Covenant Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.

## Our Standards

Examples of behavior that contributes to a positive environment for our community include:

* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience
* Focusing on what is best not just for us as individuals, but for the overall community

Examples of unacceptable behavior include:

* The use of sexualized language or imagery, and sexual attention or advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a professional setting

## Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.

Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.

## Scope

This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement using [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).

All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the reporter of any incident.

## Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:

### 1. Correction

**Community Impact**: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.

**Consequence**: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.

### 2. Warning

**Community Impact**: A violation through a single incident or series of actions.

**Consequence**: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.

### 3. Temporary Ban

**Community Impact**: A serious violation of community standards, including sustained inappropriate behavior.

**Consequence**: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.

### 4. Permanent Ban

**Community Impact**: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.

**Consequence**: A permanent ban from any sort of public interaction within the community.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 2.0, available at [https://www.contributor-covenant.org/version/2/0/code_of_conduct.html](https://www.contributor-covenant.org/version/2/0/code_of_conduct.html).

Community Impact Guidelines were inspired by [Mozilla's code of conduct enforcement ladder](https://github.com/mozilla/diversity).

Homepage: [https://www.contributor-covenant.org](https://www.contributor-covenant.org)

For answers to common questions about this code of conduct, see the FAQ at [https://www.contributor-covenant.org/faq](https://www.contributor-covenant.org/faq). Translations are available at [https://www.contributor-covenant.org/translations](https://www.contributor-covenant.org/translations).

This project adheres to the [Docling - Code of Conduct and Covenant](https://github.com/docling-project/community/blob/main/CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code.

@@ -2,85 +2,7 @@
Our project welcomes external contributions. If you have an itch, please feel free to scratch it.

To contribute code or documentation, please submit a [pull request](https://github.com/docling-project/docling/pulls).

A good way to familiarize yourself with the codebase and contribution process is to look for and tackle low-hanging fruit in the [issue tracker](https://github.com/docling-project/docling/issues). Before embarking on a more ambitious contribution, please quickly [get in touch](#communication) with us.

For general questions or support requests, please refer to the [discussion section](https://github.com/docling-project/docling/discussions).

**Note: We appreciate your effort and want to avoid situations where a contribution requires extensive rework (by you or by us), sits in the backlog for a long time, or cannot be accepted at all!**

### Proposing New Features

If you would like to implement a new feature, please [raise an issue](https://github.com/docling-project/docling/issues) before sending a pull request so the feature can be discussed. This is to avoid you spending valuable time working on a feature that the project developers are not interested in accepting into the codebase.

### Fixing Bugs

If you would like to fix a bug, please [raise an issue](https://github.com/docling-project/docling/issues) before sending a pull request so it can be tracked.

### Merge Approval

The project maintainers use LGTM (Looks Good To Me) in comments on the code review to indicate acceptance. A change requires LGTMs from two of the maintainers of each component affected.

For a list of the maintainers, see the [MAINTAINERS.md](MAINTAINERS.md) page.

## Legal

Each source file must include a license header for the MIT Software. Using the SPDX format is the simplest approach, e.g.

```
/*
Copyright IBM Inc. All rights reserved.

SPDX-License-Identifier: MIT
*/
```

We have tried to make it as easy as possible to make contributions. This applies to how we handle the legal aspects of contribution. We use the same approach - the [Developer's Certificate of Origin 1.1 (DCO)](https://github.com/hyperledger/fabric/blob/master/docs/source/DCO1.1.txt) - that the Linux® Kernel [community](https://elinux.org/Developer_Certificate_Of_Origin) uses to manage code contributions.

We simply ask that when submitting a patch for review, the developer must include a sign-off statement in the commit message.

Here is an example Signed-off-by line, which indicates that the submitter accepts the DCO:

```
Signed-off-by: John Doe <john.doe@example.com>
```

You can include this automatically when you commit a change to your local git repository using the following command:

```
git commit -s
```

### New dependencies

This project strictly adheres to using dependencies that are compatible with the MIT license to ensure maximum flexibility and permissiveness in its usage and distribution. As a result, dependencies licensed under restrictive terms such as GPL, LGPL, AGPL, or similar are explicitly excluded. These licenses impose additional requirements and limitations that are incompatible with the MIT license's minimal restrictions, potentially affecting derivative works and redistribution. By maintaining this policy, the project ensures simplicity and freedom for both developers and users, avoiding conflicts with stricter copyleft provisions.

## Communication

Please feel free to connect with us using the [discussion section](https://github.com/docling-project/docling/discussions).

For more details on the contributing guidelines head to the Docling Project [community repository](https://github.com/docling-project/community).

## Developing

@@ -4,7 +4,7 @@ ENV GIT_SSH_COMMAND="ssh -o StrictHostKeyChecking=no"

RUN apt-get update \
    && apt-get install -y libgl1 libglib2.0-0 curl wget git procps \
    && apt-get clean
    && rm -rf /var/lib/apt/lists/*

# This will install torch with *only* cpu support
# Remove the --extra-index-url part if you want to install all the gpu requirements

@@ -2,9 +2,6 @@

- Christoph Auer - [@cau-git](https://github.com/cau-git)
- Michele Dolfi - [@dolfim-ibm](https://github.com/dolfim-ibm)
- Maxim Lysak - [@maxmnemonic](https://github.com/maxmnemonic)
- Nikos Livathinos - [@nikos-livathinos](https://github.com/nikos-livathinos)
- Ahmed Nassar - [@nassarofficial](https://github.com/nassarofficial)
- Panos Vagenas - [@vagenas](https://github.com/vagenas)
- Peter Staar - [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)

30  README.md

@@ -21,6 +21,8 @@
[](https://github.com/pre-commit/pre-commit)
[](https://opensource.org/licenses/MIT)
[](https://pepy.tech/projects/docling)
[](https://apify.com/vancura/docling)
[](https://lfaidata.foundation/projects/)

Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.

@@ -33,12 +35,12 @@ Docling simplifies document processing, parsing diverse formats — including ad
* 🔒 Local execution capabilities for sensitive data and air-gapped environments
* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 Extensive OCR support for scanned PDFs and images
* 🥚 Support of Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview)) 🆕
* 💻 Simple and convenient CLI

### Coming soon

* 📝 Metadata extraction, including title, authors, references & language
* 📝 Inclusion of Visual Language Models ([SmolDocling](https://huggingface.co/blog/smolervlm#smoldocling))
* 📝 Chart understanding (Barchart, Piechart, LinePlot, etc)
* 📝 Complex chemistry understanding (Molecular structures)

@@ -55,7 +57,7 @@ More [detailed installation instructions](https://docling-project.github.io/docl

## Getting started

To convert individual documents, use `convert()`, for example:
To convert individual documents with python, use `convert()`, for example:

```python
from docling.document_converter import DocumentConverter

@@ -69,6 +71,22 @@ print(result.document.export_to_markdown()) # output: "## Docling Technical Rep
More [advanced usage options](https://docling-project.github.io/docling/usage/) are available in
the docs.

## CLI

Docling has a built-in CLI to run conversions.

```bash
docling https://arxiv.org/pdf/2206.01062
```

You can also use 🥚[SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview) and other VLMs via Docling CLI:

```bash
docling --pipeline vlm --vlm-model smoldocling https://arxiv.org/pdf/2206.01062
```

This will use MLX acceleration on supported Apple Silicon hardware.

Read more [here](https://docling-project.github.io/docling/usage/)

## Documentation

Check out Docling's [documentation](https://docling-project.github.io/docling/), for details on

@@ -119,9 +137,13 @@ If you use Docling in your projects, please consider citing the following:
The Docling codebase is under MIT license.
For individual model usage, please refer to the model licenses found in the original packages.

## IBM ❤️ Open Source AI
## LF AI & Data

Docling has been brought to you by IBM.
Docling is hosted as a project in the [LF AI & Data Foundation](https://lfaidata.foundation/projects/).

### IBM ❤️ Open Source AI

The project was started by the AI for knowledge team at IBM Research Zurich.

[supported_formats]: https://docling-project.github.io/docling/usage/supported_formats/
[docling_document]: https://docling-project.github.io/docling/concepts/docling_document/

@@ -112,23 +112,30 @@ class DoclingParseV4PageBackend(PdfPageBackend):
        padbox.r = page_size.width - padbox.r
        padbox.t = page_size.height - padbox.t

        image = (
            self._ppage.render(
                scale=scale * 1.5,
                rotation=0,  # no additional rotation
                crop=padbox.as_tuple(),
            )
            .to_pil()
            .resize(size=(round(cropbox.width * scale), round(cropbox.height * scale)))
        )  # We resize the image from 1.5x the given scale to make it sharper.
        with pypdfium2_lock:
            image = (
                self._ppage.render(
                    scale=scale * 1.5,
                    rotation=0,  # no additional rotation
                    crop=padbox.as_tuple(),
                )
                .to_pil()
                .resize(
                    size=(round(cropbox.width * scale), round(cropbox.height * scale))
                )
            )  # We resize the image from 1.5x the given scale to make it sharper.

        return image

    def get_size(self) -> Size:
        return Size(
            width=self._dpage.dimension.width,
            height=self._dpage.dimension.height,
        )
        with pypdfium2_lock:
            return Size(width=self._ppage.get_width(), height=self._ppage.get_height())

            # TODO: Take width and height from docling-parse.
            # return Size(
            #     width=self._dpage.dimension.width,
            #     height=self._dpage.dimension.height,
            # )

    def unload(self):
        self._ppage = None

@ -16,6 +16,7 @@ from docling_core.types.doc import (
|
||||
TableCell,
|
||||
TableData,
|
||||
)
|
||||
from docling_core.types.doc.document import ContentLayer
|
||||
from PIL import Image, UnidentifiedImageError
|
||||
from pptx import Presentation
|
||||
from pptx.enum.shapes import MSO_SHAPE_TYPE, PP_PLACEHOLDER
|
||||
@ -421,4 +422,21 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
|
||||
for shape in slide.shapes:
|
||||
handle_shapes(shape, parent_slide, slide_ind, doc, slide_size)
|
||||
|
||||
# Handle notes slide
|
||||
if slide.has_notes_slide:
|
||||
notes_slide = slide.notes_slide
|
||||
notes_text = notes_slide.notes_text_frame.text.strip()
|
||||
if notes_text:
|
||||
bbox = BoundingBox(l=0, t=0, r=0, b=0)
|
||||
prov = ProvenanceItem(
|
||||
page_no=slide_ind + 1, charspan=[0, len(notes_text)], bbox=bbox
|
||||
)
|
||||
doc.add_text(
|
||||
label=DocItemLabel.TEXT,
|
||||
parent=parent_slide,
|
||||
text=notes_text,
|
||||
prov=prov,
|
||||
content_layer=ContentLayer.FURNITURE,
|
||||
)
|
||||
|
||||
return doc
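A note on the notes-slide handling added above: speaker notes are attached as plain text items on the furniture content layer. Below is a minimal sketch of how one might read them back after conversion; the `content_layer` attribute access and the `slides.pptx` filename are assumptions, not part of this diff.

```python
# Hedged sketch: retrieving the speaker notes added by this change.
# Assumes text items expose a `content_layer` attribute matching the value passed to
# doc.add_text(...) above; "slides.pptx" is a hypothetical input file.
from docling_core.types.doc.document import ContentLayer

from docling.document_converter import DocumentConverter

doc = DocumentConverter().convert("slides.pptx").document
notes = [
    t.text
    for t in doc.texts
    if getattr(t, "content_layer", None) == ContentLayer.FURNITURE
]
print(notes)
```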
|
||||
|
@ -53,6 +53,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
||||
self.max_levels: int = 10
|
||||
self.level_at_new_list: Optional[int] = None
|
||||
self.parents: dict[int, Optional[NodeItem]] = {}
|
||||
self.numbered_headers: dict[int, int] = {}
|
||||
for i in range(-1, self.max_levels):
|
||||
self.parents[i] = None
|
||||
|
||||
@ -275,8 +276,10 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
||||
only_equations.append(latex_equation)
|
||||
texts_and_equations.append(latex_equation)
|
||||
|
||||
if "".join(only_texts) != text:
|
||||
return text
|
||||
if "".join(only_texts).strip() != text.strip():
|
||||
# If we are not able to reconstruct the initial raw text
|
||||
# do not try to parse equations and return the original
|
||||
return text, []
|
||||
|
||||
return "".join(texts_and_equations), only_equations
|
||||
|
||||
@ -344,7 +347,14 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
||||
parent=None, label=DocItemLabel.TITLE, text=text
|
||||
)
|
||||
elif "Heading" in p_style_id:
|
||||
self.add_header(doc, p_level, text)
|
||||
style_element = getattr(paragraph.style, "element", None)
|
||||
if style_element:
|
||||
is_numbered_style = (
|
||||
"<w:numPr>" in style_element.xml or "<w:numPr>" in element.xml
|
||||
)
|
||||
else:
|
||||
is_numbered_style = False
|
||||
self.add_header(doc, p_level, text, is_numbered_style)
|
||||
|
||||
elif len(equations) > 0:
|
||||
if (raw_text is None or len(raw_text) == 0) and len(text) > 0:
|
||||
@ -365,6 +375,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
||||
for eq in equations:
|
||||
if len(text_tmp) == 0:
|
||||
break
|
||||
|
||||
pre_eq_text = text_tmp.split(eq, maxsplit=1)[0]
|
||||
text_tmp = text_tmp.split(eq, maxsplit=1)[1]
|
||||
if len(pre_eq_text) > 0:
|
||||
@ -412,7 +423,11 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
||||
return
|
||||
|
||||
def add_header(
|
||||
self, doc: DoclingDocument, curr_level: Optional[int], text: str
|
||||
self,
|
||||
doc: DoclingDocument,
|
||||
curr_level: Optional[int],
|
||||
text: str,
|
||||
is_numbered_style: bool = False,
|
||||
) -> None:
|
||||
level = self.get_level()
|
||||
if isinstance(curr_level, int):
|
||||
@ -430,17 +445,44 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
||||
if key >= curr_level:
|
||||
self.parents[key] = None
|
||||
|
||||
self.parents[curr_level] = doc.add_heading(
|
||||
parent=self.parents[curr_level - 1],
|
||||
text=text,
|
||||
level=curr_level,
|
||||
)
|
||||
current_level = curr_level
|
||||
parent_level = curr_level - 1
|
||||
add_level = curr_level
|
||||
else:
|
||||
self.parents[self.level] = doc.add_heading(
|
||||
parent=self.parents[self.level - 1],
|
||||
text=text,
|
||||
level=1,
|
||||
)
|
||||
current_level = self.level
|
||||
parent_level = self.level - 1
|
||||
add_level = 1
|
||||
|
||||
if is_numbered_style:
|
||||
if add_level in self.numbered_headers:
|
||||
self.numbered_headers[add_level] += 1
|
||||
else:
|
||||
self.numbered_headers[add_level] = 1
|
||||
text = f"{self.numbered_headers[add_level]} {text}"
|
||||
|
||||
# Reset deeper levels
|
||||
next_level = add_level + 1
|
||||
while next_level in self.numbered_headers:
|
||||
self.numbered_headers[next_level] = 0
|
||||
next_level += 1
|
||||
|
||||
# Scan upper levels
|
||||
previous_level = add_level - 1
|
||||
while previous_level in self.numbered_headers:
|
||||
# MSWord convention: no empty sublevels
|
||||
# I.e., sub-sub section (2.0.1) without a sub-section (2.1)
|
||||
# is processed as 2.1.1
|
||||
if self.numbered_headers[previous_level] == 0:
|
||||
self.numbered_headers[previous_level] += 1
|
||||
|
||||
text = f"{self.numbered_headers[previous_level]}.{text}"
|
||||
previous_level -= 1
|
||||
|
||||
self.parents[current_level] = doc.add_heading(
|
||||
parent=self.parents[parent_level],
|
||||
text=text,
|
||||
level=add_level,
|
||||
)
|
||||
return
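To make the numbering convention above concrete, here is a small self-contained sketch that mirrors the counter logic (per-level counters, deeper levels reset, empty upper levels bumped so 2.0.1 becomes 2.1.1). It re-implements the behaviour outside the backend purely for illustration.

```python
# Hedged standalone sketch of the numbered-heading scheme introduced above.
def number_headings(headings):
    counters: dict[int, int] = {}
    out = []
    for level, title in headings:
        counters[level] = counters.get(level, 0) + 1
        text = f"{counters[level]} {title}"
        # Reset deeper levels
        deeper = level + 1
        while deeper in counters:
            counters[deeper] = 0
            deeper += 1
        # Prefix upper levels, bumping empty ones (MS Word convention: 2.0.1 -> 2.1.1)
        upper = level - 1
        while upper in counters:
            if counters[upper] == 0:
                counters[upper] += 1
            text = f"{counters[upper]}.{text}"
            upper -= 1
        out.append(text)
    return out

print(number_headings([(1, "Intro"), (2, "Scope"), (1, "Methods"), (3, "Detail")]))
# ['1 Intro', '1.1 Scope', '2 Methods', '2.1.1 Detail']
```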
|
||||
|
||||
def add_listitem(
|
||||
|
@ -9,6 +9,7 @@ import warnings
|
||||
from pathlib import Path
|
||||
from typing import Annotated, Dict, Iterable, List, Optional, Type
|
||||
|
||||
import rich.table
|
||||
import typer
|
||||
from docling_core.types.doc import ImageRefMode
|
||||
from docling_core.utils.file import resolve_source_to_path
|
||||
@ -30,18 +31,22 @@ from docling.datamodel.pipeline_options import (
|
||||
AcceleratorDevice,
|
||||
AcceleratorOptions,
|
||||
EasyOcrOptions,
|
||||
OcrEngine,
|
||||
OcrMacOptions,
|
||||
OcrOptions,
|
||||
PaginatedPipelineOptions,
|
||||
PdfBackend,
|
||||
PdfPipeline,
|
||||
PdfPipelineOptions,
|
||||
RapidOcrOptions,
|
||||
TableFormerMode,
|
||||
TesseractCliOcrOptions,
|
||||
TesseractOcrOptions,
|
||||
VlmModelType,
|
||||
VlmPipelineOptions,
|
||||
granite_vision_vlm_conversion_options,
|
||||
smoldocling_vlm_conversion_options,
|
||||
smoldocling_vlm_mlx_conversion_options,
|
||||
)
|
||||
from docling.datamodel.settings import settings
|
||||
from docling.document_converter import DocumentConverter, FormatOption, PdfFormatOption
|
||||
from docling.models.factories import get_ocr_factory
|
||||
from docling.pipeline.vlm_pipeline import VlmPipeline
|
||||
|
||||
warnings.filterwarnings(action="ignore", category=UserWarning, module="pydantic|torch")
|
||||
warnings.filterwarnings(action="ignore", category=FutureWarning, module="easyocr")
|
||||
@ -49,8 +54,11 @@ warnings.filterwarnings(action="ignore", category=FutureWarning, module="easyocr
|
||||
_log = logging.getLogger(__name__)
|
||||
from rich.console import Console
|
||||
|
||||
console = Console()
|
||||
err_console = Console(stderr=True)
|
||||
|
||||
ocr_factory_internal = get_ocr_factory(allow_external_plugins=False)
|
||||
ocr_engines_enum_internal = ocr_factory_internal.get_enum()
|
||||
|
||||
app = typer.Typer(
|
||||
name="Docling",
|
||||
@ -78,6 +86,24 @@ def version_callback(value: bool):
|
||||
raise typer.Exit()
|
||||
|
||||
|
||||
def show_external_plugins_callback(value: bool):
|
||||
if value:
|
||||
ocr_factory_all = get_ocr_factory(allow_external_plugins=True)
|
||||
table = rich.table.Table(title="Available OCR engines")
|
||||
table.add_column("Name", justify="right")
|
||||
table.add_column("Plugin")
|
||||
table.add_column("Package")
|
||||
for meta in ocr_factory_all.registered_meta.values():
|
||||
if not meta.module.startswith("docling."):
|
||||
table.add_row(
|
||||
f"[bold]{meta.kind}[/bold]",
|
||||
meta.plugin_name,
|
||||
meta.module.split(".")[0],
|
||||
)
|
||||
rich.print(table)
|
||||
raise typer.Exit()
|
||||
|
||||
|
||||
def export_documents(
|
||||
conv_results: Iterable[ConversionResult],
|
||||
output_dir: Path,
|
||||
@ -182,6 +208,14 @@ def convert(
|
||||
help="Image export mode for the document (only in case of JSON, Markdown or HTML). With `placeholder`, only the position of the image is marked in the output. In `embedded` mode, the image is embedded as base64 encoded string. In `referenced` mode, the image is exported in PNG format and referenced from the main exported document.",
|
||||
),
|
||||
] = ImageRefMode.EMBEDDED,
|
||||
pipeline: Annotated[
|
||||
PdfPipeline,
|
||||
typer.Option(..., help="Choose the pipeline to process PDF or image files."),
|
||||
] = PdfPipeline.STANDARD,
|
||||
vlm_model: Annotated[
|
||||
VlmModelType,
|
||||
typer.Option(..., help="Choose the VLM model to use with PDF or image files."),
|
||||
] = VlmModelType.SMOLDOCLING,
|
||||
ocr: Annotated[
|
||||
bool,
|
||||
typer.Option(
|
||||
@ -196,8 +230,16 @@ def convert(
|
||||
),
|
||||
] = False,
|
||||
ocr_engine: Annotated[
|
||||
OcrEngine, typer.Option(..., help="The OCR engine to use.")
|
||||
] = OcrEngine.EASYOCR,
|
||||
str,
|
||||
typer.Option(
|
||||
...,
|
||||
help=(
|
||||
f"The OCR engine to use. When --allow-external-plugins is *not* set, the available values are: "
|
||||
f"{', '.join((o.value for o in ocr_engines_enum_internal))}. "
|
||||
f"Use the option --show-external-plugins to see the options allowed with external plugins."
|
||||
),
|
||||
),
|
||||
] = EasyOcrOptions.kind,
|
||||
ocr_lang: Annotated[
|
||||
Optional[str],
|
||||
typer.Option(
|
||||
@ -241,6 +283,21 @@ def convert(
|
||||
..., help="Must be enabled when using models connecting to remote services."
|
||||
),
|
||||
] = False,
|
||||
allow_external_plugins: Annotated[
|
||||
bool,
|
||||
typer.Option(
|
||||
..., help="Must be enabled for loading modules from third-party plugins."
|
||||
),
|
||||
] = False,
|
||||
show_external_plugins: Annotated[
|
||||
bool,
|
||||
typer.Option(
|
||||
...,
|
||||
help="List the third-party plugins which are available when the option --allow-external-plugins is set.",
|
||||
callback=show_external_plugins_callback,
|
||||
is_eager=True,
|
||||
),
|
||||
] = False,
|
||||
abort_on_error: Annotated[
|
||||
bool,
|
||||
typer.Option(
|
||||
@ -368,67 +425,88 @@ def convert(
|
||||
export_txt = OutputFormat.TEXT in to_formats
|
||||
export_doctags = OutputFormat.DOCTAGS in to_formats
|
||||
|
||||
if ocr_engine == OcrEngine.EASYOCR:
|
||||
ocr_options: OcrOptions = EasyOcrOptions(force_full_page_ocr=force_ocr)
|
||||
elif ocr_engine == OcrEngine.TESSERACT_CLI:
|
||||
ocr_options = TesseractCliOcrOptions(force_full_page_ocr=force_ocr)
|
||||
elif ocr_engine == OcrEngine.TESSERACT:
|
||||
ocr_options = TesseractOcrOptions(force_full_page_ocr=force_ocr)
|
||||
elif ocr_engine == OcrEngine.OCRMAC:
|
||||
ocr_options = OcrMacOptions(force_full_page_ocr=force_ocr)
|
||||
elif ocr_engine == OcrEngine.RAPIDOCR:
|
||||
ocr_options = RapidOcrOptions(force_full_page_ocr=force_ocr)
|
||||
else:
|
||||
raise RuntimeError(f"Unexpected OCR engine type {ocr_engine}")
|
||||
ocr_factory = get_ocr_factory(allow_external_plugins=allow_external_plugins)
|
||||
ocr_options: OcrOptions = ocr_factory.create_options( # type: ignore
|
||||
kind=ocr_engine,
|
||||
force_full_page_ocr=force_ocr,
|
||||
)
|
||||
|
||||
ocr_lang_list = _split_list(ocr_lang)
|
||||
if ocr_lang_list is not None:
|
||||
ocr_options.lang = ocr_lang_list
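Scripts that previously built `EasyOcrOptions` and friends through the removed if/elif chain can obtain the same objects through the factory introduced here. A brief hedged sketch, using only calls visible in this diff:

```python
# Hedged sketch of the factory path that replaces the removed OcrEngine if/elif chain.
# An unknown kind raises RuntimeError listing the registered kinds.
from docling.models.factories import get_ocr_factory

ocr_factory = get_ocr_factory(allow_external_plugins=False)
ocr_options = ocr_factory.create_options(kind="easyocr", force_full_page_ocr=True)
ocr_options.lang = ["en", "de"]
```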
|
||||
|
||||
accelerator_options = AcceleratorOptions(num_threads=num_threads, device=device)
|
||||
pipeline_options = PdfPipelineOptions(
|
||||
enable_remote_services=enable_remote_services,
|
||||
accelerator_options=accelerator_options,
|
||||
do_ocr=ocr,
|
||||
ocr_options=ocr_options,
|
||||
do_table_structure=True,
|
||||
do_code_enrichment=enrich_code,
|
||||
do_formula_enrichment=enrich_formula,
|
||||
do_picture_description=enrich_picture_description,
|
||||
do_picture_classification=enrich_picture_classes,
|
||||
document_timeout=document_timeout,
|
||||
)
|
||||
pipeline_options.table_structure_options.do_cell_matching = (
|
||||
True # do_cell_matching
|
||||
)
|
||||
pipeline_options.table_structure_options.mode = table_mode
|
||||
pipeline_options: PaginatedPipelineOptions
|
||||
|
||||
if image_export_mode != ImageRefMode.PLACEHOLDER:
|
||||
pipeline_options.generate_page_images = True
|
||||
pipeline_options.generate_picture_images = (
|
||||
True # FIXME: to be deprecated in verson 3
|
||||
if pipeline == PdfPipeline.STANDARD:
|
||||
pipeline_options = PdfPipelineOptions(
|
||||
allow_external_plugins=allow_external_plugins,
|
||||
enable_remote_services=enable_remote_services,
|
||||
accelerator_options=accelerator_options,
|
||||
do_ocr=ocr,
|
||||
ocr_options=ocr_options,
|
||||
do_table_structure=True,
|
||||
do_code_enrichment=enrich_code,
|
||||
do_formula_enrichment=enrich_formula,
|
||||
do_picture_description=enrich_picture_description,
|
||||
do_picture_classification=enrich_picture_classes,
|
||||
document_timeout=document_timeout,
|
||||
)
|
||||
pipeline_options.table_structure_options.do_cell_matching = (
|
||||
True # do_cell_matching
|
||||
)
|
||||
pipeline_options.table_structure_options.mode = table_mode
|
||||
|
||||
if image_export_mode != ImageRefMode.PLACEHOLDER:
|
||||
pipeline_options.generate_page_images = True
|
||||
pipeline_options.generate_picture_images = (
|
||||
True # FIXME: to be deprecated in verson 3
|
||||
)
|
||||
pipeline_options.images_scale = 2
|
||||
|
||||
backend: Type[PdfDocumentBackend]
|
||||
if pdf_backend == PdfBackend.DLPARSE_V1:
|
||||
backend = DoclingParseDocumentBackend
|
||||
elif pdf_backend == PdfBackend.DLPARSE_V2:
|
||||
backend = DoclingParseV2DocumentBackend
|
||||
elif pdf_backend == PdfBackend.DLPARSE_V4:
|
||||
backend = DoclingParseV4DocumentBackend # type: ignore
|
||||
elif pdf_backend == PdfBackend.PYPDFIUM2:
|
||||
backend = PyPdfiumDocumentBackend # type: ignore
|
||||
else:
|
||||
raise RuntimeError(f"Unexpected PDF backend type {pdf_backend}")
|
||||
|
||||
pdf_format_option = PdfFormatOption(
|
||||
pipeline_options=pipeline_options,
|
||||
backend=backend, # pdf_backend
|
||||
)
|
||||
elif pipeline == PdfPipeline.VLM:
|
||||
pipeline_options = VlmPipelineOptions()
|
||||
|
||||
if vlm_model == VlmModelType.GRANITE_VISION:
|
||||
pipeline_options.vlm_options = granite_vision_vlm_conversion_options
|
||||
elif vlm_model == VlmModelType.SMOLDOCLING:
|
||||
pipeline_options.vlm_options = smoldocling_vlm_conversion_options
|
||||
if sys.platform == "darwin":
|
||||
try:
|
||||
import mlx_vlm
|
||||
|
||||
pipeline_options.vlm_options = (
|
||||
smoldocling_vlm_mlx_conversion_options
|
||||
)
|
||||
except ImportError:
|
||||
_log.warning(
|
||||
"To run SmolDocling faster, please install mlx-vlm:\n"
|
||||
"pip install mlx-vlm"
|
||||
)
|
||||
|
||||
pdf_format_option = PdfFormatOption(
|
||||
pipeline_cls=VlmPipeline, pipeline_options=pipeline_options
|
||||
)
|
||||
pipeline_options.images_scale = 2
|
||||
|
||||
if artifacts_path is not None:
|
||||
pipeline_options.artifacts_path = artifacts_path
|
||||
|
||||
backend: Type[PdfDocumentBackend]
|
||||
if pdf_backend == PdfBackend.DLPARSE_V1:
|
||||
backend = DoclingParseDocumentBackend
|
||||
elif pdf_backend == PdfBackend.DLPARSE_V2:
|
||||
backend = DoclingParseV2DocumentBackend
|
||||
elif pdf_backend == PdfBackend.DLPARSE_V4:
|
||||
backend = DoclingParseV4DocumentBackend # type: ignore
|
||||
elif pdf_backend == PdfBackend.PYPDFIUM2:
|
||||
backend = PyPdfiumDocumentBackend # type: ignore
|
||||
else:
|
||||
raise RuntimeError(f"Unexpected PDF backend type {pdf_backend}")
|
||||
|
||||
pdf_format_option = PdfFormatOption(
|
||||
pipeline_options=pipeline_options,
|
||||
backend=backend, # pdf_backend
|
||||
)
|
||||
format_options: Dict[InputFormat, FormatOption] = {
|
||||
InputFormat.PDF: pdf_format_option,
|
||||
InputFormat.IMAGE: pdf_format_option,
|
||||
|
@ -1,10 +1,9 @@
|
||||
import logging
|
||||
import os
|
||||
import re
|
||||
import warnings
|
||||
from enum import Enum
|
||||
from pathlib import Path
|
||||
from typing import Annotated, Any, Dict, List, Literal, Optional, Union
|
||||
from typing import Any, ClassVar, Dict, List, Literal, Optional, Union
|
||||
|
||||
from pydantic import (
|
||||
AnyUrl,
|
||||
@ -13,13 +12,8 @@ from pydantic import (
|
||||
Field,
|
||||
field_validator,
|
||||
model_validator,
|
||||
validator,
|
||||
)
|
||||
from pydantic_settings import (
|
||||
BaseSettings,
|
||||
PydanticBaseSettingsSource,
|
||||
SettingsConfigDict,
|
||||
)
|
||||
from pydantic_settings import BaseSettings, SettingsConfigDict
|
||||
from typing_extensions import deprecated
|
||||
|
||||
_log = logging.getLogger(__name__)
|
||||
@ -83,6 +77,12 @@ class AcceleratorOptions(BaseSettings):
|
||||
return data
|
||||
|
||||
|
||||
class BaseOptions(BaseModel):
|
||||
"""Base class for options."""
|
||||
|
||||
kind: ClassVar[str]
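Because `kind` is now a `ClassVar`, it is no longer a per-instance (discriminator) field; the factories introduced in this diff match an options class by its class-level `kind`. A self-contained sketch of that matching, where `BaseOptions` is redefined locally so the snippet runs on its own and `MyOcrOptions` is hypothetical:

```python
# Hedged, self-contained sketch of class-level kind matching.
from typing import ClassVar, List, Literal, Type

from pydantic import BaseModel


class BaseOptions(BaseModel):
    kind: ClassVar[str]


class MyOcrOptions(BaseOptions):
    kind: ClassVar[Literal["my_ocr"]] = "my_ocr"


def find_options_class(kind: str, classes: List[Type[BaseOptions]]) -> Type[BaseOptions]:
    # mirrors BaseFactory.create_options: compare the class attribute, not an instance field
    for cls in classes:
        if cls.kind == kind:
            return cls
    raise RuntimeError(f"No options class found for kind {kind!r}")


assert find_options_class("my_ocr", [MyOcrOptions]) is MyOcrOptions
```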
|
||||
|
||||
|
||||
class TableFormerMode(str, Enum):
|
||||
"""Modes for the TableFormer model."""
|
||||
|
||||
@ -102,10 +102,9 @@ class TableStructureOptions(BaseModel):
|
||||
mode: TableFormerMode = TableFormerMode.ACCURATE
|
||||
|
||||
|
||||
class OcrOptions(BaseModel):
|
||||
class OcrOptions(BaseOptions):
|
||||
"""OCR options."""
|
||||
|
||||
kind: str
|
||||
lang: List[str]
|
||||
force_full_page_ocr: bool = False # If enabled a full page OCR is always applied
|
||||
bitmap_area_threshold: float = (
|
||||
@ -116,7 +115,7 @@ class OcrOptions(BaseModel):
|
||||
class RapidOcrOptions(OcrOptions):
|
||||
"""Options for the RapidOCR engine."""
|
||||
|
||||
kind: Literal["rapidocr"] = "rapidocr"
|
||||
kind: ClassVar[Literal["rapidocr"]] = "rapidocr"
|
||||
|
||||
# English and chinese are the most commly used models and have been tested with RapidOCR.
|
||||
lang: List[str] = [
|
||||
@ -155,7 +154,7 @@ class RapidOcrOptions(OcrOptions):
|
||||
class EasyOcrOptions(OcrOptions):
|
||||
"""Options for the EasyOCR engine."""
|
||||
|
||||
kind: Literal["easyocr"] = "easyocr"
|
||||
kind: ClassVar[Literal["easyocr"]] = "easyocr"
|
||||
lang: List[str] = ["fr", "de", "es", "en"]
|
||||
|
||||
use_gpu: Optional[bool] = None
|
||||
@ -175,7 +174,7 @@ class EasyOcrOptions(OcrOptions):
|
||||
class TesseractCliOcrOptions(OcrOptions):
|
||||
"""Options for the TesseractCli engine."""
|
||||
|
||||
kind: Literal["tesseract"] = "tesseract"
|
||||
kind: ClassVar[Literal["tesseract"]] = "tesseract"
|
||||
lang: List[str] = ["fra", "deu", "spa", "eng"]
|
||||
tesseract_cmd: str = "tesseract"
|
||||
path: Optional[str] = None
|
||||
@ -188,7 +187,7 @@ class TesseractCliOcrOptions(OcrOptions):
|
||||
class TesseractOcrOptions(OcrOptions):
|
||||
"""Options for the Tesseract engine."""
|
||||
|
||||
kind: Literal["tesserocr"] = "tesserocr"
|
||||
kind: ClassVar[Literal["tesserocr"]] = "tesserocr"
|
||||
lang: List[str] = ["fra", "deu", "spa", "eng"]
|
||||
path: Optional[str] = None
|
||||
|
||||
@ -200,7 +199,7 @@ class TesseractOcrOptions(OcrOptions):
|
||||
class OcrMacOptions(OcrOptions):
|
||||
"""Options for the Mac OCR engine."""
|
||||
|
||||
kind: Literal["ocrmac"] = "ocrmac"
|
||||
kind: ClassVar[Literal["ocrmac"]] = "ocrmac"
|
||||
lang: List[str] = ["fr-FR", "de-DE", "es-ES", "en-US"]
|
||||
recognition: str = "accurate"
|
||||
framework: str = "vision"
|
||||
@ -210,8 +209,7 @@ class OcrMacOptions(OcrOptions):
|
||||
)
|
||||
|
||||
|
||||
class PictureDescriptionBaseOptions(BaseModel):
|
||||
kind: str
|
||||
class PictureDescriptionBaseOptions(BaseOptions):
|
||||
batch_size: int = 8
|
||||
scale: float = 2
|
||||
|
||||
@ -221,7 +219,7 @@ class PictureDescriptionBaseOptions(BaseModel):
|
||||
|
||||
|
||||
class PictureDescriptionApiOptions(PictureDescriptionBaseOptions):
|
||||
kind: Literal["api"] = "api"
|
||||
kind: ClassVar[Literal["api"]] = "api"
|
||||
|
||||
url: AnyUrl = AnyUrl("http://localhost:8000/v1/chat/completions")
|
||||
headers: Dict[str, str] = {}
|
||||
@ -233,7 +231,7 @@ class PictureDescriptionApiOptions(PictureDescriptionBaseOptions):
|
||||
|
||||
|
||||
class PictureDescriptionVlmOptions(PictureDescriptionBaseOptions):
|
||||
kind: Literal["vlm"] = "vlm"
|
||||
kind: ClassVar[Literal["vlm"]] = "vlm"
|
||||
|
||||
repo_id: str
|
||||
prompt: str = "Describe this image in a few sentences."
|
||||
@ -265,6 +263,11 @@ class ResponseFormat(str, Enum):
|
||||
MARKDOWN = "markdown"
|
||||
|
||||
|
||||
class InferenceFramework(str, Enum):
|
||||
MLX = "mlx"
|
||||
TRANSFORMERS = "transformers"
|
||||
|
||||
|
||||
class HuggingFaceVlmOptions(BaseVlmOptions):
|
||||
kind: Literal["hf_model_options"] = "hf_model_options"
|
||||
|
||||
@ -273,6 +276,7 @@ class HuggingFaceVlmOptions(BaseVlmOptions):
|
||||
llm_int8_threshold: float = 6.0
|
||||
quantized: bool = False
|
||||
|
||||
inference_framework: InferenceFramework
|
||||
response_format: ResponseFormat
|
||||
|
||||
@property
|
||||
@ -280,10 +284,19 @@ class HuggingFaceVlmOptions(BaseVlmOptions):
|
||||
return self.repo_id.replace("/", "--")
|
||||
|
||||
|
||||
smoldocling_vlm_mlx_conversion_options = HuggingFaceVlmOptions(
|
||||
repo_id="ds4sd/SmolDocling-256M-preview-mlx-bf16",
|
||||
prompt="Convert this page to docling.",
|
||||
response_format=ResponseFormat.DOCTAGS,
|
||||
inference_framework=InferenceFramework.MLX,
|
||||
)
|
||||
|
||||
|
||||
smoldocling_vlm_conversion_options = HuggingFaceVlmOptions(
|
||||
repo_id="ds4sd/SmolDocling-256M-preview",
|
||||
prompt="Convert this page to docling.",
|
||||
response_format=ResponseFormat.DOCTAGS,
|
||||
inference_framework=InferenceFramework.TRANSFORMERS,
|
||||
)
|
||||
|
||||
granite_vision_vlm_conversion_options = HuggingFaceVlmOptions(
|
||||
@ -291,9 +304,15 @@ granite_vision_vlm_conversion_options = HuggingFaceVlmOptions(
|
||||
# prompt="OCR the full page to markdown.",
|
||||
prompt="OCR this image.",
|
||||
response_format=ResponseFormat.MARKDOWN,
|
||||
inference_framework=InferenceFramework.TRANSFORMERS,
|
||||
)
|
||||
|
||||
|
||||
class VlmModelType(str, Enum):
|
||||
SMOLDOCLING = "smoldocling"
|
||||
GRANITE_VISION = "granite_vision"
|
||||
|
||||
|
||||
# Define an enum for the backend options
|
||||
class PdfBackend(str, Enum):
|
||||
"""Enum of valid PDF backends."""
|
||||
@ -305,6 +324,7 @@ class PdfBackend(str, Enum):
|
||||
|
||||
|
||||
# Define an enum for the ocr engines
|
||||
@deprecated("Use ocr_factory.registered_enum")
|
||||
class OcrEngine(str, Enum):
|
||||
"""Enum of valid OCR engines."""
|
||||
|
||||
@ -324,16 +344,18 @@ class PipelineOptions(BaseModel):
|
||||
document_timeout: Optional[float] = None
|
||||
accelerator_options: AcceleratorOptions = AcceleratorOptions()
|
||||
enable_remote_services: bool = False
|
||||
allow_external_plugins: bool = False
|
||||
|
||||
|
||||
class PaginatedPipelineOptions(PipelineOptions):
|
||||
artifacts_path: Optional[Union[Path, str]] = None
|
||||
|
||||
images_scale: float = 1.0
|
||||
generate_page_images: bool = False
|
||||
generate_picture_images: bool = False
|
||||
|
||||
|
||||
class VlmPipelineOptions(PaginatedPipelineOptions):
|
||||
artifacts_path: Optional[Union[Path, str]] = None
|
||||
|
||||
generate_page_images: bool = True
|
||||
force_backend_text: bool = (
|
||||
@ -346,7 +368,6 @@ class VlmPipelineOptions(PaginatedPipelineOptions):
|
||||
class PdfPipelineOptions(PaginatedPipelineOptions):
|
||||
"""Options for the PDF pipeline."""
|
||||
|
||||
artifacts_path: Optional[Union[Path, str]] = None
|
||||
do_table_structure: bool = True # True: perform table structure extraction
|
||||
do_ocr: bool = True # True: perform OCR, replace programmatic PDF text
|
||||
do_code_enrichment: bool = False # True: perform code OCR
|
||||
@ -359,17 +380,10 @@ class PdfPipelineOptions(PaginatedPipelineOptions):
|
||||
# If True, text from backend will be used instead of generated text
|
||||
|
||||
table_structure_options: TableStructureOptions = TableStructureOptions()
|
||||
ocr_options: Union[
|
||||
EasyOcrOptions,
|
||||
TesseractCliOcrOptions,
|
||||
TesseractOcrOptions,
|
||||
OcrMacOptions,
|
||||
RapidOcrOptions,
|
||||
] = Field(EasyOcrOptions(), discriminator="kind")
|
||||
picture_description_options: Annotated[
|
||||
Union[PictureDescriptionApiOptions, PictureDescriptionVlmOptions],
|
||||
Field(discriminator="kind"),
|
||||
] = smolvlm_picture_description
|
||||
ocr_options: OcrOptions = EasyOcrOptions()
|
||||
picture_description_options: PictureDescriptionBaseOptions = (
|
||||
smolvlm_picture_description
|
||||
)
|
||||
|
||||
images_scale: float = 1.0
|
||||
generate_page_images: bool = False
|
||||
@ -384,3 +398,8 @@ class PdfPipelineOptions(PaginatedPipelineOptions):
|
||||
)
|
||||
|
||||
generate_parsed_pages: bool = False
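With `ocr_options` and `picture_description_options` now typed against their base classes instead of a closed discriminated union, any registered options subclass can be assigned directly. A hedged sketch:

```python
# Hedged sketch: assigning a different registered engine's options directly.
# Plugin-provided options subclasses can be used the same way (see allow_external_plugins).
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=TesseractCliOcrOptions(lang=["eng", "deu"]),
)
```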
|
||||
|
||||
|
||||
class PdfPipeline(str, Enum):
|
||||
STANDARD = "standard"
|
||||
VLM = "vlm"
|
||||
|
@ -1,3 +1,4 @@
|
||||
import hashlib
|
||||
import logging
|
||||
import math
|
||||
import sys
|
||||
@ -181,7 +182,14 @@ class DocumentConverter:
|
||||
)
|
||||
for format in self.allowed_formats
|
||||
}
|
||||
self.initialized_pipelines: Dict[Type[BasePipeline], BasePipeline] = {}
|
||||
self.initialized_pipelines: Dict[
|
||||
Tuple[Type[BasePipeline], str], BasePipeline
|
||||
] = {}
|
||||
|
||||
def _get_pipeline_options_hash(self, pipeline_options: PipelineOptions) -> str:
|
||||
"""Generate a hash of pipeline options to use as part of the cache key."""
|
||||
options_str = str(pipeline_options.model_dump())
|
||||
return hashlib.md5(options_str.encode("utf-8")).hexdigest()
|
||||
|
||||
def initialize_pipeline(self, format: InputFormat):
|
||||
"""Initialize the conversion pipeline for the selected format."""
|
||||
@ -279,31 +287,36 @@ class DocumentConverter:
|
||||
yield item
|
||||
|
||||
def _get_pipeline(self, doc_format: InputFormat) -> Optional[BasePipeline]:
|
||||
"""Retrieve or initialize a pipeline, reusing instances based on class and options."""
|
||||
fopt = self.format_to_options.get(doc_format)
|
||||
|
||||
if fopt is None:
|
||||
if fopt is None or fopt.pipeline_options is None:
|
||||
return None
|
||||
else:
|
||||
pipeline_class = fopt.pipeline_cls
|
||||
pipeline_options = fopt.pipeline_options
|
||||
|
||||
if pipeline_options is None:
|
||||
return None
|
||||
# TODO this will ignore if different options have been defined for the same pipeline class.
|
||||
if (
|
||||
pipeline_class not in self.initialized_pipelines
|
||||
or self.initialized_pipelines[pipeline_class].pipeline_options
|
||||
!= pipeline_options
|
||||
):
|
||||
self.initialized_pipelines[pipeline_class] = pipeline_class(
|
||||
pipeline_class = fopt.pipeline_cls
|
||||
pipeline_options = fopt.pipeline_options
|
||||
options_hash = self._get_pipeline_options_hash(pipeline_options)
|
||||
|
||||
# Use a composite key to cache pipelines
|
||||
cache_key = (pipeline_class, options_hash)
|
||||
|
||||
if cache_key not in self.initialized_pipelines:
|
||||
_log.info(
|
||||
f"Initializing pipeline for {pipeline_class.__name__} with options hash {options_hash}"
|
||||
)
|
||||
self.initialized_pipelines[cache_key] = pipeline_class(
|
||||
pipeline_options=pipeline_options
|
||||
)
|
||||
return self.initialized_pipelines[pipeline_class]
|
||||
else:
|
||||
_log.debug(
|
||||
f"Reusing cached pipeline for {pipeline_class.__name__} with options hash {options_hash}"
|
||||
)
|
||||
|
||||
return self.initialized_pipelines[cache_key]
|
||||
|
||||
def _process_document(
|
||||
self, in_doc: InputDocument, raises_on_error: bool
|
||||
) -> ConversionResult:
|
||||
|
||||
valid = (
|
||||
self.allowed_formats is not None and in_doc.format in self.allowed_formats
|
||||
)
|
||||
@ -345,7 +358,6 @@ class DocumentConverter:
|
||||
else:
|
||||
if raises_on_error:
|
||||
raise ConversionError(f"Input document {in_doc.file} is not valid.")
|
||||
|
||||
else:
|
||||
# invalid doc or not of desired format
|
||||
conv_res = ConversionResult(
|
||||
|
@ -1,14 +1,22 @@
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import Any, Generic, Iterable, Optional
|
||||
from typing import Any, Generic, Iterable, Optional, Protocol, Type
|
||||
|
||||
from docling_core.types.doc import BoundingBox, DocItem, DoclingDocument, NodeItem
|
||||
from typing_extensions import TypeVar
|
||||
|
||||
from docling.datamodel.base_models import ItemAndImageEnrichmentElement, Page
|
||||
from docling.datamodel.document import ConversionResult
|
||||
from docling.datamodel.pipeline_options import BaseOptions
|
||||
from docling.datamodel.settings import settings
|
||||
|
||||
|
||||
class BaseModelWithOptions(Protocol):
|
||||
@classmethod
|
||||
def get_options_type(cls) -> Type[BaseOptions]: ...
|
||||
|
||||
def __init__(self, *, options: BaseOptions, **kwargs): ...
|
||||
|
||||
|
||||
class BasePageModel(ABC):
|
||||
@abstractmethod
|
||||
def __call__(
|
||||
|
@ -2,7 +2,7 @@ import copy
|
||||
import logging
|
||||
from abc import abstractmethod
|
||||
from pathlib import Path
|
||||
from typing import Iterable, List
|
||||
from typing import Iterable, List, Optional, Type
|
||||
|
||||
import numpy as np
|
||||
from docling_core.types.doc import BoundingBox, CoordOrigin
|
||||
@ -13,15 +13,22 @@ from scipy.ndimage import binary_dilation, find_objects, label
|
||||
|
||||
from docling.datamodel.base_models import Page
|
||||
from docling.datamodel.document import ConversionResult
|
||||
from docling.datamodel.pipeline_options import OcrOptions
|
||||
from docling.datamodel.pipeline_options import AcceleratorOptions, OcrOptions
|
||||
from docling.datamodel.settings import settings
|
||||
from docling.models.base_model import BasePageModel
|
||||
from docling.models.base_model import BaseModelWithOptions, BasePageModel
|
||||
|
||||
_log = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class BaseOcrModel(BasePageModel):
|
||||
def __init__(self, enabled: bool, options: OcrOptions):
|
||||
class BaseOcrModel(BasePageModel, BaseModelWithOptions):
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
enabled: bool,
|
||||
artifacts_path: Optional[Path],
|
||||
options: OcrOptions,
|
||||
accelerator_options: AcceleratorOptions,
|
||||
):
|
||||
self.enabled = enabled
|
||||
self.options = options
|
||||
|
||||
@ -186,3 +193,8 @@ class BaseOcrModel(BasePageModel):
|
||||
self, conv_res: ConversionResult, page_batch: Iterable[Page]
|
||||
) -> Iterable[Page]:
|
||||
pass
|
||||
|
||||
@classmethod
|
||||
@abstractmethod
|
||||
def get_options_type(cls) -> Type[OcrOptions]:
|
||||
pass
|
||||
|
@ -2,7 +2,7 @@ import logging
|
||||
import warnings
|
||||
import zipfile
|
||||
from pathlib import Path
|
||||
from typing import Iterable, List, Optional
|
||||
from typing import Iterable, List, Optional, Type
|
||||
|
||||
import numpy
|
||||
from docling_core.types.doc import BoundingBox, CoordOrigin
|
||||
@ -14,6 +14,7 @@ from docling.datamodel.pipeline_options import (
|
||||
AcceleratorDevice,
|
||||
AcceleratorOptions,
|
||||
EasyOcrOptions,
|
||||
OcrOptions,
|
||||
)
|
||||
from docling.datamodel.settings import settings
|
||||
from docling.models.base_ocr_model import BaseOcrModel
|
||||
@ -34,7 +35,12 @@ class EasyOcrModel(BaseOcrModel):
|
||||
options: EasyOcrOptions,
|
||||
accelerator_options: AcceleratorOptions,
|
||||
):
|
||||
super().__init__(enabled=enabled, options=options)
|
||||
super().__init__(
|
||||
enabled=enabled,
|
||||
artifacts_path=artifacts_path,
|
||||
options=options,
|
||||
accelerator_options=accelerator_options,
|
||||
)
|
||||
self.options: EasyOcrOptions
|
||||
|
||||
self.scale = 3 # multiplier for 72 dpi == 216 dpi.
|
||||
@ -180,3 +186,7 @@ class EasyOcrModel(BaseOcrModel):
|
||||
self.draw_ocr_rects_and_cells(conv_res, page, ocr_rects)
|
||||
|
||||
yield page
|
||||
|
||||
@classmethod
|
||||
def get_options_type(cls) -> Type[OcrOptions]:
|
||||
return EasyOcrOptions
|
||||
|
27
docling/models/factories/__init__.py
Normal file
@ -0,0 +1,27 @@
|
||||
import logging
|
||||
from functools import lru_cache
|
||||
|
||||
from docling.models.factories.ocr_factory import OcrFactory
|
||||
from docling.models.factories.picture_description_factory import (
|
||||
PictureDescriptionFactory,
|
||||
)
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
@lru_cache()
|
||||
def get_ocr_factory(allow_external_plugins: bool = False) -> OcrFactory:
|
||||
factory = OcrFactory()
|
||||
factory.load_from_plugins(allow_external_plugins=allow_external_plugins)
|
||||
logger.info("Registered ocr engines: %r", factory.registered_kind)
|
||||
return factory
|
||||
|
||||
|
||||
@lru_cache()
|
||||
def get_picture_description_factory(
|
||||
allow_external_plugins: bool = False,
|
||||
) -> PictureDescriptionFactory:
|
||||
factory = PictureDescriptionFactory()
|
||||
factory.load_from_plugins(allow_external_plugins=allow_external_plugins)
|
||||
logger.info("Registered picture descriptions: %r", factory.registered_kind)
|
||||
return factory
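A short usage sketch for these cached accessors; the keyword arguments passed to `create_instance` are assumptions drawn from the updated `BaseOcrModel`/`EasyOcrModel` constructors elsewhere in this diff.

```python
# Hedged usage sketch for the lru_cache'd factory accessors above.
from docling.datamodel.pipeline_options import AcceleratorOptions, EasyOcrOptions
from docling.models.factories import get_ocr_factory

factory = get_ocr_factory(allow_external_plugins=False)
assert factory is get_ocr_factory(allow_external_plugins=False)  # lru_cache returns the same instance

print(factory.registered_kind)  # e.g. ['easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']

ocr_model = factory.create_instance(
    options=EasyOcrOptions(),
    enabled=False,  # keeps the sketch lightweight; True would load the actual engine
    artifacts_path=None,
    accelerator_options=AcceleratorOptions(),
)
```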
|
122
docling/models/factories/base_factory.py
Normal file
@ -0,0 +1,122 @@
|
||||
import enum
|
||||
import logging
|
||||
from abc import ABCMeta
|
||||
from typing import Generic, Optional, Type, TypeVar
|
||||
|
||||
from pluggy import PluginManager
|
||||
from pydantic import BaseModel
|
||||
|
||||
from docling.datamodel.pipeline_options import BaseOptions
|
||||
from docling.models.base_model import BaseModelWithOptions
|
||||
|
||||
A = TypeVar("A", bound=BaseModelWithOptions)
|
||||
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class FactoryMeta(BaseModel):
|
||||
kind: str
|
||||
plugin_name: str
|
||||
module: str
|
||||
|
||||
|
||||
class BaseFactory(Generic[A], metaclass=ABCMeta):
|
||||
default_plugin_name = "docling"
|
||||
|
||||
def __init__(self, plugin_attr_name: str, plugin_name=default_plugin_name):
|
||||
self.plugin_name = plugin_name
|
||||
self.plugin_attr_name = plugin_attr_name
|
||||
|
||||
self._classes: dict[Type[BaseOptions], Type[A]] = {}
|
||||
self._meta: dict[Type[BaseOptions], FactoryMeta] = {}
|
||||
|
||||
@property
|
||||
def registered_kind(self) -> list[str]:
|
||||
return list(opt.kind for opt in self._classes.keys())
|
||||
|
||||
def get_enum(self) -> enum.Enum:
|
||||
return enum.Enum(
|
||||
self.plugin_attr_name + "_enum",
|
||||
names={kind: kind for kind in self.registered_kind},
|
||||
type=str,
|
||||
module=__name__,
|
||||
)
|
||||
|
||||
@property
|
||||
def classes(self):
|
||||
return self._classes
|
||||
|
||||
@property
|
||||
def registered_meta(self):
|
||||
return self._meta
|
||||
|
||||
def create_instance(self, options: BaseOptions, **kwargs) -> A:
|
||||
try:
|
||||
_cls = self._classes[type(options)]
|
||||
return _cls(options=options, **kwargs)
|
||||
except KeyError:
|
||||
raise RuntimeError(self._err_msg_on_class_not_found(options.kind))
|
||||
|
||||
def create_options(self, kind: str, *args, **kwargs) -> BaseOptions:
|
||||
for opt_cls, _ in self._classes.items():
|
||||
if opt_cls.kind == kind:
|
||||
return opt_cls(*args, **kwargs)
|
||||
raise RuntimeError(self._err_msg_on_class_not_found(kind))
|
||||
|
||||
def _err_msg_on_class_not_found(self, kind: str):
|
||||
msg = []
|
||||
|
||||
for opt, cls in self._classes.items():
|
||||
msg.append(f"\t{opt.kind!r} => {cls!r}")
|
||||
|
||||
msg_str = "\n".join(msg)
|
||||
|
||||
return f"No class found with the name {kind!r}, known classes are:\n{msg_str}"
|
||||
|
||||
def register(self, cls: Type[A], plugin_name: str, plugin_module_name: str):
|
||||
opt_type = cls.get_options_type()
|
||||
|
||||
if opt_type in self._classes:
|
||||
raise ValueError(
|
||||
f"{opt_type.kind!r} already registered to class {self._classes[opt_type]!r}"
|
||||
)
|
||||
|
||||
self._classes[opt_type] = cls
|
||||
self._meta[opt_type] = FactoryMeta(
|
||||
kind=opt_type.kind, plugin_name=plugin_name, module=plugin_module_name
|
||||
)
|
||||
|
||||
def load_from_plugins(
|
||||
self, plugin_name: Optional[str] = None, allow_external_plugins: bool = False
|
||||
):
|
||||
plugin_name = plugin_name or self.plugin_name
|
||||
|
||||
plugin_manager = PluginManager(plugin_name)
|
||||
plugin_manager.load_setuptools_entrypoints(plugin_name)
|
||||
|
||||
for plugin_name, plugin_module in plugin_manager.list_name_plugin():
|
||||
plugin_module_name = str(plugin_module.__name__) # type: ignore
|
||||
|
||||
if not allow_external_plugins and not plugin_module_name.startswith(
|
||||
"docling."
|
||||
):
|
||||
logger.warning(
|
||||
f"The plugin {plugin_name} will not be loaded because Docling is being executed with allow_external_plugins=false."
|
||||
)
|
||||
continue
|
||||
|
||||
attr = getattr(plugin_module, self.plugin_attr_name, None)
|
||||
|
||||
if callable(attr):
|
||||
logger.info("Loading plugin %r", plugin_name)
|
||||
|
||||
config = attr()
|
||||
self.process_plugin(config, plugin_name, plugin_module_name)
|
||||
|
||||
def process_plugin(self, config, plugin_name: str, plugin_module_name: str):
|
||||
for item in config[self.plugin_attr_name]:
|
||||
try:
|
||||
self.register(item, plugin_name, plugin_module_name)
|
||||
except ValueError:
|
||||
logger.warning("%r already registered", item)
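For third-party packages, the loading logic above implies roughly the following shape: expose a module attribute named like the factory's `plugin_attr_name` (e.g. `ocr_engines`) and register that module under the `docling` entry-point group. Everything below (package, module and class names, and the pyproject snippet) is hypothetical and only sketches the contract.

```python
# Hedged sketch of a third-party plugin module (all names hypothetical).
# The package would register this module under the "docling" entry-point group, e.g.:
#   [project.entry-points."docling"]
#   my_ocr_plugin = "my_ocr_plugin.plugin"
# load_from_plugins() then calls the ocr_engines() hook; external modules are skipped
# unless allow_external_plugins=True.
from my_ocr_plugin.engine import MyOcrModel  # hypothetical BaseOcrModel subclass


def ocr_engines():
    return {"ocr_engines": [MyOcrModel]}
```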
|
11
docling/models/factories/ocr_factory.py
Normal file
@ -0,0 +1,11 @@
|
||||
import logging
|
||||
|
||||
from docling.models.base_ocr_model import BaseOcrModel
|
||||
from docling.models.factories.base_factory import BaseFactory
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class OcrFactory(BaseFactory[BaseOcrModel]):
|
||||
def __init__(self, *args, **kwargs):
|
||||
super().__init__("ocr_engines", *args, **kwargs)
|
11
docling/models/factories/picture_description_factory.py
Normal file
@ -0,0 +1,11 @@
|
||||
import logging
|
||||
|
||||
from docling.models.factories.base_factory import BaseFactory
|
||||
from docling.models.picture_description_base_model import PictureDescriptionBaseModel
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class PictureDescriptionFactory(BaseFactory[PictureDescriptionBaseModel]):
|
||||
def __init__(self, *args, **kwargs):
|
||||
super().__init__("picture_description", *args, **kwargs)
|
137
docling/models/hf_mlx_model.py
Normal file
@ -0,0 +1,137 @@
|
||||
import logging
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Iterable, List, Optional
|
||||
|
||||
from docling.datamodel.base_models import Page, VlmPrediction
|
||||
from docling.datamodel.document import ConversionResult
|
||||
from docling.datamodel.pipeline_options import (
|
||||
AcceleratorDevice,
|
||||
AcceleratorOptions,
|
||||
HuggingFaceVlmOptions,
|
||||
)
|
||||
from docling.datamodel.settings import settings
|
||||
from docling.models.base_model import BasePageModel
|
||||
from docling.utils.accelerator_utils import decide_device
|
||||
from docling.utils.profiling import TimeRecorder
|
||||
|
||||
_log = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class HuggingFaceMlxModel(BasePageModel):
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
enabled: bool,
|
||||
artifacts_path: Optional[Path],
|
||||
accelerator_options: AcceleratorOptions,
|
||||
vlm_options: HuggingFaceVlmOptions,
|
||||
):
|
||||
self.enabled = enabled
|
||||
|
||||
self.vlm_options = vlm_options
|
||||
|
||||
if self.enabled:
|
||||
|
||||
try:
|
||||
from mlx_vlm import generate, load # type: ignore
|
||||
from mlx_vlm.prompt_utils import apply_chat_template # type: ignore
|
||||
from mlx_vlm.utils import load_config, stream_generate # type: ignore
|
||||
except ImportError:
|
||||
raise ImportError(
|
||||
"mlx-vlm is not installed. Please install it via `pip install mlx-vlm` to use MLX VLM models."
|
||||
)
|
||||
|
||||
repo_cache_folder = vlm_options.repo_id.replace("/", "--")
|
||||
self.apply_chat_template = apply_chat_template
|
||||
self.stream_generate = stream_generate
|
||||
|
||||
# PARAMETERS:
|
||||
if artifacts_path is None:
|
||||
artifacts_path = self.download_models(self.vlm_options.repo_id)
|
||||
elif (artifacts_path / repo_cache_folder).exists():
|
||||
artifacts_path = artifacts_path / repo_cache_folder
|
||||
|
||||
self.param_question = vlm_options.prompt # "Perform Layout Analysis."
|
||||
|
||||
## Load the model
|
||||
self.vlm_model, self.processor = load(artifacts_path)
|
||||
self.config = load_config(artifacts_path)
|
||||
|
||||
@staticmethod
|
||||
def download_models(
|
||||
repo_id: str,
|
||||
local_dir: Optional[Path] = None,
|
||||
force: bool = False,
|
||||
progress: bool = False,
|
||||
) -> Path:
|
||||
from huggingface_hub import snapshot_download
|
||||
from huggingface_hub.utils import disable_progress_bars
|
||||
|
||||
if not progress:
|
||||
disable_progress_bars()
|
||||
download_path = snapshot_download(
|
||||
repo_id=repo_id,
|
||||
force_download=force,
|
||||
local_dir=local_dir,
|
||||
# revision="v0.0.1",
|
||||
)
|
||||
|
||||
return Path(download_path)
|
||||
|
||||
def __call__(
|
||||
self, conv_res: ConversionResult, page_batch: Iterable[Page]
|
||||
) -> Iterable[Page]:
|
||||
for page in page_batch:
|
||||
assert page._backend is not None
|
||||
if not page._backend.is_valid():
|
||||
yield page
|
||||
else:
|
||||
with TimeRecorder(conv_res, "vlm"):
|
||||
assert page.size is not None
|
||||
|
||||
hi_res_image = page.get_image(scale=2.0) # 144dpi
|
||||
# hi_res_image = page.get_image(scale=1.0) # 72dpi
|
||||
|
||||
if hi_res_image is not None:
|
||||
im_width, im_height = hi_res_image.size
|
||||
|
||||
# populate page_tags with predicted doc tags
|
||||
page_tags = ""
|
||||
|
||||
if hi_res_image:
|
||||
if hi_res_image.mode != "RGB":
|
||||
hi_res_image = hi_res_image.convert("RGB")
|
||||
|
||||
prompt = self.apply_chat_template(
|
||||
self.processor, self.config, self.param_question, num_images=1
|
||||
)
|
||||
|
||||
start_time = time.time()
|
||||
# Call model to generate:
|
||||
output = ""
|
||||
for token in self.stream_generate(
|
||||
self.vlm_model,
|
||||
self.processor,
|
||||
prompt,
|
||||
[hi_res_image],
|
||||
max_tokens=4096,
|
||||
verbose=False,
|
||||
):
|
||||
output += token.text
|
||||
if "</doctag>" in token.text:
|
||||
break
|
||||
|
||||
generation_time = time.time() - start_time
|
||||
page_tags = output
|
||||
|
||||
# inference_time = time.time() - start_time
|
||||
# tokens_per_second = num_tokens / generation_time
|
||||
# print("")
|
||||
# print(f"Page Inference Time: {inference_time:.2f} seconds")
|
||||
# print(f"Total tokens on page: {num_tokens:.2f}")
|
||||
# print(f"Tokens/sec: {tokens_per_second:.2f}")
|
||||
# print("")
|
||||
page.predictions.vlm_response = VlmPrediction(text=page_tags)
|
||||
|
||||
yield page
|
@ -1,13 +1,19 @@
|
||||
import logging
|
||||
import sys
|
||||
import tempfile
|
||||
from typing import Iterable, Optional, Tuple
|
||||
from pathlib import Path
|
||||
from typing import Iterable, Optional, Tuple, Type
|
||||
|
||||
from docling_core.types.doc import BoundingBox, CoordOrigin
|
||||
from docling_core.types.doc.page import BoundingRectangle, TextCell
|
||||
|
||||
from docling.datamodel.base_models import Page
|
||||
from docling.datamodel.document import ConversionResult
|
||||
from docling.datamodel.pipeline_options import OcrMacOptions
|
||||
from docling.datamodel.pipeline_options import (
|
||||
AcceleratorOptions,
|
||||
OcrMacOptions,
|
||||
OcrOptions,
|
||||
)
|
||||
from docling.datamodel.settings import settings
|
||||
from docling.models.base_ocr_model import BaseOcrModel
|
||||
from docling.utils.profiling import TimeRecorder
|
||||
@ -16,13 +22,26 @@ _log = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class OcrMacModel(BaseOcrModel):
|
||||
def __init__(self, enabled: bool, options: OcrMacOptions):
|
||||
super().__init__(enabled=enabled, options=options)
|
||||
def __init__(
|
||||
self,
|
||||
enabled: bool,
|
||||
artifacts_path: Optional[Path],
|
||||
options: OcrMacOptions,
|
||||
accelerator_options: AcceleratorOptions,
|
||||
):
|
||||
super().__init__(
|
||||
enabled=enabled,
|
||||
artifacts_path=artifacts_path,
|
||||
options=options,
|
||||
accelerator_options=accelerator_options,
|
||||
)
|
||||
self.options: OcrMacOptions
|
||||
|
||||
self.scale = 3 # multiplier for 72 dpi == 216 dpi.
|
||||
|
||||
if self.enabled:
|
||||
if "darwin" != sys.platform:
|
||||
raise RuntimeError(f"OcrMac is only supported on Mac.")
|
||||
install_errmsg = (
|
||||
"ocrmac is not correctly installed. "
|
||||
"Please install it via `pip install ocrmac` to use this OCR engine. "
|
||||
@ -121,3 +140,7 @@ class OcrMacModel(BaseOcrModel):
|
||||
self.draw_ocr_rects_and_cells(conv_res, page, ocr_rects)
|
||||
|
||||
yield page
|
||||
|
||||
@classmethod
|
||||
def get_options_type(cls) -> Type[OcrOptions]:
|
||||
return OcrMacOptions
|
||||
|
@ -63,7 +63,13 @@ class PagePreprocessingModel(BasePageModel):
|
||||
def draw_text_boxes(image, cells, show: bool = False):
|
||||
draw = ImageDraw.Draw(image)
|
||||
for c in cells:
|
||||
x0, y0, x1, y1 = c.bbox.as_tuple()
|
||||
x0, y0, x1, y1 = (
|
||||
c.to_bounding_box().l,
|
||||
c.to_bounding_box().t,
|
||||
c.to_bounding_box().r,
|
||||
c.to_bounding_box().b,
|
||||
)
|
||||
|
||||
draw.rectangle([(x0, y0), (x1, y1)], outline="red")
|
||||
if show:
|
||||
image.show()
|
||||
|
@ -1,13 +1,18 @@
|
||||
import base64
|
||||
import io
|
||||
import logging
|
||||
from typing import Iterable, List, Optional
|
||||
from pathlib import Path
|
||||
from typing import Iterable, List, Optional, Type, Union
|
||||
|
||||
import requests
|
||||
from PIL import Image
|
||||
from pydantic import BaseModel, ConfigDict
|
||||
|
||||
from docling.datamodel.pipeline_options import PictureDescriptionApiOptions
|
||||
from docling.datamodel.pipeline_options import (
|
||||
AcceleratorOptions,
|
||||
PictureDescriptionApiOptions,
|
||||
PictureDescriptionBaseOptions,
|
||||
)
|
||||
from docling.exceptions import OperationNotAllowed
|
||||
from docling.models.picture_description_base_model import PictureDescriptionBaseModel
|
||||
|
||||
@ -46,13 +51,25 @@ class ApiResponse(BaseModel):
|
||||
class PictureDescriptionApiModel(PictureDescriptionBaseModel):
|
||||
# elements_batch_size = 4
|
||||
|
||||
@classmethod
|
||||
def get_options_type(cls) -> Type[PictureDescriptionBaseOptions]:
|
||||
return PictureDescriptionApiOptions
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
enabled: bool,
|
||||
enable_remote_services: bool,
|
||||
artifacts_path: Optional[Union[Path, str]],
|
||||
options: PictureDescriptionApiOptions,
|
||||
accelerator_options: AcceleratorOptions,
|
||||
):
|
||||
super().__init__(enabled=enabled, options=options)
|
||||
super().__init__(
|
||||
enabled=enabled,
|
||||
enable_remote_services=enable_remote_services,
|
||||
artifacts_path=artifacts_path,
|
||||
options=options,
|
||||
accelerator_options=accelerator_options,
|
||||
)
|
||||
self.options: PictureDescriptionApiOptions
|
||||
|
||||
if self.enabled:
|
||||
|
@ -1,6 +1,7 @@
|
||||
import logging
|
||||
from abc import abstractmethod
|
||||
from pathlib import Path
|
||||
from typing import Any, Iterable, List, Optional, Union
|
||||
from typing import Any, Iterable, List, Optional, Type, Union
|
||||
|
||||
from docling_core.types.doc import (
|
||||
DoclingDocument,
|
||||
@ -13,20 +14,30 @@ from docling_core.types.doc.document import ( # TODO: move import to docling_co
|
||||
)
|
||||
from PIL import Image
|
||||
|
||||
from docling.datamodel.pipeline_options import PictureDescriptionBaseOptions
|
||||
from docling.datamodel.pipeline_options import (
|
||||
AcceleratorOptions,
|
||||
PictureDescriptionBaseOptions,
|
||||
)
|
||||
from docling.models.base_model import (
|
||||
BaseItemAndImageEnrichmentModel,
|
||||
BaseModelWithOptions,
|
||||
ItemAndImageEnrichmentElement,
|
||||
)
|
||||
|
||||
|
||||
class PictureDescriptionBaseModel(BaseItemAndImageEnrichmentModel):
|
||||
class PictureDescriptionBaseModel(
|
||||
BaseItemAndImageEnrichmentModel, BaseModelWithOptions
|
||||
):
|
||||
images_scale: float = 2.0
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
enabled: bool,
|
||||
enable_remote_services: bool,
|
||||
artifacts_path: Optional[Union[Path, str]],
|
||||
options: PictureDescriptionBaseOptions,
|
||||
accelerator_options: AcceleratorOptions,
|
||||
):
|
||||
self.enabled = enabled
|
||||
self.options = options
|
||||
@ -62,3 +73,8 @@ class PictureDescriptionBaseModel(BaseItemAndImageEnrichmentModel):
|
||||
PictureDescriptionData(text=output, provenance=self.provenance)
|
||||
)
|
||||
yield item
|
||||
|
||||
@classmethod
|
||||
@abstractmethod
|
||||
def get_options_type(cls) -> Type[PictureDescriptionBaseOptions]:
|
||||
pass
|
||||
|
@ -1,10 +1,11 @@
|
||||
from pathlib import Path
|
||||
from typing import Iterable, Optional, Union
|
||||
from typing import Iterable, Optional, Type, Union
|
||||
|
||||
from PIL import Image
|
||||
|
||||
from docling.datamodel.pipeline_options import (
|
||||
AcceleratorOptions,
|
||||
PictureDescriptionBaseOptions,
|
||||
PictureDescriptionVlmOptions,
|
||||
)
|
||||
from docling.models.picture_description_base_model import PictureDescriptionBaseModel
|
||||
@ -13,14 +14,25 @@ from docling.utils.accelerator_utils import decide_device
|
||||
|
||||
class PictureDescriptionVlmModel(PictureDescriptionBaseModel):
|
||||
|
||||
@classmethod
|
||||
def get_options_type(cls) -> Type[PictureDescriptionBaseOptions]:
|
||||
return PictureDescriptionVlmOptions
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
enabled: bool,
|
||||
enable_remote_services: bool,
|
||||
artifacts_path: Optional[Union[Path, str]],
|
||||
options: PictureDescriptionVlmOptions,
|
||||
accelerator_options: AcceleratorOptions,
|
||||
):
|
||||
super().__init__(enabled=enabled, options=options)
|
||||
super().__init__(
|
||||
enabled=enabled,
|
||||
enable_remote_services=enable_remote_services,
|
||||
artifacts_path=artifacts_path,
|
||||
options=options,
|
||||
accelerator_options=accelerator_options,
|
||||
)
|
||||
self.options: PictureDescriptionVlmOptions
|
||||
|
||||
if self.enabled:
|
||||
|
0
docling/models/plugins/__init__.py
Normal file
28
docling/models/plugins/defaults.py
Normal file
@ -0,0 +1,28 @@
|
||||
from docling.models.easyocr_model import EasyOcrModel
|
||||
from docling.models.ocr_mac_model import OcrMacModel
|
||||
from docling.models.picture_description_api_model import PictureDescriptionApiModel
|
||||
from docling.models.picture_description_vlm_model import PictureDescriptionVlmModel
|
||||
from docling.models.rapid_ocr_model import RapidOcrModel
|
||||
from docling.models.tesseract_ocr_cli_model import TesseractOcrCliModel
|
||||
from docling.models.tesseract_ocr_model import TesseractOcrModel
|
||||
|
||||
|
||||
def ocr_engines():
|
||||
return {
|
||||
"ocr_engines": [
|
||||
EasyOcrModel,
|
||||
OcrMacModel,
|
||||
RapidOcrModel,
|
||||
TesseractOcrModel,
|
||||
TesseractOcrCliModel,
|
||||
]
|
||||
}
|
||||
|
||||
|
||||
def picture_description():
|
||||
return {
|
||||
"picture_description": [
|
||||
PictureDescriptionVlmModel,
|
||||
PictureDescriptionApiModel,
|
||||
]
|
||||
}
|
@ -1,5 +1,6 @@
|
||||
import logging
|
||||
from typing import Iterable
|
||||
from pathlib import Path
|
||||
from typing import Iterable, Optional, Type
|
||||
|
||||
import numpy
|
||||
from docling_core.types.doc import BoundingBox, CoordOrigin
|
||||
@ -10,6 +11,7 @@ from docling.datamodel.document import ConversionResult
|
||||
from docling.datamodel.pipeline_options import (
|
||||
AcceleratorDevice,
|
||||
AcceleratorOptions,
|
||||
OcrOptions,
|
||||
RapidOcrOptions,
|
||||
)
|
||||
from docling.datamodel.settings import settings
|
||||
@ -24,10 +26,16 @@ class RapidOcrModel(BaseOcrModel):
|
||||
def __init__(
|
||||
self,
|
||||
enabled: bool,
|
||||
artifacts_path: Optional[Path],
|
||||
options: RapidOcrOptions,
|
||||
accelerator_options: AcceleratorOptions,
|
||||
):
|
||||
super().__init__(enabled=enabled, options=options)
|
||||
super().__init__(
|
||||
enabled=enabled,
|
||||
artifacts_path=artifacts_path,
|
||||
options=options,
|
||||
accelerator_options=accelerator_options,
|
||||
)
|
||||
self.options: RapidOcrOptions
|
||||
|
||||
self.scale = 3 # multiplier for 72 dpi == 216 dpi.
|
||||
@ -135,3 +143,7 @@ class RapidOcrModel(BaseOcrModel):
|
||||
self.draw_ocr_rects_and_cells(conv_res, page, ocr_rects)
|
||||
|
||||
yield page
|
||||
|
||||
@classmethod
|
||||
def get_options_type(cls) -> Type[OcrOptions]:
|
||||
return RapidOcrOptions
|
||||
|
@ -3,8 +3,9 @@ import io
|
||||
import logging
|
||||
import os
|
||||
import tempfile
|
||||
from pathlib import Path
|
||||
from subprocess import DEVNULL, PIPE, Popen
|
||||
from typing import Iterable, List, Optional, Tuple
|
||||
from typing import Iterable, List, Optional, Tuple, Type
|
||||
|
||||
import pandas as pd
|
||||
from docling_core.types.doc import BoundingBox, CoordOrigin
|
||||
@ -12,7 +13,11 @@ from docling_core.types.doc.page import BoundingRectangle, TextCell
|
||||
|
||||
from docling.datamodel.base_models import Page
|
||||
from docling.datamodel.document import ConversionResult
|
||||
from docling.datamodel.pipeline_options import TesseractCliOcrOptions
|
||||
from docling.datamodel.pipeline_options import (
|
||||
AcceleratorOptions,
|
||||
OcrOptions,
|
||||
TesseractCliOcrOptions,
|
||||
)
|
||||
from docling.datamodel.settings import settings
|
||||
from docling.models.base_ocr_model import BaseOcrModel
|
||||
from docling.utils.ocr_utils import map_tesseract_script
|
||||
@ -22,8 +27,19 @@ _log = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class TesseractOcrCliModel(BaseOcrModel):
|
||||
def __init__(self, enabled: bool, options: TesseractCliOcrOptions):
|
||||
super().__init__(enabled=enabled, options=options)
|
||||
def __init__(
|
||||
self,
|
||||
enabled: bool,
|
||||
artifacts_path: Optional[Path],
|
||||
options: TesseractCliOcrOptions,
|
||||
accelerator_options: AcceleratorOptions,
|
||||
):
|
||||
super().__init__(
|
||||
enabled=enabled,
|
||||
artifacts_path=artifacts_path,
|
||||
options=options,
|
||||
accelerator_options=accelerator_options,
|
||||
)
|
||||
self.options: TesseractCliOcrOptions
|
||||
|
||||
self.scale = 3 # multiplier for 72 dpi == 216 dpi.
|
||||
@ -257,3 +273,7 @@ class TesseractOcrCliModel(BaseOcrModel):
|
||||
self.draw_ocr_rects_and_cells(conv_res, page, ocr_rects)
|
||||
|
||||
yield page
|
||||
|
||||
@classmethod
|
||||
def get_options_type(cls) -> Type[OcrOptions]:
|
||||
return TesseractCliOcrOptions
|
||||
|
@ -1,12 +1,17 @@
|
||||
import logging
|
||||
from typing import Iterable
|
||||
from pathlib import Path
|
||||
from typing import Iterable, Optional, Type
|
||||
|
||||
from docling_core.types.doc import BoundingBox, CoordOrigin
|
||||
from docling_core.types.doc.page import BoundingRectangle, TextCell
|
||||
|
||||
from docling.datamodel.base_models import Page
|
||||
from docling.datamodel.document import ConversionResult
|
||||
from docling.datamodel.pipeline_options import TesseractOcrOptions
|
||||
from docling.datamodel.pipeline_options import (
|
||||
AcceleratorOptions,
|
||||
OcrOptions,
|
||||
TesseractOcrOptions,
|
||||
)
|
||||
from docling.datamodel.settings import settings
|
||||
from docling.models.base_ocr_model import BaseOcrModel
|
||||
from docling.utils.ocr_utils import map_tesseract_script
|
||||
@ -16,8 +21,19 @@ _log = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class TesseractOcrModel(BaseOcrModel):
|
||||
def __init__(self, enabled: bool, options: TesseractOcrOptions):
|
||||
super().__init__(enabled=enabled, options=options)
|
||||
def __init__(
|
||||
self,
|
||||
enabled: bool,
|
||||
artifacts_path: Optional[Path],
|
||||
options: TesseractOcrOptions,
|
||||
accelerator_options: AcceleratorOptions,
|
||||
):
|
||||
super().__init__(
|
||||
enabled=enabled,
|
||||
artifacts_path=artifacts_path,
|
||||
options=options,
|
||||
accelerator_options=accelerator_options,
|
||||
)
|
||||
self.options: TesseractOcrOptions
|
||||
|
||||
self.scale = 3 # multiplier for 72 dpi == 216 dpi.
|
||||
@ -200,3 +216,7 @@ class TesseractOcrModel(BaseOcrModel):
|
||||
self.draw_ocr_rects_and_cells(conv_res, page, ocr_rects)
|
||||
|
||||
yield page
|
||||
|
||||
@classmethod
|
||||
def get_options_type(cls) -> Type[OcrOptions]:
|
||||
return TesseractOcrOptions
|
||||
|
@ -10,16 +10,7 @@ from docling.backend.abstract_backend import AbstractDocumentBackend
|
||||
from docling.backend.pdf_backend import PdfDocumentBackend
|
||||
from docling.datamodel.base_models import AssembledUnit, Page
|
||||
from docling.datamodel.document import ConversionResult
|
||||
from docling.datamodel.pipeline_options import (
|
||||
EasyOcrOptions,
|
||||
OcrMacOptions,
|
||||
PdfPipelineOptions,
|
||||
PictureDescriptionApiOptions,
|
||||
PictureDescriptionVlmOptions,
|
||||
RapidOcrOptions,
|
||||
TesseractCliOcrOptions,
|
||||
TesseractOcrOptions,
|
||||
)
|
||||
from docling.datamodel.pipeline_options import PdfPipelineOptions
|
||||
from docling.datamodel.settings import settings
|
||||
from docling.models.base_ocr_model import BaseOcrModel
|
||||
from docling.models.code_formula_model import CodeFormulaModel, CodeFormulaModelOptions
|
||||
@ -27,22 +18,16 @@ from docling.models.document_picture_classifier import (
|
||||
DocumentPictureClassifier,
|
||||
DocumentPictureClassifierOptions,
|
||||
)
|
||||
from docling.models.easyocr_model import EasyOcrModel
|
||||
from docling.models.factories import get_ocr_factory, get_picture_description_factory
|
||||
from docling.models.layout_model import LayoutModel
|
||||
from docling.models.ocr_mac_model import OcrMacModel
|
||||
from docling.models.page_assemble_model import PageAssembleModel, PageAssembleOptions
|
||||
from docling.models.page_preprocessing_model import (
|
||||
PagePreprocessingModel,
|
||||
PagePreprocessingOptions,
|
||||
)
|
||||
from docling.models.picture_description_api_model import PictureDescriptionApiModel
|
||||
from docling.models.picture_description_base_model import PictureDescriptionBaseModel
|
||||
from docling.models.picture_description_vlm_model import PictureDescriptionVlmModel
|
||||
from docling.models.rapid_ocr_model import RapidOcrModel
|
||||
from docling.models.readingorder_model import ReadingOrderModel, ReadingOrderOptions
|
||||
from docling.models.table_structure_model import TableStructureModel
|
||||
from docling.models.tesseract_ocr_cli_model import TesseractOcrCliModel
|
||||
from docling.models.tesseract_ocr_model import TesseractOcrModel
|
||||
from docling.pipeline.base_pipeline import PaginatedPipeline
|
||||
from docling.utils.model_downloader import download_models
|
||||
from docling.utils.profiling import ProfilingScope, TimeRecorder
|
||||
@@ -78,10 +63,7 @@ class StandardPdfPipeline(PaginatedPipeline):

self.glm_model = ReadingOrderModel(options=ReadingOrderOptions())

if (ocr_model := self.get_ocr_model(artifacts_path=artifacts_path)) is None:
raise RuntimeError(
f"The specified OCR kind is not supported: {pipeline_options.ocr_options.kind}."
)
ocr_model = self.get_ocr_model(artifacts_path=artifacts_path)

self.build_pipe = [
# Pre-processing
@@ -164,66 +146,30 @@ class StandardPdfPipeline(PaginatedPipeline):
output_dir = download_models(output_dir=local_dir, force=force, progress=False)
return output_dir

def get_ocr_model(
self, artifacts_path: Optional[Path] = None
) -> Optional[BaseOcrModel]:
if isinstance(self.pipeline_options.ocr_options, EasyOcrOptions):
return EasyOcrModel(
enabled=self.pipeline_options.do_ocr,
artifacts_path=artifacts_path,
options=self.pipeline_options.ocr_options,
accelerator_options=self.pipeline_options.accelerator_options,
)
elif isinstance(self.pipeline_options.ocr_options, TesseractCliOcrOptions):
return TesseractOcrCliModel(
enabled=self.pipeline_options.do_ocr,
options=self.pipeline_options.ocr_options,
)
elif isinstance(self.pipeline_options.ocr_options, TesseractOcrOptions):
return TesseractOcrModel(
enabled=self.pipeline_options.do_ocr,
options=self.pipeline_options.ocr_options,
)
elif isinstance(self.pipeline_options.ocr_options, RapidOcrOptions):
return RapidOcrModel(
enabled=self.pipeline_options.do_ocr,
options=self.pipeline_options.ocr_options,
accelerator_options=self.pipeline_options.accelerator_options,
)
elif isinstance(self.pipeline_options.ocr_options, OcrMacOptions):
if "darwin" != sys.platform:
raise RuntimeError(
f"The specified OCR type is only supported on Mac: {self.pipeline_options.ocr_options.kind}."
)
return OcrMacModel(
enabled=self.pipeline_options.do_ocr,
options=self.pipeline_options.ocr_options,
)
return None
def get_ocr_model(self, artifacts_path: Optional[Path] = None) -> BaseOcrModel:
factory = get_ocr_factory(
allow_external_plugins=self.pipeline_options.allow_external_plugins
)
return factory.create_instance(
options=self.pipeline_options.ocr_options,
enabled=self.pipeline_options.do_ocr,
artifacts_path=artifacts_path,
accelerator_options=self.pipeline_options.accelerator_options,
)

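With the factory-based `get_ocr_model` above, callers only choose an options class and the matching engine is resolved for them. A usage sketch, assuming the standard `DocumentConverter` wiring and the option fields visible elsewhere in this diff (`do_ocr`, `ocr_options`, `allow_external_plugins`):

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = TesseractCliOcrOptions()  # factory resolves TesseractOcrCliModel
# pipeline_options.allow_external_plugins = True  # uncomment to allow third-party OCR engines

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
```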
def get_picture_description_model(
self, artifacts_path: Optional[Path] = None
) -> Optional[PictureDescriptionBaseModel]:
if isinstance(
self.pipeline_options.picture_description_options,
PictureDescriptionApiOptions,
):
return PictureDescriptionApiModel(
enabled=self.pipeline_options.do_picture_description,
enable_remote_services=self.pipeline_options.enable_remote_services,
options=self.pipeline_options.picture_description_options,
)
elif isinstance(
self.pipeline_options.picture_description_options,
PictureDescriptionVlmOptions,
):
return PictureDescriptionVlmModel(
enabled=self.pipeline_options.do_picture_description,
artifacts_path=artifacts_path,
options=self.pipeline_options.picture_description_options,
accelerator_options=self.pipeline_options.accelerator_options,
)
return None
factory = get_picture_description_factory(
allow_external_plugins=self.pipeline_options.allow_external_plugins
)
return factory.create_instance(
options=self.pipeline_options.picture_description_options,
enabled=self.pipeline_options.do_picture_description,
enable_remote_services=self.pipeline_options.enable_remote_services,
artifacts_path=artifacts_path,
accelerator_options=self.pipeline_options.accelerator_options,
)

def initialize_page(self, conv_res: ConversionResult, page: Page) -> Page:
with TimeRecorder(conv_res, "page_init"):
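Picture description follows the same pattern: the options object on the pipeline decides whether the API-backed or the VLM-backed model is created. A minimal sketch using only the flags visible in this hunk; remote services must be enabled explicitly for the API-backed variant:

```python
from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions()
pipeline_options.do_picture_description = True
# Required when picture_description_options points at a remote, API-backed model.
pipeline_options.enable_remote_services = True
# pipeline_options.picture_description_options = ...  # a PictureDescriptionApiOptions or
#                                                      # PictureDescriptionVlmOptions instance
```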
@ -1,30 +1,13 @@
|
||||
import itertools
|
||||
import logging
|
||||
import re
|
||||
import warnings
|
||||
from io import BytesIO
|
||||
|
||||
# from io import BytesIO
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
from typing import List, Optional, Union, cast
|
||||
|
||||
from docling_core.types import DoclingDocument
|
||||
from docling_core.types.doc import (
|
||||
BoundingBox,
|
||||
DocItem,
|
||||
DocItemLabel,
|
||||
DoclingDocument,
|
||||
GroupLabel,
|
||||
ImageRef,
|
||||
ImageRefMode,
|
||||
PictureItem,
|
||||
ProvenanceItem,
|
||||
Size,
|
||||
TableCell,
|
||||
TableData,
|
||||
TableItem,
|
||||
)
|
||||
from docling_core.types.doc.tokens import DocumentToken, TableToken
|
||||
# from docling_core.types import DoclingDocument
|
||||
from docling_core.types.doc import BoundingBox, DocItem, ImageRef, PictureItem, TextItem
|
||||
from docling_core.types.doc.document import DocTagsDocument
|
||||
from PIL import Image as PILImage
|
||||
|
||||
from docling.backend.abstract_backend import AbstractDocumentBackend
|
||||
from docling.backend.md_backend import MarkdownDocumentBackend
|
||||
@ -32,11 +15,12 @@ from docling.backend.pdf_backend import PdfDocumentBackend
|
||||
from docling.datamodel.base_models import InputFormat, Page
|
||||
from docling.datamodel.document import ConversionResult, InputDocument
|
||||
from docling.datamodel.pipeline_options import (
|
||||
PdfPipelineOptions,
|
||||
InferenceFramework,
|
||||
ResponseFormat,
|
||||
VlmPipelineOptions,
|
||||
)
|
||||
from docling.datamodel.settings import settings
|
||||
from docling.models.hf_mlx_model import HuggingFaceMlxModel
|
||||
from docling.models.hf_vlm_model import HuggingFaceVlmModel
|
||||
from docling.pipeline.base_pipeline import PaginatedPipeline
|
||||
from docling.utils.profiling import ProfilingScope, TimeRecorder
|
||||
@ -50,12 +34,6 @@ class VlmPipeline(PaginatedPipeline):
|
||||
super().__init__(pipeline_options)
|
||||
self.keep_backend = True
|
||||
|
||||
warnings.warn(
|
||||
"The VlmPipeline is currently experimental and may change in upcoming versions without notice.",
|
||||
category=UserWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
|
||||
self.pipeline_options: VlmPipelineOptions
|
||||
|
||||
artifacts_path: Optional[Path] = None
|
||||
@@ -79,14 +57,27 @@ class VlmPipeline(PaginatedPipeline):

self.keep_images = self.pipeline_options.generate_page_images

self.build_pipe = [
HuggingFaceVlmModel(
enabled=True, # must be always enabled for this pipeline to make sense.
artifacts_path=artifacts_path,
accelerator_options=pipeline_options.accelerator_options,
vlm_options=self.pipeline_options.vlm_options,
),
]
if (
self.pipeline_options.vlm_options.inference_framework
== InferenceFramework.MLX
):
self.build_pipe = [
HuggingFaceMlxModel(
enabled=True, # must be always enabled for this pipeline to make sense.
artifacts_path=artifacts_path,
accelerator_options=pipeline_options.accelerator_options,
vlm_options=self.pipeline_options.vlm_options,
),
]
else:
self.build_pipe = [
HuggingFaceVlmModel(
enabled=True, # must be always enabled for this pipeline to make sense.
artifacts_path=artifacts_path,
accelerator_options=pipeline_options.accelerator_options,
vlm_options=self.pipeline_options.vlm_options,
),
]

self.enrichment_pipe = [
# Other models working on `NodeItem` elements in the DoclingDocument
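The MLX branch above is reached purely through the options object: picking the MLX-converted SmolDocling preset sets `inference_framework` to `InferenceFramework.MLX`. A configuration sketch mirroring the example script further down in this diff; the converter wiring via `PdfFormatOption` is the usual pattern and is assumed here:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    smoldocling_vlm_mlx_conversion_options,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

pipeline_options = VlmPipelineOptions()
# This preset carries inference_framework == InferenceFramework.MLX,
# so the pipeline builds HuggingFaceMlxModel instead of HuggingFaceVlmModel.
pipeline_options.vlm_options = smoldocling_vlm_mlx_conversion_options

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline, pipeline_options=pipeline_options
        )
    }
)
```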
@@ -100,6 +91,17 @@ class VlmPipeline(PaginatedPipeline):

return page

def extract_text_from_backend(
self, page: Page, bbox: Union[BoundingBox, None]
) -> str:
# Convert bounding box normalized to 0-100 into page coordinates for cropping
text = ""
if bbox:
if page.size:
if page._backend:
text = page._backend.get_text_in_rect(bbox)
return text

def _assemble_document(self, conv_res: ConversionResult) -> ConversionResult:
with TimeRecorder(conv_res, "doc_assemble", scope=ProfilingScope.DOCUMENT):

@ -107,7 +109,45 @@ class VlmPipeline(PaginatedPipeline):
|
||||
self.pipeline_options.vlm_options.response_format
|
||||
== ResponseFormat.DOCTAGS
|
||||
):
|
||||
conv_res.document = self._turn_tags_into_doc(conv_res.pages)
|
||||
doctags_list = []
|
||||
image_list = []
|
||||
for page in conv_res.pages:
|
||||
predicted_doctags = ""
|
||||
img = PILImage.new("RGB", (1, 1), "rgb(255,255,255)")
|
||||
if page.predictions.vlm_response:
|
||||
predicted_doctags = page.predictions.vlm_response.text
|
||||
if page.image:
|
||||
img = page.image
|
||||
image_list.append(img)
|
||||
doctags_list.append(predicted_doctags)
|
||||
|
||||
doctags_list_c = cast(List[Union[Path, str]], doctags_list)
|
||||
image_list_c = cast(List[Union[Path, PILImage.Image]], image_list)
|
||||
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs(
|
||||
doctags_list_c, image_list_c
|
||||
)
|
||||
conv_res.document.load_from_doctags(doctags_doc)
|
||||
|
||||
# If forced backend text, replace model predicted text with backend one
|
||||
if page.size:
|
||||
if self.force_backend_text:
|
||||
scale = self.pipeline_options.images_scale
|
||||
for element, _level in conv_res.document.iterate_items():
|
||||
if (
|
||||
not isinstance(element, TextItem)
|
||||
or len(element.prov) == 0
|
||||
):
|
||||
continue
|
||||
crop_bbox = (
|
||||
element.prov[0]
|
||||
.bbox.scaled(scale=scale)
|
||||
.to_top_left_origin(
|
||||
page_height=page.size.height * scale
|
||||
)
|
||||
)
|
||||
txt = self.extract_text_from_backend(page, crop_bbox)
|
||||
element.text = txt
|
||||
element.orig = txt
|
||||
elif (
|
||||
self.pipeline_options.vlm_options.response_format
|
||||
== ResponseFormat.MARKDOWN
|
||||
@ -165,366 +205,6 @@ class VlmPipeline(PaginatedPipeline):
|
||||
)
|
||||
return backend.convert()
|
||||
|
||||
def _turn_tags_into_doc(self, pages: list[Page]) -> DoclingDocument:
|
||||
###############################################
|
||||
# Tag definitions and color mappings
|
||||
###############################################
|
||||
|
||||
# Maps the recognized tag to a Docling label.
|
||||
# Code items will be given DocItemLabel.CODE
|
||||
tag_to_doclabel = {
|
||||
"title": DocItemLabel.TITLE,
|
||||
"document_index": DocItemLabel.DOCUMENT_INDEX,
|
||||
"otsl": DocItemLabel.TABLE,
|
||||
"section_header_level_1": DocItemLabel.SECTION_HEADER,
|
||||
"checkbox_selected": DocItemLabel.CHECKBOX_SELECTED,
|
||||
"checkbox_unselected": DocItemLabel.CHECKBOX_UNSELECTED,
|
||||
"text": DocItemLabel.TEXT,
|
||||
"page_header": DocItemLabel.PAGE_HEADER,
|
||||
"page_footer": DocItemLabel.PAGE_FOOTER,
|
||||
"formula": DocItemLabel.FORMULA,
|
||||
"caption": DocItemLabel.CAPTION,
|
||||
"picture": DocItemLabel.PICTURE,
|
||||
"list_item": DocItemLabel.LIST_ITEM,
|
||||
"footnote": DocItemLabel.FOOTNOTE,
|
||||
"code": DocItemLabel.CODE,
|
||||
}
|
||||
|
||||
# Maps each tag to an associated bounding box color.
|
||||
tag_to_color = {
|
||||
"title": "blue",
|
||||
"document_index": "darkblue",
|
||||
"otsl": "green",
|
||||
"section_header_level_1": "purple",
|
||||
"checkbox_selected": "black",
|
||||
"checkbox_unselected": "gray",
|
||||
"text": "red",
|
||||
"page_header": "orange",
|
||||
"page_footer": "cyan",
|
||||
"formula": "pink",
|
||||
"caption": "magenta",
|
||||
"picture": "yellow",
|
||||
"list_item": "brown",
|
||||
"footnote": "darkred",
|
||||
"code": "lightblue",
|
||||
}
|
||||
|
||||
def extract_bounding_box(text_chunk: str) -> Optional[BoundingBox]:
|
||||
"""Extracts <loc_...> bounding box coords from the chunk, normalized by / 500."""
|
||||
coords = re.findall(r"<loc_(\d+)>", text_chunk)
|
||||
if len(coords) == 4:
|
||||
l, t, r, b = map(float, coords)
|
||||
return BoundingBox(l=l / 500, t=t / 500, r=r / 500, b=b / 500)
|
||||
return None
|
||||
|
||||
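To make the `/ 500` normalization concrete: a location run such as `<loc_125><loc_100><loc_250><loc_200>` becomes fractional page coordinates. A standalone sketch of the same parsing, using a plain tuple instead of docling's `BoundingBox`:

```python
import re

chunk = "<text><loc_125><loc_100><loc_250><loc_200>Lorem ipsum</text>"
coords = re.findall(r"<loc_(\d+)>", chunk)
if len(coords) == 4:
    l, t, r, b = (float(c) / 500 for c in coords)
    print((l, t, r, b))  # -> (0.25, 0.2, 0.5, 0.4)
```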
def extract_inner_text(text_chunk: str) -> str:
|
||||
"""Strips all <...> tags inside the chunk to get the raw text content."""
|
||||
return re.sub(r"<.*?>", "", text_chunk, flags=re.DOTALL).strip()
|
||||
|
||||
def extract_text_from_backend(page: Page, bbox: BoundingBox | None) -> str:
|
||||
# Convert bounding box normalized to 0-100 into page coordinates for cropping
|
||||
text = ""
|
||||
if bbox:
|
||||
if page.size:
|
||||
bbox.l = bbox.l * page.size.width
|
||||
bbox.t = bbox.t * page.size.height
|
||||
bbox.r = bbox.r * page.size.width
|
||||
bbox.b = bbox.b * page.size.height
|
||||
if page._backend:
|
||||
text = page._backend.get_text_in_rect(bbox)
|
||||
return text
|
||||
|
||||
def otsl_parse_texts(texts, tokens):
|
||||
split_word = TableToken.OTSL_NL.value
|
||||
split_row_tokens = [
|
||||
list(y)
|
||||
for x, y in itertools.groupby(tokens, lambda z: z == split_word)
|
||||
if not x
|
||||
]
|
||||
table_cells = []
|
||||
r_idx = 0
|
||||
c_idx = 0
|
||||
|
||||
def count_right(tokens, c_idx, r_idx, which_tokens):
|
||||
span = 0
|
||||
c_idx_iter = c_idx
|
||||
while tokens[r_idx][c_idx_iter] in which_tokens:
|
||||
c_idx_iter += 1
|
||||
span += 1
|
||||
if c_idx_iter >= len(tokens[r_idx]):
|
||||
return span
|
||||
return span
|
||||
|
||||
def count_down(tokens, c_idx, r_idx, which_tokens):
|
||||
span = 0
|
||||
r_idx_iter = r_idx
|
||||
while tokens[r_idx_iter][c_idx] in which_tokens:
|
||||
r_idx_iter += 1
|
||||
span += 1
|
||||
if r_idx_iter >= len(tokens):
|
||||
return span
|
||||
return span
|
||||
|
||||
for i, text in enumerate(texts):
|
||||
cell_text = ""
|
||||
if text in [
|
||||
TableToken.OTSL_FCEL.value,
|
||||
TableToken.OTSL_ECEL.value,
|
||||
TableToken.OTSL_CHED.value,
|
||||
TableToken.OTSL_RHED.value,
|
||||
TableToken.OTSL_SROW.value,
|
||||
]:
|
||||
row_span = 1
|
||||
col_span = 1
|
||||
right_offset = 1
|
||||
if text != TableToken.OTSL_ECEL.value:
|
||||
cell_text = texts[i + 1]
|
||||
right_offset = 2
|
||||
|
||||
# Check next element(s) for lcel / ucel / xcel, set properly row_span, col_span
|
||||
next_right_cell = ""
|
||||
if i + right_offset < len(texts):
|
||||
next_right_cell = texts[i + right_offset]
|
||||
|
||||
next_bottom_cell = ""
|
||||
if r_idx + 1 < len(split_row_tokens):
|
||||
if c_idx < len(split_row_tokens[r_idx + 1]):
|
||||
next_bottom_cell = split_row_tokens[r_idx + 1][c_idx]
|
||||
|
||||
if next_right_cell in [
|
||||
TableToken.OTSL_LCEL.value,
|
||||
TableToken.OTSL_XCEL.value,
|
||||
]:
|
||||
# we have horisontal spanning cell or 2d spanning cell
|
||||
col_span += count_right(
|
||||
split_row_tokens,
|
||||
c_idx + 1,
|
||||
r_idx,
|
||||
[TableToken.OTSL_LCEL.value, TableToken.OTSL_XCEL.value],
|
||||
)
|
||||
if next_bottom_cell in [
|
||||
TableToken.OTSL_UCEL.value,
|
||||
TableToken.OTSL_XCEL.value,
|
||||
]:
|
||||
# we have a vertical spanning cell or 2d spanning cell
|
||||
row_span += count_down(
|
||||
split_row_tokens,
|
||||
c_idx,
|
||||
r_idx + 1,
|
||||
[TableToken.OTSL_UCEL.value, TableToken.OTSL_XCEL.value],
|
||||
)
|
||||
|
||||
table_cells.append(
|
||||
TableCell(
|
||||
text=cell_text.strip(),
|
||||
row_span=row_span,
|
||||
col_span=col_span,
|
||||
start_row_offset_idx=r_idx,
|
||||
end_row_offset_idx=r_idx + row_span,
|
||||
start_col_offset_idx=c_idx,
|
||||
end_col_offset_idx=c_idx + col_span,
|
||||
)
|
||||
)
|
||||
if text in [
|
||||
TableToken.OTSL_FCEL.value,
|
||||
TableToken.OTSL_ECEL.value,
|
||||
TableToken.OTSL_CHED.value,
|
||||
TableToken.OTSL_RHED.value,
|
||||
TableToken.OTSL_SROW.value,
|
||||
TableToken.OTSL_LCEL.value,
|
||||
TableToken.OTSL_UCEL.value,
|
||||
TableToken.OTSL_XCEL.value,
|
||||
]:
|
||||
c_idx += 1
|
||||
if text == TableToken.OTSL_NL.value:
|
||||
r_idx += 1
|
||||
c_idx = 0
|
||||
return table_cells, split_row_tokens
|
||||
|
||||
def otsl_extract_tokens_and_text(s: str):
|
||||
# Pattern to match anything enclosed by < > (including the angle brackets themselves)
|
||||
pattern = r"(<[^>]+>)"
|
||||
# Find all tokens (e.g. "<otsl>", "<loc_140>", etc.)
|
||||
tokens = re.findall(pattern, s)
|
||||
# Remove any tokens that start with "<loc_"
|
||||
tokens = [
|
||||
token
|
||||
for token in tokens
|
||||
if not (
|
||||
token.startswith(rf"<{DocumentToken.LOC.value}")
|
||||
or token
|
||||
in [
|
||||
rf"<{DocumentToken.OTSL.value}>",
|
||||
rf"</{DocumentToken.OTSL.value}>",
|
||||
]
|
||||
)
|
||||
]
|
||||
# Split the string by those tokens to get the in-between text
|
||||
text_parts = re.split(pattern, s)
|
||||
text_parts = [
|
||||
token
|
||||
for token in text_parts
|
||||
if not (
|
||||
token.startswith(rf"<{DocumentToken.LOC.value}")
|
||||
or token
|
||||
in [
|
||||
rf"<{DocumentToken.OTSL.value}>",
|
||||
rf"</{DocumentToken.OTSL.value}>",
|
||||
]
|
||||
)
|
||||
]
|
||||
# Remove any empty or purely whitespace strings from text_parts
|
||||
text_parts = [part for part in text_parts if part.strip()]
|
||||
|
||||
return tokens, text_parts
|
||||
|
||||
def parse_table_content(otsl_content: str) -> TableData:
|
||||
tokens, mixed_texts = otsl_extract_tokens_and_text(otsl_content)
|
||||
table_cells, split_row_tokens = otsl_parse_texts(mixed_texts, tokens)
|
||||
|
||||
return TableData(
|
||||
num_rows=len(split_row_tokens),
|
||||
num_cols=(
|
||||
max(len(row) for row in split_row_tokens) if split_row_tokens else 0
|
||||
),
|
||||
table_cells=table_cells,
|
||||
)
|
||||
|
||||
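The OTSL helpers above reduce a token stream to rows and cell texts before spans are computed. The snippet below walks a tiny, hand-written OTSL string through the same regex and row-splitting logic; it is a self-contained illustration with literal `<fcel>`/`<nl>` tokens rather than a call into the pipeline code:

```python
import itertools
import re

# A 2x2 table: "<fcel>" opens a filled cell, "<nl>" ends a row (token literals are illustrative).
otsl = "<otsl><fcel>Name<fcel>Value<nl><fcel>alpha<fcel>1<nl></otsl>"

pattern = r"(<[^>]+>)"  # same "anything in angle brackets" pattern as above
tokens = [t for t in re.findall(pattern, otsl) if t not in ("<otsl>", "</otsl>")]
texts = [p for p in re.split(pattern, otsl) if p.strip() and not p.startswith("<")]

rows = [list(g) for is_nl, g in itertools.groupby(tokens, lambda t: t == "<nl>") if not is_nl]
print(len(rows), max(len(r) for r in rows))  # -> 2 2
print(texts)                                 # -> ['Name', 'Value', 'alpha', '1']
```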
doc = DoclingDocument(name="Document")
|
||||
for pg_idx, page in enumerate(pages):
|
||||
xml_content = ""
|
||||
predicted_text = ""
|
||||
if page.predictions.vlm_response:
|
||||
predicted_text = page.predictions.vlm_response.text
|
||||
image = page.image
|
||||
|
||||
page_no = pg_idx + 1
|
||||
bounding_boxes = []
|
||||
|
||||
if page.size:
|
||||
pg_width = page.size.width
|
||||
pg_height = page.size.height
|
||||
size = Size(width=pg_width, height=pg_height)
|
||||
parent_page = doc.add_page(page_no=page_no, size=size)
|
||||
|
||||
"""
|
||||
1. Finds all <tag>...</tag> blocks in the entire string (multi-line friendly) in the order they appear.
|
||||
2. For each chunk, extracts bounding box (if any) and inner text.
|
||||
3. Adds the item to a DoclingDocument structure with the right label.
|
||||
4. Tracks bounding boxes + color in a separate list for later visualization.
|
||||
"""
|
||||
|
||||
# Regex for all recognized tags
|
||||
tag_pattern = (
|
||||
rf"<(?P<tag>{DocItemLabel.TITLE}|{DocItemLabel.DOCUMENT_INDEX}|"
|
||||
rf"{DocItemLabel.CHECKBOX_UNSELECTED}|{DocItemLabel.CHECKBOX_SELECTED}|"
|
||||
rf"{DocItemLabel.TEXT}|{DocItemLabel.PAGE_HEADER}|"
|
||||
rf"{DocItemLabel.PAGE_FOOTER}|{DocItemLabel.FORMULA}|"
|
||||
rf"{DocItemLabel.CAPTION}|{DocItemLabel.PICTURE}|"
|
||||
rf"{DocItemLabel.LIST_ITEM}|{DocItemLabel.FOOTNOTE}|{DocItemLabel.CODE}|"
|
||||
rf"{DocItemLabel.SECTION_HEADER}_level_1|{DocumentToken.OTSL.value})>.*?</(?P=tag)>"
|
||||
)
|
||||
|
||||
# DocumentToken.OTSL
|
||||
pattern = re.compile(tag_pattern, re.DOTALL)
|
||||
|
||||
# Go through each match in order
|
||||
for match in pattern.finditer(predicted_text):
|
||||
full_chunk = match.group(0)
|
||||
tag_name = match.group("tag")
|
||||
|
||||
bbox = extract_bounding_box(full_chunk)
|
||||
doc_label = tag_to_doclabel.get(tag_name, DocItemLabel.PARAGRAPH)
|
||||
color = tag_to_color.get(tag_name, "white")
|
||||
|
||||
# Store bounding box + color
|
||||
if bbox:
|
||||
bounding_boxes.append((bbox, color))
|
||||
|
||||
if tag_name == DocumentToken.OTSL.value:
|
||||
table_data = parse_table_content(full_chunk)
|
||||
bbox = extract_bounding_box(full_chunk)
|
||||
|
||||
if bbox:
|
||||
prov = ProvenanceItem(
|
||||
bbox=bbox.resize_by_scale(pg_width, pg_height),
|
||||
charspan=(0, 0),
|
||||
page_no=page_no,
|
||||
)
|
||||
doc.add_table(data=table_data, prov=prov)
|
||||
else:
|
||||
doc.add_table(data=table_data)
|
||||
|
||||
elif tag_name == DocItemLabel.PICTURE:
|
||||
text_caption_content = extract_inner_text(full_chunk)
|
||||
if image:
|
||||
if bbox:
|
||||
im_width, im_height = image.size
|
||||
|
||||
crop_box = (
|
||||
int(bbox.l * im_width),
|
||||
int(bbox.t * im_height),
|
||||
int(bbox.r * im_width),
|
||||
int(bbox.b * im_height),
|
||||
)
|
||||
cropped_image = image.crop(crop_box)
|
||||
pic = doc.add_picture(
|
||||
parent=None,
|
||||
image=ImageRef.from_pil(image=cropped_image, dpi=72),
|
||||
prov=(
|
||||
ProvenanceItem(
|
||||
bbox=bbox.resize_by_scale(pg_width, pg_height),
|
||||
charspan=(0, 0),
|
||||
page_no=page_no,
|
||||
)
|
||||
),
|
||||
)
|
||||
# If there is a caption to an image, add it as well
|
||||
if len(text_caption_content) > 0:
|
||||
caption_item = doc.add_text(
|
||||
label=DocItemLabel.CAPTION,
|
||||
text=text_caption_content,
|
||||
parent=None,
|
||||
)
|
||||
pic.captions.append(caption_item.get_ref())
|
||||
else:
|
||||
if bbox:
|
||||
# In case we don't have access to an binary of an image
|
||||
doc.add_picture(
|
||||
parent=None,
|
||||
prov=ProvenanceItem(
|
||||
bbox=bbox, charspan=(0, 0), page_no=page_no
|
||||
),
|
||||
)
|
||||
# If there is a caption to an image, add it as well
|
||||
if len(text_caption_content) > 0:
|
||||
caption_item = doc.add_text(
|
||||
label=DocItemLabel.CAPTION,
|
||||
text=text_caption_content,
|
||||
parent=None,
|
||||
)
|
||||
pic.captions.append(caption_item.get_ref())
|
||||
else:
|
||||
# For everything else, treat as text
|
||||
if self.force_backend_text:
|
||||
text_content = extract_text_from_backend(page, bbox)
|
||||
else:
|
||||
text_content = extract_inner_text(full_chunk)
|
||||
doc.add_text(
|
||||
label=doc_label,
|
||||
text=text_content,
|
||||
prov=(
|
||||
ProvenanceItem(
|
||||
bbox=bbox.resize_by_scale(pg_width, pg_height),
|
||||
charspan=(0, len(text_content)),
|
||||
page_no=page_no,
|
||||
)
|
||||
if bbox
|
||||
else None
|
||||
),
|
||||
)
|
||||
return doc
|
||||
|
||||
@classmethod
|
||||
def get_default_options(cls) -> VlmPipelineOptions:
|
||||
return VlmPipelineOptions()
|
||||
|
@@ -154,7 +154,7 @@ def main():

conv_results = doc_converter.convert_all(
input_doc_paths,
raises_on_error=True, # to let conversion run through all and examine results at the end
raises_on_error=False, # to let conversion run through all and examine results at the end
)
success_count, partial_success_count, failure_count = export_documents(
conv_results, output_dir=Path("scratch")
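With `raises_on_error=False`, failed documents no longer abort the batch; they come back with a non-success status that the export step can count. A short sketch of inspecting the results, assuming `ConversionStatus` from docling's public API:

```python
from docling.datamodel.base_models import ConversionStatus

for res in conv_results:
    if res.status != ConversionStatus.SUCCESS:
        # Failed or partially converted documents are reported instead of raising.
        print(f"{res.input.file} finished with status {res.status}")
```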
@ -10,13 +10,15 @@ from docling.datamodel.pipeline_options import (
|
||||
VlmPipelineOptions,
|
||||
granite_vision_vlm_conversion_options,
|
||||
smoldocling_vlm_conversion_options,
|
||||
smoldocling_vlm_mlx_conversion_options,
|
||||
)
|
||||
from docling.datamodel.settings import settings
|
||||
from docling.document_converter import DocumentConverter, PdfFormatOption
|
||||
from docling.pipeline.vlm_pipeline import VlmPipeline
|
||||
|
||||
sources = [
|
||||
"tests/data/2305.03393v1-pg9-img.png",
|
||||
# "tests/data/2305.03393v1-pg9-img.png",
|
||||
"tests/data/pdf/2305.03393v1-pg9.pdf",
|
||||
]
|
||||
|
||||
## Use experimental VlmPipeline
|
||||
@ -29,7 +31,10 @@ pipeline_options.force_backend_text = False
|
||||
# pipeline_options.accelerator_options.cuda_use_flash_attention2 = True
|
||||
|
||||
## Pick a VLM model. We choose SmolDocling-256M by default
|
||||
pipeline_options.vlm_options = smoldocling_vlm_conversion_options
|
||||
# pipeline_options.vlm_options = smoldocling_vlm_conversion_options
|
||||
|
||||
## Pick a VLM model. Fast Apple Silicon friendly implementation for SmolDocling-256M via MLX
|
||||
pipeline_options.vlm_options = smoldocling_vlm_mlx_conversion_options
|
||||
|
||||
## Alternative VLM models:
|
||||
# pipeline_options.vlm_options = granite_vision_vlm_conversion_options
|
||||
@ -63,9 +68,6 @@ for source in sources:
|
||||
|
||||
res = converter.convert(source)
|
||||
|
||||
print("------------------------------------------------")
|
||||
print("MD:")
|
||||
print("------------------------------------------------")
|
||||
print("")
|
||||
print(res.document.export_to_markdown())
|
||||
|
||||
@ -83,8 +85,17 @@ for source in sources:
|
||||
with (out_path / f"{res.input.file.stem}.json").open("w") as fp:
|
||||
fp.write(json.dumps(res.document.export_to_dict()))
|
||||
|
||||
pg_num = res.document.num_pages()
|
||||
res.document.save_as_json(
|
||||
out_path / f"{res.input.file.stem}.md",
|
||||
image_mode=ImageRefMode.PLACEHOLDER,
|
||||
)
|
||||
|
||||
res.document.save_as_markdown(
|
||||
out_path / f"{res.input.file.stem}.md",
|
||||
image_mode=ImageRefMode.PLACEHOLDER,
|
||||
)
|
||||
|
||||
pg_num = res.document.num_pages()
|
||||
print("")
|
||||
inference_time = time.time() - start_time
|
||||
print(
|
||||
|
@ -13,6 +13,7 @@
|
||||
[](https://github.com/pre-commit/pre-commit)
|
||||
[](https://opensource.org/licenses/MIT)
|
||||
[](https://pepy.tech/projects/docling)
|
||||
[](https://lfaidata.foundation/projects/)
|
||||
|
||||
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
|
||||
|
||||
@ -25,12 +26,12 @@ Docling simplifies document processing, parsing diverse formats — including ad
|
||||
* 🔒 Local execution capabilities for sensitive data and air-gapped environments
|
||||
* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
|
||||
* 🔍 Extensive OCR support for scanned PDFs and images
|
||||
* 🥚 Support of Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview)) 🆕🔥
|
||||
* 💻 Simple and convenient CLI
|
||||
|
||||
### Coming soon
|
||||
|
||||
* 📝 Metadata extraction, including title, authors, references & language
|
||||
* 📝 Inclusion of Visual Language Models ([SmolDocling](https://huggingface.co/blog/smolervlm#smoldocling))
|
||||
* 📝 Chart understanding (Barchart, Piechart, LinePlot, etc)
|
||||
* 📝 Complex chemistry understanding (Molecular structures)
|
||||
|
||||
@ -43,9 +44,13 @@ Docling simplifies document processing, parsing diverse formats — including ad
|
||||
<a href="reference/document_converter/" class="card"><b>Reference</b><br />See more API details</a>
|
||||
</div>
|
||||
|
||||
## IBM ❤️ Open Source AI
|
||||
## LF AI & Data
|
||||
|
||||
Docling has been brought to you by IBM.
|
||||
Docling is hosted as a project in the [LF AI & Data Foundation](https://lfaidata.foundation/projects/).
|
||||
|
||||
### IBM ❤️ Open Source AI
|
||||
|
||||
The project was started by the AI for knowledge team at IBM Research Zurich.
|
||||
|
||||
[supported_formats]: ./usage/supported_formats.md
|
||||
[docling_document]: ./concepts/docling_document.md
|
||||
|
35
docs/integrations/apify.md
Normal file
35
docs/integrations/apify.md
Normal file
@@ -0,0 +1,35 @@
You can run Docling in the cloud without installation using the [Docling Actor][apify] on Apify platform. Simply provide a document URL and get the processed result:

<a href="https://apify.com/vancura/docling?fpr=docling"><img src="https://apify.com/ext/run-on-apify.png" alt="Run Docling Actor on Apify" width="176" height="39" /></a>

```bash
apify call vancura/docling -i '{
  "options": {
    "to_formats": ["md", "json", "html", "text", "doctags"]
  },
  "http_sources": [
    {"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
    {"url": "https://arxiv.org/pdf/2408.09869"}
  ]
}'
```

The Actor stores results in:

* Processed document in key-value store (`OUTPUT_RESULT`)
* Processing logs (`DOCLING_LOG`)
* Dataset record with result URL and status

Read more about the [Docling Actor](.actor/README.md), including how to use it via the Apify API and CLI.

- 💻 [GitHub][github]
- 📖 [Docs][docs]
- 📦 [Docling Actor][apify]

[github]: https://github.com/docling-project/docling/tree/main/.actor/
[docs]: https://github.com/docling-project/docling/tree/main/.actor/README.md
[apify]: https://apify.com/vancura/docling?fpr=docling
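The same Actor can also be driven from Python. A sketch assuming the `apify-client` package; the token placeholder and the result-fetching step are assumptions, while the input payload mirrors the CLI call above:

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")  # placeholder token
run = client.actor("vancura/docling").call(
    run_input={
        "options": {"to_formats": ["md"]},
        "http_sources": [{"url": "https://arxiv.org/pdf/2408.09869"}],
    }
)
# The processed document is stored in the run's key-value store under OUTPUT_RESULT.
record = client.key_value_store(run["defaultKeyValueStoreId"]).get_record("OUTPUT_RESULT")
print(record["value"] if record else "no output")
```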
@@ -17,10 +17,15 @@ print(result.document.export_to_markdown()) # output: "### Docling Technical Re

You can also use Docling directly from your command line to convert individual files —be it local or by URL— or whole directories.

A simple example would look like this:
```console
docling https://arxiv.org/pdf/2206.01062
```
You can also use 🥚[SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview) and other VLMs via Docling CLI:
```bash
docling --pipeline vlm --vlm-model smoldocling https://arxiv.org/pdf/2206.01062
```
This will use MLX acceleration on supported Apple Silicon hardware.


To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](../reference/cli.md).
@ -111,6 +111,7 @@ nav:
|
||||
- "LlamaIndex": integrations/llamaindex.md
|
||||
- "txtai": integrations/txtai.md
|
||||
- ⭐️ Featured:
|
||||
- "Apify": integrations/apify.md
|
||||
- "Data Prep Kit": integrations/data_prep_kit.md
|
||||
- "InstructLab": integrations/instructlab.md
|
||||
- "NVIDIA": integrations/nvidia.md
|
||||
|
748
poetry.lock
generated
748
poetry.lock
generated
File diff suppressed because it is too large
Load Diff
@ -1,6 +1,6 @@
|
||||
[tool.poetry]
|
||||
name = "docling"
|
||||
version = "2.26.0" # DO NOT EDIT, updated automatically
|
||||
version = "2.28.0" # DO NOT EDIT, updated automatically
|
||||
description = "SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications."
|
||||
authors = [
|
||||
"Christoph Auer <cau@zurich.ibm.com>",
|
||||
@ -46,9 +46,9 @@ packages = [{ include = "docling" }]
|
||||
######################
|
||||
python = "^3.9"
|
||||
pydantic = "^2.0.0"
|
||||
docling-core = {extras = ["chunking"], version = "^2.23.0"}
|
||||
docling-core = {extras = ["chunking"], version = "^2.23.1"}
|
||||
docling-ibm-models = "^3.4.0"
|
||||
docling-parse = "^4.0.0"
|
||||
docling-parse = {git = "https://github.com/DS4SD/docling-parse", rev = "cau/line-sanitation-update"}
|
||||
filetype = "^1.2.0"
|
||||
pypdfium2 = "^4.30.0"
|
||||
pydantic-settings = "^2.3.0"
|
||||
@@ -88,6 +88,7 @@ accelerate = [
]
pillow = ">=10.0.0,<12.0.0"
tqdm = "^4.65.0"
pluggy = "^1.0.0"
pylatexenc = "^2.10"

[tool.poetry.group.dev.dependencies]
@@ -156,6 +157,9 @@ rapidocr = ["rapidocr-onnxruntime", "onnxruntime"]
docling = "docling.cli.main:app"
docling-tools = "docling.cli.tools:app"

[tool.poetry.plugins."docling"]
"docling_defaults" = "docling.models.plugins.defaults"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
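The new `[tool.poetry.plugins."docling"]` entry point is what the `allow_external_plugins` switch taps into: a third-party package can expose its own plugin module under the same group. The sketch below shows only the general shape; the `ocr_engines` hook name and return format are assumptions modeled on the bundled `docling_defaults` module, and `MyOcrModel` is hypothetical.

```python
# my_ocr_plugin/plugin.py -- hypothetical module exposed via the "docling" entry-point group
# (e.g. my_ocr_plugin = "my_ocr_plugin.plugin" in the third-party package's own pyproject).
from my_ocr_plugin.model import MyOcrModel  # hypothetical BaseOcrModel subclass


def ocr_engines():  # hook name and return shape assumed, not confirmed by this diff
    return {"ocr_engines": [MyOcrModel]}
```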
@ -188,6 +192,7 @@ module = [
|
||||
"docling_ibm_models.*",
|
||||
"easyocr.*",
|
||||
"ocrmac.*",
|
||||
"mlx_vlm.*",
|
||||
"lxml.*",
|
||||
"huggingface_hub.*",
|
||||
"transformers.*",
|
||||
|
@ -1,4 +1,5 @@
|
||||
<document>
|
||||
<paragraph><location><page_1><loc_3><loc_75><loc_6><loc_80></location>2022</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_16><loc_85><loc_82><loc_86></location>TableFormer: Table Structure Understanding with Transformers.</subtitle-level-1>
|
||||
<subtitle-level-1><location><page_1><loc_23><loc_78><loc_74><loc_81></location>Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar IBM Research</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_34><loc_77><loc_62><loc_78></location>{ ahn,nli,mly,taa @zurich.ibm.com }</paragraph>
|
||||
@ -42,7 +43,7 @@
|
||||
<paragraph><location><page_2><loc_10><loc_25><loc_47><loc_29></location>- · We present SynthTabNet a synthetically generated dataset, with various appearance styles and complexity.</paragraph>
|
||||
<paragraph><location><page_2><loc_10><loc_19><loc_47><loc_24></location>- · An augmented dataset based on PubTabNet [37], FinTabNet [36], and TableBank [17] with generated ground-truth for reproducibility.</paragraph>
|
||||
<paragraph><location><page_2><loc_8><loc_12><loc_47><loc_18></location>The paper is structured as follows. In Sec. 2, we give a brief overview of the current state-of-the-art. In Sec. 3, we describe the datasets on which we train. In Sec. 4, we introduce the TableFormer model-architecture and describe</paragraph>
|
||||
<paragraph><location><page_2><loc_50><loc_86><loc_89><loc_90></location>its results & performance in Sec. 5. As a conclusion, we describe how this new model-architecture can be re-purposed for other tasks in the computer-vision community.</paragraph>
|
||||
<paragraph><location><page_2><loc_50><loc_86><loc_89><loc_90></location>its results &performance in Sec. 5. As a conclusion, we describe how this new model-architecture can be re-purposed for other tasks in the computer-vision community.</paragraph>
|
||||
<subtitle-level-1><location><page_2><loc_50><loc_83><loc_81><loc_85></location>2. Previous work and State of the Art</subtitle-level-1>
|
||||
<paragraph><location><page_2><loc_50><loc_58><loc_89><loc_82></location>Identifying the structure of a table has been an outstanding problem in the document-parsing community, that motivates many organised public challenges [6, 4, 14]. The difficulty of the problem can be attributed to a number of factors. First, there is a large variety in the shapes and sizes of tables. Such large variety requires a flexible method. This is especially true for complex column- and row headers, which can be extremely intricate and demanding. A second factor of complexity is the lack of data with regard to table-structure. Until the publication of PubTabNet [37], there were no large datasets (i.e. > 100 K tables) that provided structure information. This happens primarily due to the fact that tables are notoriously time-consuming to annotate by hand. However, this has definitely changed in recent years with the deliverance of PubTabNet [37], FinTabNet [36], TableBank [17] etc.</paragraph>
|
||||
<paragraph><location><page_2><loc_50><loc_43><loc_89><loc_58></location>Before the rising popularity of deep neural networks, the community relied heavily on heuristic and/or statistical methods to do table structure identification [3, 7, 11, 5, 13, 28]. Although such methods work well on constrained tables [12], a more data-driven approach can be applied due to the advent of convolutional neural networks (CNNs) and the availability of large datasets. To the best-of-our knowledge, there are currently two different types of network architecture that are being pursued for state-of-the-art tablestructure identification.</paragraph>
|
||||
@ -58,7 +59,7 @@
|
||||
<caption>Figure 2: Distribution of the tables across different table dimensions in PubTabNet + FinTabNet datasets</caption>
|
||||
</figure>
|
||||
<paragraph><location><page_3><loc_50><loc_59><loc_71><loc_60></location>balance in the previous datasets.</paragraph>
|
||||
<paragraph><location><page_3><loc_50><loc_21><loc_89><loc_58></location>The PubTabNet dataset contains 509k tables delivered as annotated PNG images. The annotations consist of the table structure represented in HTML format, the tokenized text and its bounding boxes per table cell. Fig. 1 shows the appearance style of PubTabNet. Depending on its complexity, a table is characterized as 'simple' when it does not contain row spans or column spans, otherwise it is 'complex'. The dataset is divided into Train and Val splits (roughly 98% and 2%). The Train split consists of 54% simple and 46% complex tables and the Val split of 51% and 49% respectively. The FinTabNet dataset contains 112k tables delivered as single-page PDF documents with mixed table structures and text content. Similarly to the PubTabNet, the annotations of FinTabNet include the table structure in HTML, the tokenized text and the bounding boxes on a table cell basis. The dataset is divided into Train, Test and Val splits (81%, 9.5%, 9.5%), and each one is almost equally divided into simple and complex tables (Train: 48% simple, 52% complex, Test: 48% simple, 52% complex, Test: 53% simple, 47% complex). Finally the TableBank dataset consists of 145k tables provided as JPEG images. The latter has annotations for the table structure, but only few with bounding boxes of the table cells. The entire dataset consists of simple tables and it is divided into 90% Train, 3% Test and 7% Val splits.</paragraph>
|
||||
<paragraph><location><page_3><loc_50><loc_21><loc_89><loc_58></location>The PubTabNet dataset contains 509k tables delivered as annotated PNGimages. The annotations consist of the table structure represented in HTML format, the tokenized text and its bounding boxes per table cell. Fig. 1 shows the appearance style of PubTabNet. Depending on its complexity, a table is characterized as 'simple' when it does not contain row spans or column spans, otherwise it is 'complex'. The dataset is divided into Train and Val splits (roughly 98%and 2%). The Train split consists of 54% simple and 46% complex tables and the Val split of 51% and 49% respectively. The FinTabNet dataset contains 112k tables delivered as single-page PDFdocuments with mixed table structures and text content. Similarly to the PubTabNet, the annotations of FinTabNet include the table structure in HTML, the tokenized text and the bounding boxes on a table cell basis. The dataset is divided into Train, Test and Val splits (81%, 9.5%, 9.5%), and each one is almost equally divided into simple and complex tables (Train: 48% simple, 52% complex, Test: 48% simple, 52% complex, Test: 53% simple, 47% complex). Finally the TableBank dataset consists of 145k tables provided as JPEG images. The latter has annotations for the table structure, but only few with bounding boxes of the table cells. The entire dataset consists of simple tables and it is divided into 90% Train, 3% Test and 7% Val splits.</paragraph>
|
||||
<paragraph><location><page_3><loc_50><loc_10><loc_89><loc_20></location>Due to the heterogeneity across the dataset formats, it was necessary to combine all available data into one homogenized dataset before we could train our models for practical purposes. Given the size of PubTabNet, we adopted its annotation format and we extracted and converted all tables as PNG images with a resolution of 72 dpi. Additionally, we have filtered out tables with extreme sizes due to small</paragraph>
|
||||
<paragraph><location><page_4><loc_8><loc_88><loc_47><loc_90></location>amount of such tables, and kept only those ones ranging between 1*1 and 20*10 (rows/columns).</paragraph>
|
||||
<paragraph><location><page_4><loc_8><loc_60><loc_47><loc_87></location>The availability of the bounding boxes for all table cells is essential to train our models. In order to distinguish between empty and non-empty bounding boxes, we have introduced a binary class in the annotation. Unfortunately, the original datasets either omit the bounding boxes for whole tables (e.g. TableBank) or they narrow their scope only to non-empty cells. Therefore, it was imperative to introduce a data pre-processing procedure that generates the missing bounding boxes out of the annotation information. This procedure first parses the provided table structure and calculates the dimensions of the most fine-grained grid that covers the table structure. Notice that each table cell may occupy multiple grid squares due to row or column spans. In case of PubTabNet we had to compute missing bounding boxes for 48% of the simple and 69% of the complex tables. Regarding FinTabNet, 68% of the simple and 98% of the complex tables require the generation of bounding boxes.</paragraph>
|
||||
@ -95,7 +96,7 @@
|
||||
<paragraph><location><page_5><loc_50><loc_63><loc_89><loc_68></location>forming classification, and adding an adaptive pooling layer of size 28*28. ResNet by default downsamples the image resolution by 32 and then the encoded image is provided to both the Structure Decoder , and Cell BBox Decoder .</paragraph>
|
||||
<paragraph><location><page_5><loc_50><loc_48><loc_89><loc_62></location>Structure Decoder. The transformer architecture of this component is based on the work proposed in [31]. After extensive experimentation, the Structure Decoder is modeled as a transformer encoder with two encoder layers and a transformer decoder made from a stack of 4 decoder layers that comprise mainly of multi-head attention and feed forward layers. This configuration uses fewer layers and heads in comparison to networks applied to other problems (e.g. 'Scene Understanding', 'Image Captioning'), something which we relate to the simplicity of table images.</paragraph>
|
||||
<paragraph><location><page_5><loc_50><loc_31><loc_89><loc_47></location>The transformer encoder receives an encoded image from the CNN Backbone Network and refines it through a multi-head dot-product attention layer, followed by a Feed Forward Network. During training, the transformer decoder receives as input the output feature produced by the transformer encoder, and the tokenized input of the HTML ground-truth tags. Using a stack of multi-head attention layers, different aspects of the tag sequence could be inferred. This is achieved by each attention head on a layer operating in a different subspace, and then combining altogether their attention score.</paragraph>
|
||||
<paragraph><location><page_5><loc_50><loc_18><loc_89><loc_31></location>Cell BBox Decoder. Our architecture allows to simultaneously predict HTML tags and bounding boxes for each table cell without the need of a separate object detector end to end. This approach is inspired by DETR [1] which employs a Transformer Encoder, and Decoder that looks for a specific number of object queries (potential object detections). As our model utilizes a transformer architecture, the hidden state of the < td > ' and ' < ' HTML structure tags become the object query.</paragraph>
|
||||
<paragraph><location><page_5><loc_50><loc_18><loc_89><loc_31></location>Cell BBox Decoder. Our architecture allows to simultaneously predict HTML tags and bounding boxes for each table cell without the need of a separate object detector end to end. This approach is inspired by DETR [1] which employs a Transformer Encoder, and Decoder that looks for a specific number of object queries (potential object detections). As our model utilizes a transformer architecture, the hidden state of the < td > ' and ' < ' HTMLstructure tags become the object query.</paragraph>
|
||||
<paragraph><location><page_5><loc_50><loc_10><loc_89><loc_17></location>The encoding generated by the CNN Backbone Network along with the features acquired for every data cell from the Transformer Decoder are then passed to the attention network. The attention network takes both inputs and learns to provide an attention weighted encoding. This weighted at-</paragraph>
|
||||
<paragraph><location><page_6><loc_8><loc_80><loc_47><loc_90></location>tention encoding is then multiplied to the encoded image to produce a feature for each table cell. Notice that this is different than the typical object detection problem where imbalances between the number of detections and the amount of objects may exist. In our case, we know up front that the produced detections always match with the table cells in number and correspondence.</paragraph>
|
||||
<paragraph><location><page_6><loc_8><loc_70><loc_47><loc_80></location>The output features for each table cell are then fed into the feed-forward network (FFN). The FFN consists of a Multi-Layer Perceptron (3 layers with ReLU activation function) that predicts the normalized coordinates for the bounding box of each table cell. Finally, the predicted bounding boxes are classified based on whether they are empty or not using a linear layer.</paragraph>
|
||||
@ -203,7 +204,7 @@
|
||||
<location><page_8><loc_63><loc_44><loc_89><loc_52></location>
|
||||
</figure>
|
||||
<subtitle-level-1><location><page_8><loc_8><loc_37><loc_27><loc_38></location>5.5. Qualitative Analysis</subtitle-level-1>
|
||||
<subtitle-level-1><location><page_8><loc_50><loc_37><loc_75><loc_38></location>6. Future Work & Conclusion</subtitle-level-1>
|
||||
<subtitle-level-1><location><page_8><loc_50><loc_37><loc_75><loc_38></location>6. Future Work &Conclusion</subtitle-level-1>
|
||||
<paragraph><location><page_8><loc_8><loc_10><loc_47><loc_32></location>We showcase several visualizations for the different components of our network on various 'complex' tables within datasets presented in this work in Fig. 5 and Fig. 6 As it is shown, our model is able to predict bounding boxes for all table cells, even for the empty ones. Additionally, our post-processing techniques can extract the cell content by matching the predicted bounding boxes to the PDF cells based on their overlap and spatial proximity. The left part of Fig. 5 demonstrates also the adaptability of our method to any language, as it can successfully extract Japanese text, although the training set contains only English content. We provide more visualizations including the intermediate steps in the supplementary material. Overall these illustrations justify the versatility of our method across a diverse range of table appearances and content type.</paragraph>
|
||||
<paragraph><location><page_8><loc_50><loc_18><loc_89><loc_35></location>In this paper, we presented TableFormer an end-to-end transformer based approach to predict table structures and bounding boxes of cells from an image. This approach enables us to recreate the table structure, and extract the cell content from PDF or OCR by using bounding boxes. Additionally, it provides the versatility required in real-world scenarios when dealing with various types of PDF documents, and languages. Furthermore, our method outperforms all state-of-the-arts with a wide margin. Finally, we introduce 'SynthTabNet' a challenging synthetically generated dataset that reinforces missing characteristics from other datasets.</paragraph>
|
||||
<subtitle-level-1><location><page_8><loc_50><loc_14><loc_60><loc_15></location>References</subtitle-level-1>
|
||||
@ -211,25 +212,25 @@
|
||||
<paragraph><location><page_9><loc_11><loc_85><loc_47><loc_90></location>- end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 , pages 213-229, Cham, 2020. Springer International Publishing. 5</paragraph>
|
||||
<paragraph><location><page_9><loc_9><loc_81><loc_47><loc_85></location>- [2] Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling Mao. Complicated table structure recognition. arXiv preprint arXiv:1908.04729 , 2019. 3</paragraph>
|
||||
<paragraph><location><page_9><loc_9><loc_77><loc_47><loc_81></location>- [3] Bertrand Couasnon and Aurelie Lemaitre. Recognition of Tables and Forms , pages 647-677. Springer London, London, 2014. 2</paragraph>
|
||||
<paragraph><location><page_9><loc_9><loc_71><loc_47><loc_76></location>- [4] Herv´ e D´ ejean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2</paragraph>
|
||||
<paragraph><location><page_9><loc_9><loc_71><loc_47><loc_76></location>- [4] Herv´ e D´jean, e Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2</paragraph>
|
||||
<paragraph><location><page_9><loc_9><loc_66><loc_47><loc_71></location>- [5] Basilios Gatos, Dimitrios Danatsas, Ioannis Pratikakis, and Stavros J Perantonis. Automatic table detection in document images. In International Conference on Pattern Recognition and Image Analysis , pages 609-618. Springer, 2005. 2</paragraph>
|
||||
<paragraph><location><page_9><loc_9><loc_60><loc_47><loc_65></location>- [6] MaxG¨ obel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2</paragraph>
|
||||
<paragraph><location><page_9><loc_9><loc_60><loc_47><loc_65></location>- [6] MaxG¨bel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. o Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2</paragraph>
|
||||
<paragraph><location><page_9><loc_9><loc_56><loc_47><loc_60></location>- [7] EA Green and M Krishnamoorthy. Recognition of tables using table grammars. procs. In Symposium on Document Analysis and Recognition (SDAIR'95) , pages 261-277. 2</paragraph>
|
||||
<paragraph><location><page_9><loc_9><loc_49><loc_47><loc_56></location>- [8] Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal. Castabdetectors: Cascade network for table detection in document images with recursive feature pyramid and switchable atrous convolution. Journal of Imaging , 7(10), 2021. 1</paragraph>
|
||||
<paragraph><location><page_9><loc_9><loc_45><loc_47><loc_49></location>- [9] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) , Oct 2017. 1</paragraph>
|
||||
<paragraph><location><page_9><loc_8><loc_39><loc_47><loc_44></location>- [10] Yelin He, X. Qi, Jiaquan Ye, Peng Gao, Yihao Chen, Bingcong Li, Xin Tang, and Rong Xiao. Pingan-vcgroup's solution for icdar 2021 competition on scientific table image recognition to latex. ArXiv , abs/2105.01846, 2021. 2</paragraph>
|
||||
<paragraph><location><page_9><loc_8><loc_32><loc_47><loc_39></location>- [11] Jianying Hu, Ramanujan S Kashi, Daniel P Lopresti, and Gordon Wilfong. Medium-independent table detection. In Document Recognition and Retrieval VII , volume 3967, pages 291-302. International Society for Optics and Photonics, 1999. 2</paragraph>
|
||||
<paragraph><location><page_9><loc_8><loc_25><loc_47><loc_32></location>- [12] Matthew Hurst. A constraint-based approach to table structure derivation. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2 , ICDAR '03, page 911, USA, 2003. IEEE Computer Society. 2</paragraph>
|
||||
<paragraph><location><page_9><loc_8><loc_18><loc_47><loc_25></location>- [13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ ement Chatelain, and Thierry Paquet. Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2</paragraph>
|
||||
<paragraph><location><page_9><loc_8><loc_18><loc_47><loc_25></location>- [13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ment Chatelain, and Thierry Paquet. e Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2</paragraph>
|
||||
<paragraph><location><page_9><loc_8><loc_14><loc_47><loc_18></location>- [14] Pratik Kayal, Mrinal Anand, Harsh Desai, and Mayank Singh. Icdar 2021 competition on scientific table image recognition to latex, 2021. 2</paragraph>
|
||||
<paragraph><location><page_9><loc_8><loc_10><loc_47><loc_14></location>- [15] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6</paragraph>
|
||||
<paragraph><location><page_9><loc_8><loc_10><loc_47><loc_14></location>- [15] Harold WKuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6</paragraph>
|
||||
<paragraph><location><page_9><loc_50><loc_82><loc_89><loc_90></location>- [16] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence , 35(12):2891-2903, 2013. 4</paragraph>
|
||||
<paragraph><location><page_9><loc_50><loc_78><loc_89><loc_82></location>- [17] Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. Tablebank: A benchmark dataset for table detection and recognition, 2019. 2, 3</paragraph>
|
||||
<paragraph><location><page_9><loc_50><loc_67><loc_89><loc_78></location>- [18] Yiren Li, Zheng Huang, Junchi Yan, Yi Zhou, Fan Ye, and Xianhui Liu. Gfte: Graph-based financial table extraction. In Alberto Del Bimbo, Rita Cucchiara, Stan Sclaroff, Giovanni Maria Farinella, Tao Mei, Marco Bertini, Hugo Jair Escalante, and Roberto Vezzani, editors, Pattern Recognition. ICPR International Workshops and Challenges , pages 644-658, Cham, 2021. Springer International Publishing. 2, 3</paragraph>
|
||||
<paragraph><location><page_9><loc_50><loc_59><loc_89><loc_67></location>- [19] Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, and Peter Staar. Robust pdf document conversion using recurrent neural networks. Proceedings of the AAAI Conference on Artificial Intelligence , 35(17):15137-15145, May 2021. 1</paragraph>
|
||||
<paragraph><location><page_9><loc_50><loc_53><loc_89><loc_58></location>- [20] Rujiao Long, Wen Wang, Nan Xue, Feiyu Gao, Zhibo Yang, Yongpan Wang, and Gui-Song Xia. Parsing table structures in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 944-952, 2021. 2</paragraph>
|
||||
<paragraph><location><page_9><loc_50><loc_45><loc_89><loc_53></location>- [21] Shubham Singh Paliwal, D Vishwanath, Rohit Rahul, Monika Sharma, and Lovekesh Vig. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 128-133. IEEE, 2019. 1</paragraph>
|
||||
<paragraph><location><page_9><loc_50><loc_30><loc_89><loc_44></location>- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6</paragraph>
<paragraph><location><page_9><loc_50><loc_30><loc_89><loc_44></location>- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´-Buc, E. e Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6</paragraph>
<paragraph><location><page_9><loc_50><loc_21><loc_89><loc_29></location>- [23] Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. Cascadetabnet: An approach for end to end table detection and structure recognition from image-based documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages 572-573, 2020. 1</paragraph>
<paragraph><location><page_9><loc_50><loc_16><loc_89><loc_21></location>- [24] Shah Rukh Qasim, Hassan Mahmood, and Faisal Shafait. Rethinking table recognition using graph neural networks. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 142-147. IEEE, 2019. 3</paragraph>
<paragraph><location><page_9><loc_50><loc_10><loc_89><loc_15></location>- [25] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on</paragraph>
@ -238,7 +239,7 @@
<paragraph><location><page_10><loc_8><loc_71><loc_47><loc_79></location>- [27] Sebastian Schreiber, Stefan Agne, Ivo Wolf, Andreas Dengel, and Sheraz Ahmed. Deepdesrt: Deep learning for detection and structure recognition of tables in document images. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR) , volume 1, pages 1162-1167. IEEE, 2017. 3</paragraph>
<paragraph><location><page_10><loc_8><loc_66><loc_47><loc_71></location>- [28] Faisal Shafait and Ray Smith. Table detection in heterogeneous documents. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems , pages 6572, 2010. 2</paragraph>
<paragraph><location><page_10><loc_8><loc_59><loc_47><loc_65></location>- [29] Shoaib Ahmed Siddiqui, Imran Ali Fateh, Syed Tahseen Raza Rizvi, Andreas Dengel, and Sheraz Ahmed. Deeptabstr: Deep learning based table structure recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 1403-1409. IEEE, 2019. 3</paragraph>
<paragraph><location><page_10><loc_8><loc_52><loc_47><loc_58></location>- [30] Peter W J Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD , KDD '18, pages 774-782, New York, NY, USA, 2018. ACM. 1</paragraph>
<paragraph><location><page_10><loc_8><loc_52><loc_47><loc_58></location>- [30] Peter WJ Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD , KDD '18, pages 774-782, New York, NY, USA, 2018. ACM. 1</paragraph>
<paragraph><location><page_10><loc_8><loc_42><loc_47><loc_51></location>- [31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30 , pages 5998-6008. Curran Associates, Inc., 2017. 5</paragraph>
<paragraph><location><page_10><loc_8><loc_37><loc_47><loc_42></location>- [32] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , June 2015. 2</paragraph>
<paragraph><location><page_10><loc_8><loc_31><loc_47><loc_36></location>- [33] Wenyuan Xue, Qingyong Li, and Dacheng Tao. Res2tim: reconstruct syntactic structures from table images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 749-755. IEEE, 2019. 3</paragraph>
@ -251,7 +252,7 @@
<subtitle-level-1><location><page_11><loc_22><loc_83><loc_76><loc_86></location>TableFormer: Table Structure Understanding with Transformers Supplementary Material</subtitle-level-1>
<subtitle-level-1><location><page_11><loc_8><loc_78><loc_29><loc_80></location>1. Details on the datasets</subtitle-level-1>
<subtitle-level-1><location><page_11><loc_8><loc_76><loc_25><loc_77></location>1.1. Data preparation</subtitle-level-1>
<paragraph><location><page_11><loc_8><loc_51><loc_47><loc_75></location>As a first step of our data preparation process, we have calculated statistics over the datasets across the following dimensions: (1) table size measured in the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes). A table is considered to be simple if it does not contain row spans or column spans. Additionally, a table has a strict HTML structure if every row has the same number of columns after taking into account any row or column spans. Therefore a strict HTML structure looks always rectangular. However, HTML is a lenient encoding format, i.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. These implicit rules leave room for ambiguity, which we want to avoid. As such, we prefer to have 'strict' tables, i.e. tables where every row has exactly the same length.</paragraph>
<paragraph><location><page_11><loc_8><loc_51><loc_47><loc_75></location>As a first step of our data preparation process, we have calculated statistics over the datasets across the following dimensions: (1) table size measured in the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes). Atable is considered to be simple if it does not contain row spans or column spans. Additionally, a table has a strict HTMLstructure if every row has the same number of columns after taking into account any row or column spans. Therefore a strict HTML structure looks always rectangular. However, HTML is a lenient encoding format, i.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. These implicit rules leave room for ambiguity, which we want to avoid. As such, we prefer to have 'strict' tables, i.e. tables where every row has exactly the same length.</paragraph>
<paragraph><location><page_11><loc_8><loc_21><loc_47><loc_51></location>We have developed a technique that tries to derive a missing bounding box out of its neighbors. As a first step, we use the annotation data to generate the most fine-grained grid that covers the table structure. In case of strict HTML tables, all grid squares are associated with some table cell and in the presence of table spans a cell extends across multiple grid squares. When enough bounding boxes are known for a rectangular table, it is possible to compute the geometrical border lines between the grid rows and columns. Eventually this information is used to generate the missing bounding boxes. Additionally, the existence of unused grid squares indicates that the table rows have unequal number of columns and the overall structure is non-strict. The generation of missing bounding boxes for non-strict HTML tables is ambiguous and therefore quite challenging. Thus, we have decided to simply discard those tables. In case of PubTabNet we have computed missing bounding boxes for 48% of the simple and 69% of the complex tables. Regarding FinTabNet, 68% of the simple and 98% of the complex tables require the generation of bounding boxes.</paragraph>
<paragraph><location><page_11><loc_8><loc_18><loc_47><loc_20></location>Figure 7 illustrates the distribution of the tables across different dimensions per dataset.</paragraph>
<subtitle-level-1><location><page_11><loc_8><loc_15><loc_25><loc_16></location>1.2. Synthetic datasets</subtitle-level-1>
File diff suppressed because one or more lines are too long
@ -1,3 +1,5 @@
2022
## TableFormer: Table Structure Understanding with Transformers.
## Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar IBM Research
@ -54,7 +56,7 @@ To meet the design criteria listed above, we developed a new model called TableF
The paper is structured as follows. In Sec. 2, we give a brief overview of the current state-of-the-art. In Sec. 3, we describe the datasets on which we train. In Sec. 4, we introduce the TableFormer model-architecture and describe
its results & performance in Sec. 5. As a conclusion, we describe how this new model-architecture can be re-purposed for other tasks in the computer-vision community.
its results &performance in Sec. 5. As a conclusion, we describe how this new model-architecture can be re-purposed for other tasks in the computer-vision community.
## 2. Previous work and State of the Art
@ -81,7 +83,7 @@ Figure 2: Distribution of the tables across different table dimensions in PubTab
balance in the previous datasets.
The PubTabNet dataset contains 509k tables delivered as annotated PNG images. The annotations consist of the table structure represented in HTML format, the tokenized text and its bounding boxes per table cell. Fig. 1 shows the appearance style of PubTabNet. Depending on its complexity, a table is characterized as 'simple' when it does not contain row spans or column spans, otherwise it is 'complex'. The dataset is divided into Train and Val splits (roughly 98% and 2%). The Train split consists of 54% simple and 46% complex tables and the Val split of 51% and 49% respectively. The FinTabNet dataset contains 112k tables delivered as single-page PDF documents with mixed table structures and text content. Similarly to the PubTabNet, the annotations of FinTabNet include the table structure in HTML, the tokenized text and the bounding boxes on a table cell basis. The dataset is divided into Train, Test and Val splits (81%, 9.5%, 9.5%), and each one is almost equally divided into simple and complex tables (Train: 48% simple, 52% complex, Test: 48% simple, 52% complex, Test: 53% simple, 47% complex). Finally the TableBank dataset consists of 145k tables provided as JPEG images. The latter has annotations for the table structure, but only few with bounding boxes of the table cells. The entire dataset consists of simple tables and it is divided into 90% Train, 3% Test and 7% Val splits.
The PubTabNet dataset contains 509k tables delivered as annotated PNGimages. The annotations consist of the table structure represented in HTML format, the tokenized text and its bounding boxes per table cell. Fig. 1 shows the appearance style of PubTabNet. Depending on its complexity, a table is characterized as 'simple' when it does not contain row spans or column spans, otherwise it is 'complex'. The dataset is divided into Train and Val splits (roughly 98%and 2%). The Train split consists of 54% simple and 46% complex tables and the Val split of 51% and 49% respectively. The FinTabNet dataset contains 112k tables delivered as single-page PDFdocuments with mixed table structures and text content. Similarly to the PubTabNet, the annotations of FinTabNet include the table structure in HTML, the tokenized text and the bounding boxes on a table cell basis. The dataset is divided into Train, Test and Val splits (81%, 9.5%, 9.5%), and each one is almost equally divided into simple and complex tables (Train: 48% simple, 52% complex, Test: 48% simple, 52% complex, Test: 53% simple, 47% complex). Finally the TableBank dataset consists of 145k tables provided as JPEG images. The latter has annotations for the table structure, but only few with bounding boxes of the table cells. The entire dataset consists of simple tables and it is divided into 90% Train, 3% Test and 7% Val splits.
Due to the heterogeneity across the dataset formats, it was necessary to combine all available data into one homogenized dataset before we could train our models for practical purposes. Given the size of PubTabNet, we adopted its annotation format and we extracted and converted all tables as PNG images with a resolution of 72 dpi. Additionally, we have filtered out tables with extreme sizes due to small
@ -132,7 +134,7 @@ Structure Decoder. The transformer architecture of this component is based on th
The transformer encoder receives an encoded image from the CNN Backbone Network and refines it through a multi-head dot-product attention layer, followed by a Feed Forward Network. During training, the transformer decoder receives as input the output feature produced by the transformer encoder, and the tokenized input of the HTML ground-truth tags. Using a stack of multi-head attention layers, different aspects of the tag sequence could be inferred. This is achieved by each attention head on a layer operating in a different subspace, and then combining altogether their attention score.
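For orientation only, the encoder/decoder interplay described in the excerpt above can be sketched with stock PyTorch modules. The dimensions, layer counts and missing causal mask below are placeholders, not TableFormer's actual configuration:

```python
import torch
import torch.nn as nn

# Toy dimensions only; TableFormer's real hyper-parameters are not reproduced here.
d_model, n_heads = 256, 8
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, n_heads), num_layers=2)
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, n_heads), num_layers=2)

image_features = torch.randn(196, 1, d_model)  # flattened CNN feature map, shape (S, N, E)
tag_embeddings = torch.randn(40, 1, d_model)   # embedded HTML structure tags seen so far

memory = encoder(image_features)               # refined image representation
tag_states = decoder(tag_embeddings, memory)   # states used to predict the next structure tag
# A causal mask on the tag sequence, used during training, is omitted for brevity.
```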
Cell BBox Decoder. Our architecture allows to simultaneously predict HTML tags and bounding boxes for each table cell without the need of a separate object detector end to end. This approach is inspired by DETR [1] which employs a Transformer Encoder, and Decoder that looks for a specific number of object queries (potential object detections). As our model utilizes a transformer architecture, the hidden state of the < td > ' and ' < ' HTML structure tags become the object query.
Cell BBox Decoder. Our architecture allows to simultaneously predict HTML tags and bounding boxes for each table cell without the need of a separate object detector end to end. This approach is inspired by DETR [1] which employs a Transformer Encoder, and Decoder that looks for a specific number of object queries (potential object detections). As our model utilizes a transformer architecture, the hidden state of the < td > ' and ' < ' HTMLstructure tags become the object query.
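A loose illustration of the idea that the decoder hidden states of the cell-opening tags act as object queries for box regression. The MLP head and the tag-selection logic below are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class CellBBoxHead(nn.Module):
    """Regress one normalized (cx, cy, w, h) box per cell-opening tag."""

    def __init__(self, hidden_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 4),
        )

    def forward(self, decoder_states, td_positions):
        # decoder_states: (seq_len, hidden_dim) states of the structure decoder
        # td_positions:   indices of the tags that open a table cell
        queries = decoder_states[td_positions]   # (num_cells, hidden_dim)
        return self.mlp(queries).sigmoid()       # boxes normalized to [0, 1]

# Example: 3 cell-opening tags somewhere in a 50-token structure sequence.
head = CellBBoxHead()
boxes = head(torch.randn(50, 512), torch.tensor([4, 9, 14]))
```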
The encoding generated by the CNN Backbone Network along with the features acquired for every data cell from the Transformer Decoder are then passed to the attention network. The attention network takes both inputs and learns to provide an attention weighted encoding. This weighted at-
@ -260,7 +262,7 @@ Figure 5: One of the benefits of TableFormer is that it is language agnostic, as
## 5.5. Qualitative Analysis
## 6. Future Work & Conclusion
## 6. Future Work &Conclusion
We showcase several visualizations for the different components of our network on various 'complex' tables within datasets presented in this work in Fig. 5 and Fig. 6 As it is shown, our model is able to predict bounding boxes for all table cells, even for the empty ones. Additionally, our post-processing techniques can extract the cell content by matching the predicted bounding boxes to the PDF cells based on their overlap and spatial proximity. The left part of Fig. 5 demonstrates also the adaptability of our method to any language, as it can successfully extract Japanese text, although the training set contains only English content. We provide more visualizations including the intermediate steps in the supplementary material. Overall these illustrations justify the versatility of our method across a diverse range of table appearances and content type.
@ -276,11 +278,11 @@ In this paper, we presented TableFormer an end-to-end transformer based approach
- [3] Bertrand Couasnon and Aurelie Lemaitre. Recognition of Tables and Forms , pages 647-677. Springer London, London, 2014. 2
- [4] Herv´ e D´ ejean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2
- [4] Herv´ e D´jean, e Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2
- [5] Basilios Gatos, Dimitrios Danatsas, Ioannis Pratikakis, and Stavros J Perantonis. Automatic table detection in document images. In International Conference on Pattern Recognition and Image Analysis , pages 609-618. Springer, 2005. 2
- [6] MaxG¨ obel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2
- [6] MaxG¨bel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. o Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2
- [7] EA Green and M Krishnamoorthy. Recognition of tables using table grammars. procs. In Symposium on Document Analysis and Recognition (SDAIR'95) , pages 261-277. 2
@ -294,11 +296,11 @@ In this paper, we presented TableFormer an end-to-end transformer based approach
- [12] Matthew Hurst. A constraint-based approach to table structure derivation. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2 , ICDAR '03, page 911, USA, 2003. IEEE Computer Society. 2
- [13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ ement Chatelain, and Thierry Paquet. Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2
- [13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ment Chatelain, and Thierry Paquet. e Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2
- [14] Pratik Kayal, Mrinal Anand, Harsh Desai, and Mayank Singh. Icdar 2021 competition on scientific table image recognition to latex, 2021. 2
- [15] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6
- [15] Harold WKuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6
- [16] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence , 35(12):2891-2903, 2013. 4
@ -312,7 +314,7 @@ In this paper, we presented TableFormer an end-to-end transformer based approach
- [21] Shubham Singh Paliwal, D Vishwanath, Rohit Rahul, Monika Sharma, and Lovekesh Vig. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 128-133. IEEE, 2019. 1
- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6
- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´-Buc, E. e Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6
- [23] Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. Cascadetabnet: An approach for end to end table detection and structure recognition from image-based documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages 572-573, 2020. 1
@ -330,7 +332,7 @@ Computer Vision and Pattern Recognition , pages 658-666, 2019. 6
- [29] Shoaib Ahmed Siddiqui, Imran Ali Fateh, Syed Tahseen Raza Rizvi, Andreas Dengel, and Sheraz Ahmed. Deeptabstr: Deep learning based table structure recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 1403-1409. IEEE, 2019. 3
- [30] Peter W J Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD , KDD '18, pages 774-782, New York, NY, USA, 2018. ACM. 1
- [30] Peter WJ Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD , KDD '18, pages 774-782, New York, NY, USA, 2018. ACM. 1
- [31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30 , pages 5998-6008. Curran Associates, Inc., 2017. 5
@ -356,7 +358,7 @@ Computer Vision and Pattern Recognition , pages 658-666, 2019. 6
## 1.1. Data preparation
As a first step of our data preparation process, we have calculated statistics over the datasets across the following dimensions: (1) table size measured in the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes). A table is considered to be simple if it does not contain row spans or column spans. Additionally, a table has a strict HTML structure if every row has the same number of columns after taking into account any row or column spans. Therefore a strict HTML structure looks always rectangular. However, HTML is a lenient encoding format, i.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. These implicit rules leave room for ambiguity, which we want to avoid. As such, we prefer to have 'strict' tables, i.e. tables where every row has exactly the same length.
As a first step of our data preparation process, we have calculated statistics over the datasets across the following dimensions: (1) table size measured in the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes). Atable is considered to be simple if it does not contain row spans or column spans. Additionally, a table has a strict HTMLstructure if every row has the same number of columns after taking into account any row or column spans. Therefore a strict HTML structure looks always rectangular. However, HTML is a lenient encoding format, i.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. These implicit rules leave room for ambiguity, which we want to avoid. As such, we prefer to have 'strict' tables, i.e. tables where every row has exactly the same length.
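As a small illustration of the 'simple' and 'strict' criteria defined above, a sketch under the assumption that the HTML has already been parsed into per-cell rowspan/colspan values (not the paper's code):

```python
def is_simple(rows):
    """A table is 'simple' if it contains no row spans or column spans."""
    return all(rs == 1 and cs == 1 for row in rows for rs, cs in row)

def is_strict(rows):
    """A table is 'strict' if every row covers the same number of columns once
    row and column spans are expanded. Each row is a list of (rowspan, colspan)
    tuples, one per cell that opens in that row."""
    carried = []   # (rows_remaining, colspan) for cells spanning down from above
    widths = []
    for row in rows:
        width = sum(cs for _, cs in carried) + sum(cs for _, cs in row)
        carried = [(rr - 1, cs) for rr, cs in carried if rr > 1]
        carried += [(rs - 1, cs) for rs, cs in row if rs > 1]
        widths.append(width)
    return len(set(widths)) <= 1

# A 2x2 table whose first cell spans both rows: strict but not simple.
table = [[(2, 1), (1, 1)], [(1, 1)]]
assert is_strict(table) and not is_simple(table)
```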
We have developed a technique that tries to derive a missing bounding box out of its neighbors. As a first step, we use the annotation data to generate the most fine-grained grid that covers the table structure. In case of strict HTML tables, all grid squares are associated with some table cell and in the presence of table spans a cell extends across multiple grid squares. When enough bounding boxes are known for a rectangular table, it is possible to compute the geometrical border lines between the grid rows and columns. Eventually this information is used to generate the missing bounding boxes. Additionally, the existence of unused grid squares indicates that the table rows have unequal number of columns and the overall structure is non-strict. The generation of missing bounding boxes for non-strict HTML tables is ambiguous and therefore quite challenging. Thus, we have decided to simply discard those tables. In case of PubTabNet we have computed missing bounding boxes for 48% of the simple and 69% of the complex tables. Regarding FinTabNet, 68% of the simple and 98% of the complex tables require the generation of bounding boxes.
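The border-line reconstruction described above might be sketched as follows, assuming the known cell boxes have already been assigned to squares of a rectangular grid; the median-based border estimate is an assumption for illustration, not the exact procedure used for the datasets:

```python
import statistics

def fill_missing_bboxes(grid, n_rows, n_cols):
    """`grid` maps (row, col) -> (x0, y0, x1, y1) for grid squares with a known
    cell box; squares without a box are absent. Returns a copy of the grid in
    which missing boxes are synthesized from the estimated row/column borders."""
    col_x0, col_x1, row_y0, row_y1 = {}, {}, {}, {}
    for (r, c), (x0, y0, x1, y1) in grid.items():
        col_x0.setdefault(c, []).append(x0)
        col_x1.setdefault(c, []).append(x1)
        row_y0.setdefault(r, []).append(y0)
        row_y1.setdefault(r, []).append(y1)

    filled = dict(grid)
    for r in range(n_rows):
        for c in range(n_cols):
            if (r, c) in filled or c not in col_x0 or r not in row_y0:
                continue   # cannot reconstruct without neighbours in this row/column
            filled[(r, c)] = (
                statistics.median(col_x0[c]), statistics.median(row_y0[r]),
                statistics.median(col_x1[c]), statistics.median(row_y1[r]),
            )
    return filled
```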
File diff suppressed because one or more lines are too long
@ -1,4 +1,5 @@
<document>
<paragraph><location><page_1><loc_3><loc_74><loc_6><loc_79></location>2022</paragraph>
<subtitle-level-1><location><page_1><loc_18><loc_85><loc_83><loc_89></location>DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis</subtitle-level-1>
<paragraph><location><page_1><loc_15><loc_77><loc_32><loc_83></location>Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com</paragraph>
<paragraph><location><page_1><loc_42><loc_77><loc_58><loc_83></location>Christoph Auer IBM Research Rueschlikon, Switzerland cau@zurich.ibm.com</paragraph>
@ -24,7 +25,7 @@
<paragraph><location><page_1><loc_52><loc_11><loc_91><loc_18></location>Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar. 2022. DocLayNet: A Large Human-Annotated Dataset for DocumentLayout Analysis. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), August 14-18, 2022, Washington, DC, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/ 3534678.3539043</paragraph>
<subtitle-level-1><location><page_2><loc_9><loc_88><loc_26><loc_89></location>1 INTRODUCTION</subtitle-level-1>
<paragraph><location><page_2><loc_9><loc_71><loc_50><loc_86></location>Despite the substantial improvements achieved with machine-learning (ML) approaches and deep neural networks in recent years, document conversion remains a challenging problem, as demonstrated by the numerous public competitions held on this topic [1-4]. The challenge originates from the huge variability in PDF documents regarding layout, language and formats (scanned, programmatic or a combination of both). Engineering a single ML model that can be applied on all types of documents and provides high-quality layout segmentation remains to this day extremely challenging [5]. To highlight the variability in document layouts, we show a few example documents from the DocLayNet dataset in Figure 1.</paragraph>
<paragraph><location><page_2><loc_9><loc_37><loc_48><loc_71></location>Akeyproblem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or L A T E X sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank is very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.</paragraph>
<paragraph><location><page_2><loc_9><loc_37><loc_48><loc_71></location>Akeyproblem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or LT E X A sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank is very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.</paragraph>
<paragraph><location><page_2><loc_9><loc_27><loc_48><loc_36></location>In this paper, we present the DocLayNet dataset. It provides pageby-page layout annotation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique document pages, of which a fraction carry double- or triple-annotations. DocLayNet is similar in spirit to PubLayNet and DocBank and will likewise be made available to the public 1 in order to stimulate the document-layout analysis community. It distinguishes itself in the following aspects:</paragraph>
<paragraph><location><page_2><loc_11><loc_22><loc_48><loc_26></location>- (1) Human Annotation : In contrast to PubLayNet and DocBank, we relied on human annotation instead of automation approaches to generate the data set.</paragraph>
<paragraph><location><page_2><loc_11><loc_20><loc_48><loc_22></location>- (2) Large Layout Variability : We include diverse and complex layouts from a large variety of public sources.</paragraph>
@ -35,10 +36,10 @@
<paragraph><location><page_2><loc_52><loc_72><loc_91><loc_79></location>All aspects outlined above are detailed in Section 3. In Section 4, we will elaborate on how we designed and executed this large-scale human annotation campaign. We will also share key insights and lessons learned that might prove helpful for other parties planning to set up annotation campaigns.</paragraph>
<paragraph><location><page_2><loc_52><loc_61><loc_91><loc_72></location>In Section 5, we will present baseline accuracy numbers for a variety of object detection methods (Faster R-CNN, Mask R-CNN and YOLOv5) trained on DocLayNet. We further show how the model performance is impacted by varying the DocLayNet dataset size, reducing the label set and modifying the train/test-split. Last but not least, we compare the performance of models trained on PubLayNet, DocBank and DocLayNet and demonstrate that a model trained on DocLayNet provides overall more robust layout recovery.</paragraph>
<subtitle-level-1><location><page_2><loc_52><loc_58><loc_69><loc_59></location>2 RELATED WORK</subtitle-level-1>
<paragraph><location><page_2><loc_52><loc_41><loc_91><loc_56></location>While early approaches in document-layout analysis used rulebased algorithms and heuristics [8], the problem is lately addressed with deep learning methods. The most common approach is to leverage object detection models [9-15]. In the last decade, the accuracy and speed of these models has increased dramatically. Furthermore, most state-of-the-art object detection methods can be trained and applied with very little work, thanks to a standardisation effort of the ground-truth data format [16] and common deep-learning frameworks [17]. Reference data sets such as PubLayNet [6] and DocBank provide their data in the commonly accepted COCO format [16].</paragraph>
<paragraph><location><page_2><loc_52><loc_41><loc_91><loc_56></location>While early approaches in document-layout analysis used rulebased algorithms and heuristics [8], the problem is lately addressed with deep learning methods. The most commonapproach is to leverage object detection models [9-15]. In the last decade, the accuracy and speed of these models has increased dramatically. Furthermore, most state-of-the-art object detection methods can be trained and applied with very little work, thanks to a standardisation effort of the ground-truth data format [16] and common deep-learning frameworks [17]. Reference data sets such as PubLayNet [6] and DocBank provide their data in the commonly accepted COCO format [16].</paragraph>
<paragraph><location><page_2><loc_52><loc_30><loc_91><loc_41></location>Lately, new types of ML models for document-layout analysis have emerged in the community [18-21]. These models do not approach the problem of layout analysis purely based on an image representation of the page, as computer vision methods do. Instead, they combine the text tokens and image representation of a page in order to obtain a segmentation. While the reported accuracies appear to be promising, a broadly accepted data format which links geometric and textual features has yet to establish.</paragraph>
<subtitle-level-1><location><page_2><loc_52><loc_27><loc_78><loc_29></location>3 THE DOCLAYNET DATASET</subtitle-level-1>
<paragraph><location><page_2><loc_52><loc_15><loc_91><loc_25></location>DocLayNet contains 80863 PDF pages. Among these, 7059 carry two instances of human annotations, and 1591 carry three. This amounts to 91104 total annotation instances. The annotations provide layout information in the shape of labeled, rectangular boundingboxes. We define 11 distinct labels for layout features, namely Caption , Footnote , Formula List-item , , Page-footer , Page-header , Picture , Section-header , Table , Text , and Title . Our reasoning for picking this particular label set is detailed in Section 4.</paragraph>
<paragraph><location><page_2><loc_52><loc_15><loc_91><loc_25></location>DocLayNet contains 80863 PDF pages. Amongthese, 7059 carry two instances of humanannotations, and 1591 carry three. This amounts to 91104 total annotation instances. The annotations provide layout information in the shape of labeled, rectangular boundingboxes. We define 11 distinct labels for layout features, namely Caption , Footnote , Formula List-item , , Page-footer , Page-header , Picture , Section-header , Table , Text , and Title . Our reasoning for picking this particular label set is detailed in Section 4.</paragraph>
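Assuming the labeled bounding-box annotations are distributed in the commonly accepted COCO format mentioned elsewhere in this text, inspecting them could look like the following sketch (the annotation file path is a placeholder, not the dataset's actual layout):

```python
from pycocotools.coco import COCO

# Hypothetical annotation file path; the dataset's real file layout may differ.
coco = COCO("annotations/val.json")

# Category ids map to the layout labels (Caption, Footnote, Formula, ...).
labels = {c["id"]: c["name"] for c in coco.loadCats(coco.getCatIds())}

img_id = coco.getImgIds()[0]
for ann in coco.loadAnns(coco.getAnnIds(imgIds=[img_id])):
    x, y, w, h = ann["bbox"]          # COCO boxes are (x, y, width, height)
    print(labels[ann["category_id"]], (x, y, w, h))
```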
<paragraph><location><page_2><loc_52><loc_11><loc_91><loc_14></location>In addition to open intellectual property constraints for the source documents, we required that the documents in DocLayNet adhere to a few conditions. Firstly, we kept scanned documents</paragraph>
<figure>
<location><page_3><loc_14><loc_72><loc_43><loc_88></location>
@ -56,7 +57,7 @@
<table>
<location><page_4><loc_16><loc_63><loc_84><loc_83></location>
<caption>Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row 'Total') in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.</caption>
<row_0><col_0><body></col_0><col_1><body></col_1><col_2><col_header>% of Total</col_2><col_3><col_header>% of Total</col_3><col_4><col_header>% of Total</col_4><col_5><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_5><col_6><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_6><col_7><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_7><col_8><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_8><col_9><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_9><col_10><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_10><col_11><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_11></row_0>
<row_0><col_0><body></col_0><col_1><body></col_1><col_2><col_header>% of Total</col_2><col_3><col_header>% of Total</col_3><col_4><col_header>% of Total</col_4><col_5><col_header>triple inter-annotator mAP @0.5-0.95 (%)</col_5><col_6><col_header>triple inter-annotator mAP @0.5-0.95 (%)</col_6><col_7><col_header>triple inter-annotator mAP @0.5-0.95 (%)</col_7><col_8><col_header>triple inter-annotator mAP @0.5-0.95 (%)</col_8><col_9><col_header>triple inter-annotator mAP @0.5-0.95 (%)</col_9><col_10><col_header>triple inter-annotator mAP @0.5-0.95 (%)</col_10><col_11><col_header>triple inter-annotator mAP @0.5-0.95 (%)</col_11></row_0>
<row_1><col_0><col_header>class label</col_0><col_1><col_header>Count</col_1><col_2><col_header>Train</col_2><col_3><col_header>Test</col_3><col_4><col_header>Val</col_4><col_5><col_header>All</col_5><col_6><col_header>Fin</col_6><col_7><col_header>Man</col_7><col_8><col_header>Sci</col_8><col_9><col_header>Law</col_9><col_10><col_header>Pat</col_10><col_11><col_header>Ten</col_11></row_1>
<row_2><col_0><row_header>Caption</col_0><col_1><body>22524</col_1><col_2><body>2.04</col_2><col_3><body>1.77</col_3><col_4><body>2.32</col_4><col_5><body>84-89</col_5><col_6><body>40-61</col_6><col_7><body>86-92</col_7><col_8><body>94-99</col_8><col_9><body>95-99</col_9><col_10><body>69-78</col_10><col_11><body>n/a</col_11></row_2>
<row_3><col_0><row_header>Footnote</col_0><col_1><body>6318</col_1><col_2><body>0.60</col_2><col_3><body>0.31</col_3><col_4><body>0.58</col_4><col_5><body>83-91</col_5><col_6><body>n/a</col_6><col_7><body>100</col_7><col_8><body>62-88</col_8><col_9><body>85-94</col_9><col_10><body>n/a</col_10><col_11><body>82-97</col_11></row_3>
@ -77,9 +78,9 @@
<caption>Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.</caption>
</figure>
<paragraph><location><page_4><loc_9><loc_15><loc_48><loc_20></location>we distributed the annotation workload and performed continuous quality controls. Phase one and two required a small team of experts only. For phases three and four, a group of 40 dedicated annotators were assembled and supervised.</paragraph>
<paragraph><location><page_4><loc_9><loc_11><loc_48><loc_14></location><location><page_4><loc_9><loc_11><loc_48><loc_14></location>Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. A large effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv 3 , government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.</paragraph>
<paragraph><location><page_4><loc_9><loc_11><loc_48><loc_14></location><location><page_4><loc_9><loc_11><loc_48><loc_14></location>Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. Alarge effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv 3 , government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.</paragraph>
<paragraph><location><page_4><loc_52><loc_36><loc_91><loc_52></location>Preparation work included uploading and parsing the sourced PDF documents in the Corpus Conversion Service (CCS) [22], a cloud-native platform which provides a visual annotation interface and allows for dataset inspection and analysis. The annotation interface of CCS is shown in Figure 3. The desired balance of pages between the different document categories was achieved by selective subsampling of pages with certain desired properties. For example, we made sure to include the title page of each document and bias the remaining page selection to those with figures or tables. The latter was achieved by leveraging pre-trained object detection models from PubLayNet, which helped us estimate how many figures and tables a given page contains.</paragraph>
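The step of biasing the remaining page selection towards pages with figures or tables could, for example, be a weighted draw over per-page object counts estimated by a pre-trained detector. The weighting heuristic in this sketch is an assumption, not the procedure actually used:

```python
import random

def pick_pages(pages, k, seed=0):
    """Weighted sampling without replacement, favouring pages with more
    detected figures/tables. `pages` is a list of dicts such as
    {"id": "doc1-p3", "n_figures": 2, "n_tables": 1} (hypothetical fields)."""
    rng = random.Random(seed)

    # Exponential-key trick: drawing key = Exp(1)/weight and keeping the k
    # smallest keys samples proportionally to the weights, without replacement.
    def key(page):
        weight = 1 + page["n_figures"] + page["n_tables"]
        return rng.expovariate(1.0) / weight

    return sorted(pages, key=key)[:k]
```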
<paragraph><location><page_4><loc_52><loc_13><loc_91><loc_36></location>Phase 2: Label selection and guideline. We reviewed the collected documents and identified the most common structural features they exhibit. This was achieved by identifying recurrent layout elements and lead us to the definition of 11 distinct class labels. These 11 class labels are Caption , Footnote , Formula List-item , , Pagefooter , Page-header , Picture , Section-header , Table , Text , and Title . Critical factors that were considered for the choice of these class labels were (1) the overall occurrence of the label, (2) the specificity of the label, (3) recognisability on a single page (i.e. no need for context from previous or next page) and (4) overall coverage of the page. Specificity ensures that the choice of label is not ambiguous, while coverage ensures that all meaningful items on a page can be annotated. We refrained from class labels that are very specific to a document category, such as Abstract in the Scientific Articles category. We also avoided class labels that are tightly linked to the semantics of the text. Labels such as Author and Affiliation , as seen in DocBank, are often only distinguishable by discriminating on</paragraph>
<paragraph><location><page_4><loc_52><loc_13><loc_91><loc_36></location>Phase 2: Label selection and guideline. We reviewed the collected documents and identified the most common structural features they exhibit. This wasachieved by identifying recurrent layout elements and lead us to the definition of 11 distinct class labels. These 11 class labels are Caption , Footnote , Formula List-item , , Pagefooter , Page-header , Picture , Section-header , Table , Text , and Title . Critical factors that were considered for the choice of these class labels were (1) the overall occurrence of the label, (2) the specificity of the label, (3) recognisability on a single page (i.e. no need for context from previous or next page) and (4) overall coverage of the page. Specificity ensures that the choice of label is not ambiguous, while coverage ensures that all meaningful items on a page can be annotated. We refrained from class labels that are very specific to a document category, such as Abstract in the Scientific Articles category. We also avoided class labels that are tightly linked to the semantics of the text. Labels such as Author and Affiliation , as seen in DocBank, are often only distinguishable by discriminating on</paragraph>
<paragraph><location><page_5><loc_9><loc_87><loc_48><loc_89></location>the textual content of an element, which goes beyond visual layout recognition, in particular outside the Scientific Articles category.</paragraph>
<paragraph><location><page_5><loc_9><loc_69><loc_48><loc_86></location>At first sight, the task of visual document-layout interpretation appears intuitive enough to obtain plausible annotations in most cases. However, during early trial-runs in the core team, we observed many cases in which annotators use different annotation styles, especially for documents with challenging layouts. For example, if a figure is presented with subfigures, one annotator might draw a single figure bounding-box, while another might annotate each subfigure separately. The same applies for lists, where one might annotate all list items in one block or each list item separately. In essence, we observed that challenging layouts would be annotated in different but plausible ways. To illustrate this, we show in Figure 4 multiple examples of plausible but inconsistent annotations on the same pages.</paragraph>
<paragraph><location><page_5><loc_9><loc_57><loc_48><loc_68></location>Obviously, this inconsistency in annotations is not desirable for datasets which are intended to be used for model training. To minimise these inconsistencies, we created a detailed annotation guideline. While perfect consistency across 40 annotation staff members is clearly not possible to achieve, we saw a huge improvement in annotation consistency after the introduction of our annotation guideline. A few selected, non-trivial highlights of the guideline are:</paragraph>
@ -97,8 +98,8 @@
<paragraph><location><page_5><loc_65><loc_42><loc_78><loc_42></location>05237a14f2524e3f53c8454b074409d05078038a6a36b770fcc8ec7e540deae0</paragraph>
<caption><location><page_5><loc_52><loc_36><loc_91><loc_40></location>Figure 4: Examples of plausible annotation alternatives for the same page. Criteria in our annotation guideline can resolve cases A to C, while the case D remains ambiguous.</caption>
<paragraph><location><page_5><loc_52><loc_31><loc_91><loc_34></location>were carried out over a timeframe of 12 weeks, after which 8 of the 40 initially allocated annotators did not pass the bar.</paragraph>
<paragraph><location><page_5><loc_52><loc_10><loc_91><loc_31></location>Phase 4: Production annotation. The previously selected 80K pages were annotated with the defined 11 class labels by 32 annotators. This production phase took around three months to complete. All annotations were created online through CCS, which visualises the programmatic PDF text-cells as an overlay on the page. The page annotation are obtained by drawing rectangular bounding-boxes, as shown in Figure 3. With regard to the annotation practices, we implemented a few constraints and capabilities on the tooling level. First, we only allow non-overlapping, vertically oriented, rectangular boxes. For the large majority of documents, this constraint was sufficient and it speeds up the annotation considerably in comparison with arbitrary segmentation shapes. Second, annotator staff were not able to see each other's annotations. This was enforced by design to avoid any bias in the annotation, which could skew the numbers of the inter-annotator agreement (see Table 1). We wanted</paragraph>
<paragraph><location><page_6><loc_9><loc_77><loc_48><loc_89></location>Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on DocLayNet test set. The MRCNN (Mask R-CNN) and FRCNN (Faster R-CNN) models with ResNet-50 or ResNet-101 backbone were trained based on the network architectures from the detectron2 model zoo (Mask R-CNN R50, R101-FPN 3x, Faster R-CNN R101-FPN 3x), with default configurations. The YOLO implementation utilized was YOLOv5x6 [13]. All models were initialised using pre-trained weights from the COCO 2017 dataset.</paragraph>
<paragraph><location><page_5><loc_52><loc_10><loc_91><loc_31></location>Phase 4: Production annotation. The previously selected 80K pages were annotated with the defined 11 class labels by 32 annotators. This production phase took around three months to complete. All annotations were created online through CCS, which visualises the programmatic PDFtext-cells as an overlay on the page. The page annotation are obtained by drawing rectangular bounding-boxes, as shown in Figure 3. With regard to the annotation practices, we implemented a few constraints and capabilities on the tooling level. First, we only allow non-overlapping, vertically oriented, rectangular boxes. For the large majority of documents, this constraint was sufficient and it speeds up the annotation considerably in comparison with arbitrary segmentation shapes. Second, annotator staff were not able to see each other's annotations. This was enforced by design to avoid any bias in the annotation, which could skew the numbers of the inter-annotator agreement (see Table 1). Wewanted</paragraph>
<paragraph><location><page_6><loc_9><loc_77><loc_48><loc_89></location>Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on DocLayNet test set. The MRCNN (Mask R-CNN) and FRCNN (Faster R-CNN) models with ResNet-50 or ResNet-101 backbone were trained based on the network architectures from the detectron2 model zoo (Mask R-CNN R50, R101-FPN 3x, Faster R-CNN R101-FPN 3x), with default configurations. The YOLOimplementation utilized was YOLOv5x6 [13]. All models were initialised using pre-trained weights from the COCO 2017 dataset.</paragraph>
<table>
<location><page_6><loc_10><loc_56><loc_47><loc_75></location>
<row_0><col_0><body></col_0><col_1><col_header>human</col_1><col_2><col_header>MRCNN</col_2><col_3><col_header>MRCNN</col_3><col_4><col_header>FRCNN</col_4><col_5><col_header>YOLO</col_5></row_0>
@ -118,10 +119,10 @@
</table>
<paragraph><location><page_6><loc_9><loc_27><loc_48><loc_53></location>to avoid this at any cost in order to have clear, unbiased baseline numbers for human document-layout annotation. Third, we introduced the feature of snapping boxes around text segments to obtain a pixel-accurate annotation and again reduce time and effort. The CCS annotation tool automatically shrinks every user-drawn box to the minimum bounding-box around the enclosed text-cells for all purely text-based segments, which excludes only Table and Picture . For the latter, we instructed annotation staff to minimise inclusion of surrounding whitespace while including all graphical lines. A downside of snapping boxes to enclosed text cells is that some wrongly parsed PDF pages cannot be annotated correctly and need to be skipped. Fourth, we established a way to flag pages as rejected for cases where no valid annotation according to the label guidelines could be achieved. Example cases for this would be PDF pages that render incorrectly or contain layouts that are impossible to capture with non-overlapping rectangles. Such rejected pages are not contained in the final dataset. With all these measures in place, experienced annotation staff managed to annotate a single page in a typical timeframe of 20s to 60s, depending on its complexity.</paragraph>
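The box-snapping behaviour described in this paragraph, shrinking every user-drawn box to the minimal box around the enclosed text cells, reduces to a small geometric helper. The cell representation below is an assumption for illustration:

```python
def snap_to_text_cells(drawn, text_cells):
    """Shrink a user-drawn box to the minimal bounding box of the text cells it
    fully encloses. All boxes are (x0, y0, x1, y1) tuples; the drawn box is
    returned unchanged when it encloses no cell (e.g. Picture or Table)."""
    x0, y0, x1, y1 = drawn
    inside = [c for c in text_cells
              if c[0] >= x0 and c[1] >= y0 and c[2] <= x1 and c[3] <= y1]
    if not inside:
        return drawn
    return (min(c[0] for c in inside), min(c[1] for c in inside),
            max(c[2] for c in inside), max(c[3] for c in inside))
```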
<subtitle-level-1><location><page_6><loc_9><loc_24><loc_24><loc_26></location>5 EXPERIMENTS</subtitle-level-1>
<paragraph><location><page_6><loc_9><loc_10><loc_48><loc_23></location>The primary goal of DocLayNet is to obtain high-quality ML models capable of accurate document-layout analysis on a wide variety of challenging layouts. As discussed in Section 2, object detection models are currently the easiest to use, due to the standardisation of ground-truth data in COCO format [16] and the availability of general frameworks such as detectron2 [17]. Furthermore, baseline numbers in PubLayNet and DocBank were obtained using standard object detection models such as Mask R-CNN and Faster R-CNN. As such, we will relate to these object detection methods in this</paragraph>
<paragraph><location><page_6><loc_9><loc_10><loc_48><loc_23></location>The primary goal of DocLayNet is to obtain high-quality MLmodels capable of accurate document-layout analysis on a wide variety of challenging layouts. As discussed in Section 2, object detection models are currently the easiest to use, due to the standardisation of ground-truth data in COCO format [16] and the availability of general frameworks such as detectron2 [17]. Furthermore, baseline numbers in PubLayNet and DocBank were obtained using standard object detection models such as Mask R-CNN and Faster R-CNN. As such, we will relate to these object detection methods in this</paragraph>
<figure>
<location><page_6><loc_53><loc_67><loc_90><loc_89></location>
<caption>Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask R-CNNnetworkwithResNet50backbonetrainedonincreasing fractions of the DocLayNet dataset. The learning curve flattens around the 80% mark, indicating that increasing the size of the DocLayNet dataset with similar data will not yield significantly better predictions.</caption>
<caption>Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask R-CNNnetworkwithResNet50backbonetrainedonincreasing fractions of the DocLayNet dataset. The learning curve flattens around the 80% mark, indicating that increasing the size of the DocLayNetdataset with similar data will not yield significantly better predictions.</caption>
</figure>
<paragraph><location><page_6><loc_52><loc_49><loc_91><loc_51></location>paper and leave the detailed evaluation of more recent methods mentioned in Section 2 for future work.</paragraph>
<paragraph><location><page_6><loc_52><loc_39><loc_91><loc_49></location>In this section, we will present several aspects related to the performance of object detection models on DocLayNet. Similarly as in PubLayNet, we will evaluate the quality of their predictions using mean average precision (mAP) with 10 overlaps that range from 0.5 to 0.95 in steps of 0.05 (mAP@0.5-0.95). These scores are computed by leveraging the evaluation code provided by the COCO API [16].</paragraph>
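Computing mAP@0.5-0.95 with the COCO API as described here (the 0.5:0.05:0.95 IoU range is pycocotools' default) might look roughly like this sketch; the file names are placeholders:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("ground_truth.json")            # COCO-format ground truth
coco_dt = coco_gt.loadRes("predictions.json")  # detections: image_id, category_id, bbox, score

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # the first reported value is AP @ IoU=0.50:0.95
```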
@ -168,10 +169,10 @@
</table>
<paragraph><location><page_7><loc_52><loc_47><loc_91><loc_58></location>lists in PubLayNet (grouped list-items) versus DocLayNet (separate list-items), the label set of size 4 is the closest to PubLayNet, in the assumption that the List is down-mapped to Text in PubLayNet. The results in Table 3 show that the prediction accuracy on the remaining class labels does not change significantly when other classes are merged into them. The overall macro-average improves by around 5%, in particular when Page-footer and Page-header are excluded.</paragraph>
<subtitle-level-1><location><page_7><loc_52><loc_45><loc_90><loc_46></location>Impact of Document Split in Train and Test Set</subtitle-level-1>
<paragraph><location><page_7><loc_52><loc_25><loc_91><loc_44></location>Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: pagewise splitting gains ˜ 0% in mAP over the document-wise splitting. 1 Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.</paragraph>
|
||||
<paragraph><location><page_7><loc_52><loc_25><loc_91><loc_44></location>Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: page-wise splitting gains ~10% in mAP over the document-wise splitting. Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.</paragraph>
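To make the document-wise split concrete, here is a minimal sketch of splitting pages on document boundaries. It assumes each page record carries a hypothetical doc_id field and is an illustration only, not the script used to build DocLayNet.

```python
# Sketch of a document-wise split: every page of a document lands in exactly
# one of train/test/val. The "doc_id" field is a hypothetical page attribute.
import random
from collections import defaultdict

def split_by_document(pages, test_frac=0.1, val_frac=0.1, seed=0):
    by_doc = defaultdict(list)
    for page in pages:
        by_doc[page["doc_id"]].append(page)

    doc_ids = sorted(by_doc)
    random.Random(seed).shuffle(doc_ids)

    n_test = int(len(doc_ids) * test_frac)
    n_val = int(len(doc_ids) * val_frac)
    test_docs = set(doc_ids[:n_test])
    val_docs = set(doc_ids[n_test:n_test + n_val])

    splits = {"train": [], "test": [], "val": []}
    for doc_id, doc_pages in by_doc.items():
        if doc_id in test_docs:
            splits["test"].extend(doc_pages)
        elif doc_id in val_docs:
            splits["val"].extend(doc_pages)
        else:
            splits["train"].extend(doc_pages)
    return splits
```

A page-wise random split, by contrast, would shuffle the pages directly and ignore the document id, which is exactly the setting the paragraph above warns against.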
|
||||
<subtitle-level-1><location><page_7><loc_52><loc_22><loc_68><loc_23></location>Dataset Comparison</subtitle-level-1>
|
||||
<paragraph><location><page_7><loc_52><loc_11><loc_91><loc_21></location>Throughout this paper, we claim that DocLayNet's wider variety of document layouts leads to more robust layout detection models. In Table 5, we provide evidence for that. We trained models on each of the available datasets (PubLayNet, DocBank and DocLayNet) and evaluated them on the test sets of the other datasets. Due to the different label sets and annotation styles, a direct comparison is not possible. Hence, we focussed on the common labels among the datasets. Between PubLayNet and DocLayNet, these are Picture ,</paragraph>
|
||||
<paragraph><location><page_8><loc_9><loc_81><loc_48><loc_89></location>Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network across the PubLayNet, DocBank & DocLayNet data-sets. By evaluating on common label classes of each dataset, we observe that the DocLayNet-trained model has much less pronounced variations in performance across all datasets.</paragraph>
|
||||
<paragraph><location><page_8><loc_9><loc_81><loc_48><loc_89></location>Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network across the PubLayNet, DocBank & DocLayNet data-sets. By evaluating on common label classes of each dataset, we observe that the DocLayNet-trained model has much less pronounced variations in performance across all datasets.</paragraph>
|
||||
<table>
|
||||
<location><page_8><loc_12><loc_57><loc_45><loc_78></location>
|
||||
<row_0><col_0><body></col_0><col_1><body></col_1><col_2><col_header>Testing on</col_2><col_3><col_header>Testing on</col_3><col_4><col_header>Testing on</col_4></row_0>
|
||||
@ -191,7 +192,7 @@
|
||||
<row_14><col_0><body></col_0><col_1><row_header>total</col_1><col_2><body>59</col_2><col_3><body>47</col_3><col_4><body>78</col_4></row_14>
|
||||
</table>
|
||||
<paragraph><location><page_8><loc_9><loc_44><loc_48><loc_51></location>Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in Table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .</paragraph>
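As a rough illustration of the label harmonisation described above, the sketch below remaps annotations onto the common label set. The exact mapping and exclusion rules are given in Table 3 of the paper and are not reproduced here; both dictionaries are assumptions made purely for illustration.

```python
# Illustrative label remapping onto the common label set used for the
# cross-dataset comparison. These dictionaries are assumptions; the
# authoritative rules are in Table 3 of the paper.
DOCLAYNET_TO_COMMON = {
    "Picture": "Picture",
    "Section-header": "Section-header",
    "Table": "Table",
    "Text": "Text",
    "List-item": "Text",   # assumed down-mapping, mirroring PubLayNet's List -> Text
    # other DocLayNet labels (Caption, Footnote, Formula, Page-footer,
    # Page-header, Title) are excluded from the common label set
}

PUBLAYNET_TO_COMMON = {
    "figure": "Picture",          # assumed correspondence
    "title": "Section-header",    # assumed correspondence
    "table": "Table",
    "text": "Text",
    "list": "Text",               # List is mapped to Text, as stated above
}

def remap(annotations, mapping):
    """Keep only annotations whose label maps into the common label set."""
    return [
        {**ann, "label": mapping[ann["label"]]}
        for ann in annotations
        if ann["label"] in mapping
    ]
```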
|
||||
<paragraph><location><page_8><loc_9><loc_26><loc_48><loc_44></location>For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. We had to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.</paragraph>
|
||||
<paragraph><location><page_8><loc_9><loc_26><loc_48><loc_44></location>For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. We had to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.</paragraph>
|
||||
<subtitle-level-1><location><page_8><loc_9><loc_22><loc_25><loc_23></location>Example Predictions</subtitle-level-1>
|
||||
<paragraph><location><page_8><loc_9><loc_11><loc_48><loc_22></location>To conclude this section, we illustrate the quality of layout predictions one can expect from DocLayNet-trained models by providing a selection of examples without any further post-processing applied. Figure 6 shows selected layout predictions on pages from the test-set of DocLayNet. Results look decent in general across document categories, however one can also observe mistakes such as overlapping clusters of different classes, or entirely missing boxes due to low confidence.</paragraph>
|
||||
<subtitle-level-1><location><page_8><loc_52><loc_88><loc_66><loc_89></location>6 CONCLUSION</subtitle-level-1>
|
||||
@ -202,7 +203,7 @@
|
||||
<paragraph><location><page_8><loc_52><loc_53><loc_91><loc_56></location>- [1] Max Göbel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013.</paragraph>
|
||||
<paragraph><location><page_8><loc_52><loc_49><loc_91><loc_53></location>- [2] Christian Clausner, Apostolos Antonacopoulos, and Stefan Pletschacher. Icdar2017 competition on recognition of documents with complex layouts rdcl2017. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) , volume 01, pages 1404-1410, 2017.</paragraph>
|
||||
<paragraph><location><page_8><loc_52><loc_46><loc_91><loc_49></location>- [3] Hervé Déjean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), April 2019. http://sac.founderit.com/.</paragraph>
|
||||
<paragraph><location><page_8><loc_52><loc_42><loc_91><loc_46></location>- [4] Antonio Jimeno Yepes, Peter Zhong, and Douglas Burdick. Competition on scientific literature parsing. In Proceedings of the International Conference on Document Analysis and Recognition , ICDAR, pages 605-617. LNCS 12824, SpringerVerlag, sep 2021.</paragraph>
|
||||
<paragraph><location><page_8><loc_52><loc_42><loc_91><loc_46></location>- [4] Antonio Jimeno Yepes, Peter Zhong, and Douglas Burdick. Competition on scientific literature parsing. In Proceedings of the International Conference on Document Analysis and Recognition , ICDAR, pages 605-617. LNCS 12824, Springer-Verlag, sep 2021.</paragraph>
|
||||
<paragraph><location><page_8><loc_52><loc_38><loc_91><loc_42></location>- [5] Logan Markewich, Hao Zhang, Yubin Xing, Navid Lambert-Shirzad, Jiang Zhexin, Roy Lee, Zhi Li, and Seok-Bum Ko. Segmentation for document layout analysis: not dead yet. International Journal on Document Analysis and Recognition (IJDAR) , pages 1-11, 01 2022.</paragraph>
|
||||
<paragraph><location><page_8><loc_52><loc_35><loc_91><loc_38></location>- [6] Xu Zhong, Jianbin Tang, and Antonio Jimeno-Yepes. Publaynet: Largest dataset ever for document layout analysis. In Proceedings of the International Conference on Document Analysis and Recognition , ICDAR, pages 1015-1022, sep 2019.</paragraph>
|
||||
<paragraph><location><page_8><loc_52><loc_30><loc_91><loc_35></location>- [7] Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. Docbank: A benchmark dataset for document layout analysis. In Proceedings of the 28th International Conference on Computational Linguistics , COLING, pages 949-960. International Committee on Computational Linguistics, dec 2020.</paragraph>
|
||||
|
File diff suppressed because one or more lines are too long
@ -1,3 +1,5 @@
|
||||
2022
|
||||
|
||||
## DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
|
||||
|
||||
Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com
|
||||
@ -43,7 +45,7 @@ Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staa
|
||||
|
||||
Despite the substantial improvements achieved with machine-learning (ML) approaches and deep neural networks in recent years, document conversion remains a challenging problem, as demonstrated by the numerous public competitions held on this topic [1-4]. The challenge originates from the huge variability in PDF documents regarding layout, language and formats (scanned, programmatic or a combination of both). Engineering a single ML model that can be applied on all types of documents and provides high-quality layout segmentation remains to this day extremely challenging [5]. To highlight the variability in document layouts, we show a few example documents from the DocLayNet dataset in Figure 1.
|
||||
|
||||
Akeyproblem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or L A T E X sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank is very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.
|
||||
A key problem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or LaTeX sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank are very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.
|
||||
|
||||
In this paper, we present the DocLayNet dataset. It provides page-by-page layout annotation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique document pages, of which a fraction carry double- or triple-annotations. DocLayNet is similar in spirit to PubLayNet and DocBank and will likewise be made available to the public 1 in order to stimulate the document-layout analysis community. It distinguishes itself in the following aspects:
|
||||
|
||||
@ -65,13 +67,13 @@ In Section 5, we will present baseline accuracy numbers for a variety of object
|
||||
|
||||
## 2 RELATED WORK
|
||||
|
||||
While early approaches in document-layout analysis used rulebased algorithms and heuristics [8], the problem is lately addressed with deep learning methods. The most common approach is to leverage object detection models [9-15]. In the last decade, the accuracy and speed of these models has increased dramatically. Furthermore, most state-of-the-art object detection methods can be trained and applied with very little work, thanks to a standardisation effort of the ground-truth data format [16] and common deep-learning frameworks [17]. Reference data sets such as PubLayNet [6] and DocBank provide their data in the commonly accepted COCO format [16].
|
||||
While early approaches in document-layout analysis used rule-based algorithms and heuristics [8], the problem is lately addressed with deep learning methods. The most common approach is to leverage object detection models [9-15]. In the last decade, the accuracy and speed of these models has increased dramatically. Furthermore, most state-of-the-art object detection methods can be trained and applied with very little work, thanks to a standardisation effort of the ground-truth data format [16] and common deep-learning frameworks [17]. Reference data sets such as PubLayNet [6] and DocBank provide their data in the commonly accepted COCO format [16].
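For context, a minimal example of what the COCO detection format referenced here looks like; the numbers and the file name are invented for illustration.

```python
# Minimal example of the COCO detection format: images, annotations with
# [x, y, width, height] bounding boxes, and categories. All values are
# invented for illustration.
coco_sample = {
    "images": [
        {"id": 1, "file_name": "page_0001.png", "width": 1025, "height": 1025},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 10,                    # e.g. the "Text" class
            "bbox": [72.0, 540.0, 450.0, 120.0],  # x, y, width, height in pixels
            "area": 450.0 * 120.0,
            "iscrowd": 0,
        },
    ],
    "categories": [
        {"id": 10, "name": "Text"},
    ],
}
```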
|
||||
|
||||
Lately, new types of ML models for document-layout analysis have emerged in the community [18-21]. These models do not approach the problem of layout analysis purely based on an image representation of the page, as computer vision methods do. Instead, they combine the text tokens and image representation of a page in order to obtain a segmentation. While the reported accuracies appear to be promising, a broadly accepted data format which links geometric and textual features has yet to be established.
|
||||
|
||||
## 3 THE DOCLAYNET DATASET
|
||||
|
||||
DocLayNet contains 80863 PDF pages. Among these, 7059 carry two instances of human annotations, and 1591 carry three. This amounts to 91104 total annotation instances. The annotations provide layout information in the shape of labeled, rectangular boundingboxes. We define 11 distinct labels for layout features, namely Caption , Footnote , Formula List-item , , Page-footer , Page-header , Picture , Section-header , Table , Text , and Title . Our reasoning for picking this particular label set is detailed in Section 4.
|
||||
DocLayNet contains 80863 PDF pages. Among these, 7059 carry two instances of human annotations, and 1591 carry three. This amounts to 91104 total annotation instances. The annotations provide layout information in the shape of labeled, rectangular bounding-boxes. We define 11 distinct labels for layout features, namely Caption , Footnote , Formula , List-item , Page-footer , Page-header , Picture , Section-header , Table , Text , and Title . Our reasoning for picking this particular label set is detailed in Section 4.
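The 11 labels listed above, written out as COCO-style categories; the numeric ids are arbitrary placeholders chosen here purely for illustration.

```python
# The 11 DocLayNet class labels as COCO-style category entries.
# The ids are arbitrary placeholders, not the dataset's official ids.
DOCLAYNET_CATEGORIES = [
    {"id": i + 1, "name": name}
    for i, name in enumerate([
        "Caption", "Footnote", "Formula", "List-item", "Page-footer",
        "Page-header", "Picture", "Section-header", "Table", "Text", "Title",
    ])
]
```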
|
||||
|
||||
In addition to open intellectual property constraints for the source documents, we required that the documents in DocLayNet adhere to a few conditions. Firstly, we kept scanned documents
|
||||
|
||||
@ -98,32 +100,32 @@ The annotation campaign was carried out in four phases. In phase one, we identif
|
||||
|
||||
Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row 'Total') in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.
|
||||
|
||||
| | | % of Total | % of Total | % of Total | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) |
|
||||
|----------------|---------|--------------|--------------|--------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
|
||||
| class label | Count | Train | Test | Val | All | Fin | Man | Sci | Law | Pat | Ten |
|
||||
| Caption | 22524 | 2.04 | 1.77 | 2.32 | 84-89 | 40-61 | 86-92 | 94-99 | 95-99 | 69-78 | n/a |
|
||||
| Footnote | 6318 | 0.60 | 0.31 | 0.58 | 83-91 | n/a | 100 | 62-88 | 85-94 | n/a | 82-97 |
|
||||
| Formula | 25027 | 2.25 | 1.90 | 2.96 | 83-85 | n/a | n/a | 84-87 | 86-96 | n/a | n/a |
|
||||
| List-item | 185660 | 17.19 | 13.34 | 15.82 | 87-88 | 74-83 | 90-92 | 97-97 | 81-85 | 75-88 | 93-95 |
|
||||
| Page-footer | 70878 | 6.51 | 5.58 | 6.00 | 93-94 | 88-90 | 95-96 | 100 | 92-97 | 100 | 96-98 |
|
||||
| Page-header | 58022 | 5.10 | 6.70 | 5.06 | 85-89 | 66-76 | 90-94 | 98-100 | 91-92 | 97-99 | 81-86 |
|
||||
| Picture | 45976 | 4.21 | 2.78 | 5.31 | 69-71 | 56-59 | 82-86 | 69-82 | 80-95 | 66-71 | 59-76 |
|
||||
| Section-header | 142884 | 12.60 | 15.77 | 12.85 | 83-84 | 76-81 | 90-92 | 94-95 | 87-94 | 69-73 | 78-86 |
|
||||
| Table | 34733 | 3.20 | 2.27 | 3.60 | 77-81 | 75-80 | 83-86 | 98-99 | 58-80 | 79-84 | 70-85 |
|
||||
| Text | 510377 | 45.82 | 49.28 | 45.00 | 84-86 | 81-86 | 88-93 | 89-93 | 87-92 | 71-79 | 87-95 |
|
||||
| Title | 5071 | 0.47 | 0.30 | 0.50 | 60-72 | 24-63 | 50-63 | 94-100 | 82-96 | 68-79 | 24-56 |
|
||||
| Total | 1107470 | 941123 | 99816 | 66531 | 82-83 | 71-74 | 79-81 | 89-94 | 86-91 | 71-76 | 68-85 |
|
||||
| | | % of Total | % of Total | % of Total | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) |
|
||||
|----------------|---------|--------------|--------------|--------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|
|
||||
| class label | Count | Train | Test | Val | All | Fin | Man | Sci | Law | Pat | Ten |
|
||||
| Caption | 22524 | 2.04 | 1.77 | 2.32 | 84-89 | 40-61 | 86-92 | 94-99 | 95-99 | 69-78 | n/a |
|
||||
| Footnote | 6318 | 0.60 | 0.31 | 0.58 | 83-91 | n/a | 100 | 62-88 | 85-94 | n/a | 82-97 |
|
||||
| Formula | 25027 | 2.25 | 1.90 | 2.96 | 83-85 | n/a | n/a | 84-87 | 86-96 | n/a | n/a |
|
||||
| List-item | 185660 | 17.19 | 13.34 | 15.82 | 87-88 | 74-83 | 90-92 | 97-97 | 81-85 | 75-88 | 93-95 |
|
||||
| Page-footer | 70878 | 6.51 | 5.58 | 6.00 | 93-94 | 88-90 | 95-96 | 100 | 92-97 | 100 | 96-98 |
|
||||
| Page-header | 58022 | 5.10 | 6.70 | 5.06 | 85-89 | 66-76 | 90-94 | 98-100 | 91-92 | 97-99 | 81-86 |
|
||||
| Picture | 45976 | 4.21 | 2.78 | 5.31 | 69-71 | 56-59 | 82-86 | 69-82 | 80-95 | 66-71 | 59-76 |
|
||||
| Section-header | 142884 | 12.60 | 15.77 | 12.85 | 83-84 | 76-81 | 90-92 | 94-95 | 87-94 | 69-73 | 78-86 |
|
||||
| Table | 34733 | 3.20 | 2.27 | 3.60 | 77-81 | 75-80 | 83-86 | 98-99 | 58-80 | 79-84 | 70-85 |
|
||||
| Text | 510377 | 45.82 | 49.28 | 45.00 | 84-86 | 81-86 | 88-93 | 89-93 | 87-92 | 71-79 | 87-95 |
|
||||
| Title | 5071 | 0.47 | 0.30 | 0.50 | 60-72 | 24-63 | 50-63 | 94-100 | 82-96 | 68-79 | 24-56 |
|
||||
| Total | 1107470 | 941123 | 99816 | 66531 | 82-83 | 71-74 | 79-81 | 89-94 | 86-91 | 71-76 | 68-85 |
|
||||
|
||||
Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.
|
||||
<!-- image -->
|
||||
|
||||
we distributed the annotation workload and performed continuous quality controls. Phase one and two required a small team of experts only. For phases three and four, a group of 40 dedicated annotators were assembled and supervised.
|
||||
|
||||
Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. A large effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv 3 , government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.
|
||||
Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. A large effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv 3 , government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.
|
||||
|
||||
Preparation work included uploading and parsing the sourced PDF documents in the Corpus Conversion Service (CCS) [22], a cloud-native platform which provides a visual annotation interface and allows for dataset inspection and analysis. The annotation interface of CCS is shown in Figure 3. The desired balance of pages between the different document categories was achieved by selective subsampling of pages with certain desired properties. For example, we made sure to include the title page of each document and bias the remaining page selection to those with figures or tables. The latter was achieved by leveraging pre-trained object detection models from PubLayNet, which helped us estimate how many figures and tables a given page contains.
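A rough sketch of the biased page selection described above: pages that a pre-trained detector estimates to contain figures or tables receive a higher sampling weight. The field names and the weighting scheme are assumptions made for illustration only.

```python
# Sketch of biased page subsampling. Each page dict is assumed to carry
# hypothetical 'n_figures'/'n_tables' counts estimated by a pre-trained
# detector (e.g. a PubLayNet model); the +1 base weight is an assumption.
import random

def sample_pages(pages, n, seed=0):
    rng = random.Random(seed)
    pool = list(pages)
    chosen = []
    while pool and len(chosen) < n:
        weights = [1 + p["n_figures"] + p["n_tables"] for p in pool]
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(idx))   # sample without replacement
    return chosen
```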
|
||||
|
||||
Phase 2: Label selection and guideline. We reviewed the collected documents and identified the most common structural features they exhibit. This was achieved by identifying recurrent layout elements and lead us to the definition of 11 distinct class labels. These 11 class labels are Caption , Footnote , Formula List-item , , Pagefooter , Page-header , Picture , Section-header , Table , Text , and Title . Critical factors that were considered for the choice of these class labels were (1) the overall occurrence of the label, (2) the specificity of the label, (3) recognisability on a single page (i.e. no need for context from previous or next page) and (4) overall coverage of the page. Specificity ensures that the choice of label is not ambiguous, while coverage ensures that all meaningful items on a page can be annotated. We refrained from class labels that are very specific to a document category, such as Abstract in the Scientific Articles category. We also avoided class labels that are tightly linked to the semantics of the text. Labels such as Author and Affiliation , as seen in DocBank, are often only distinguishable by discriminating on
|
||||
Phase 2: Label selection and guideline. We reviewed the collected documents and identified the most common structural features they exhibit. This was achieved by identifying recurrent layout elements and led us to the definition of 11 distinct class labels. These 11 class labels are Caption , Footnote , Formula , List-item , Page-footer , Page-header , Picture , Section-header , Table , Text , and Title . Critical factors that were considered for the choice of these class labels were (1) the overall occurrence of the label, (2) the specificity of the label, (3) recognisability on a single page (i.e. no need for context from previous or next page) and (4) overall coverage of the page. Specificity ensures that the choice of label is not ambiguous, while coverage ensures that all meaningful items on a page can be annotated. We refrained from class labels that are very specific to a document category, such as Abstract in the Scientific Articles category. We also avoided class labels that are tightly linked to the semantics of the text. Labels such as Author and Affiliation , as seen in DocBank, are often only distinguishable by discriminating on
|
||||
|
||||
the textual content of an element, which goes beyond visual layout recognition, in particular outside the Scientific Articles category.
|
||||
|
||||
@ -155,9 +157,9 @@ Figure 4: Examples of plausible annotation alternatives for the same page. Crite
|
||||
|
||||
were carried out over a timeframe of 12 weeks, after which 8 of the 40 initially allocated annotators did not pass the bar.
|
||||
|
||||
Phase 4: Production annotation. The previously selected 80K pages were annotated with the defined 11 class labels by 32 annotators. This production phase took around three months to complete. All annotations were created online through CCS, which visualises the programmatic PDF text-cells as an overlay on the page. The page annotation are obtained by drawing rectangular bounding-boxes, as shown in Figure 3. With regard to the annotation practices, we implemented a few constraints and capabilities on the tooling level. First, we only allow non-overlapping, vertically oriented, rectangular boxes. For the large majority of documents, this constraint was sufficient and it speeds up the annotation considerably in comparison with arbitrary segmentation shapes. Second, annotator staff were not able to see each other's annotations. This was enforced by design to avoid any bias in the annotation, which could skew the numbers of the inter-annotator agreement (see Table 1). We wanted
|
||||
Phase 4: Production annotation. The previously selected 80K pages were annotated with the defined 11 class labels by 32 annotators. This production phase took around three months to complete. All annotations were created online through CCS, which visualises the programmatic PDF text-cells as an overlay on the page. The page annotations are obtained by drawing rectangular bounding-boxes, as shown in Figure 3. With regard to the annotation practices, we implemented a few constraints and capabilities on the tooling level. First, we only allow non-overlapping, vertically oriented, rectangular boxes. For the large majority of documents, this constraint was sufficient and it speeds up the annotation considerably in comparison with arbitrary segmentation shapes. Second, annotator staff were not able to see each other's annotations. This was enforced by design to avoid any bias in the annotation, which could skew the numbers of the inter-annotator agreement (see Table 1). We wanted
|
||||
|
||||
Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on DocLayNet test set. The MRCNN (Mask R-CNN) and FRCNN (Faster R-CNN) models with ResNet-50 or ResNet-101 backbone were trained based on the network architectures from the detectron2 model zoo (Mask R-CNN R50, R101-FPN 3x, Faster R-CNN R101-FPN 3x), with default configurations. The YOLO implementation utilized was YOLOv5x6 [13]. All models were initialised using pre-trained weights from the COCO 2017 dataset.
|
||||
Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on DocLayNet test set. The MRCNN (Mask R-CNN) and FRCNN (Faster R-CNN) models with ResNet-50 or ResNet-101 backbone were trained based on the network architectures from the detectron2 model zoo (Mask R-CNN R50, R101-FPN 3x, Faster R-CNN R101-FPN 3x), with default configurations. The YOLO implementation utilized was YOLOv5x6 [13]. All models were initialised using pre-trained weights from the COCO 2017 dataset.
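A minimal sketch of how such a baseline is typically set up with detectron2 (Mask R-CNN R50-FPN 3x from the model zoo, COCO-pretrained weights, otherwise default configuration). The dataset names are placeholders, and registering them is assumed to happen elsewhere.

```python
# Sketch: Mask R-CNN R50-FPN 3x baseline via the detectron2 model zoo,
# initialised from COCO-pretrained weights. "doclaynet_train"/"doclaynet_test"
# are placeholder dataset names assumed to be registered elsewhere.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
)
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
)
cfg.DATASETS.TRAIN = ("doclaynet_train",)
cfg.DATASETS.TEST = ("doclaynet_test",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 11   # the 11 DocLayNet class labels

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```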
|
||||
|
||||
| | human | MRCNN | MRCNN | FRCNN | YOLO |
|
||||
|----------------|---------|---------|---------|---------|--------|
|
||||
@ -179,9 +181,9 @@ to avoid this at any cost in order to have clear, unbiased baseline numbers for
|
||||
|
||||
## 5 EXPERIMENTS
|
||||
|
||||
The primary goal of DocLayNet is to obtain high-quality ML models capable of accurate document-layout analysis on a wide variety of challenging layouts. As discussed in Section 2, object detection models are currently the easiest to use, due to the standardisation of ground-truth data in COCO format [16] and the availability of general frameworks such as detectron2 [17]. Furthermore, baseline numbers in PubLayNet and DocBank were obtained using standard object detection models such as Mask R-CNN and Faster R-CNN. As such, we will relate to these object detection methods in this
|
||||
The primary goal of DocLayNet is to obtain high-quality ML models capable of accurate document-layout analysis on a wide variety of challenging layouts. As discussed in Section 2, object detection models are currently the easiest to use, due to the standardisation of ground-truth data in COCO format [16] and the availability of general frameworks such as detectron2 [17]. Furthermore, baseline numbers in PubLayNet and DocBank were obtained using standard object detection models such as Mask R-CNN and Faster R-CNN. As such, we will relate to these object detection methods in this
|
||||
|
||||
Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask R-CNNnetworkwithResNet50backbonetrainedonincreasing fractions of the DocLayNet dataset. The learning curve flattens around the 80% mark, indicating that increasing the size of the DocLayNet dataset with similar data will not yield significantly better predictions.
|
||||
Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask R-CNN network with ResNet50 backbone trained on increasing fractions of the DocLayNet dataset. The learning curve flattens around the 80% mark, indicating that increasing the size of the DocLayNet dataset with similar data will not yield significantly better predictions.
|
||||
<!-- image -->
|
||||
|
||||
paper and leave the detailed evaluation of more recent methods mentioned in Section 2 for future work.
|
||||
@ -239,13 +241,13 @@ lists in PubLayNet (grouped list-items) versus DocLayNet (separate list-items),
|
||||
|
||||
## Impact of Document Split in Train and Test Set
|
||||
|
||||
Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: pagewise splitting gains ˜ 0% in mAP over the document-wise splitting. 1 Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.
|
||||
Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: page-wise splitting gains ~10% in mAP over the document-wise splitting. Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.
|
||||
|
||||
## Dataset Comparison
|
||||
|
||||
Throughout this paper, we claim that DocLayNet's wider variety of document layouts leads to more robust layout detection models. In Table 5, we provide evidence for that. We trained models on each of the available datasets (PubLayNet, DocBank and DocLayNet) and evaluated them on the test sets of the other datasets. Due to the different label sets and annotation styles, a direct comparison is not possible. Hence, we focussed on the common labels among the datasets. Between PubLayNet and DocLayNet, these are Picture ,
|
||||
|
||||
Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network across the PubLayNet, DocBank & DocLayNet data-sets. By evaluating on common label classes of each dataset, we observe that the DocLayNet-trained model has much less pronounced variations in performance across all datasets.
|
||||
Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network across the PubLayNet, DocBank & DocLayNet data-sets. By evaluating on common label classes of each dataset, we observe that the DocLayNet-trained model has much less pronounced variations in performance across all datasets.
|
||||
|
||||
| | | Testing on | Testing on | Testing on |
|
||||
|-----------------|------------|--------------|--------------|--------------|
|
||||
@ -266,7 +268,7 @@ Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network acros
|
||||
|
||||
Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in Table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .
|
||||
|
||||
For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. We had to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.
|
||||
For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. We had to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.
|
||||
|
||||
## Example Predictions
|
||||
|
||||
@ -288,7 +290,7 @@ To date, there is still a significant gap between human and ML accuracy on the l
|
||||
|
||||
- [3] Hervé Déjean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), April 2019. http://sac.founderit.com/.
|
||||
|
||||
- [4] Antonio Jimeno Yepes, Peter Zhong, and Douglas Burdick. Competition on scientific literature parsing. In Proceedings of the International Conference on Document Analysis and Recognition , ICDAR, pages 605-617. LNCS 12824, SpringerVerlag, sep 2021.
|
||||
- [4] Antonio Jimeno Yepes, Peter Zhong, and Douglas Burdick. Competition on scientific literature parsing. In Proceedings of the International Conference on Document Analysis and Recognition , ICDAR, pages 605-617. LNCS 12824, Springer-Verlag, sep 2021.
|
||||
|
||||
- [5] Logan Markewich, Hao Zhang, Yubin Xing, Navid Lambert-Shirzad, Jiang Zhexin, Roy Lee, Zhi Li, and Seok-Bum Ko. Segmentation for document layout analysis: not dead yet. International Journal on Document Analysis and Recognition (IJDAR) , pages 1-11, 01 2022.
|
||||
|
||||
|
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
@ -1,4 +1,5 @@
|
||||
<document>
|
||||
<paragraph><location><page_1><loc_3><loc_74><loc_6><loc_79></location>2023</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_22><loc_82><loc_79><loc_85></location>Optimized Table Tokenization for Table Structure Recognition</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_23><loc_75><loc_78><loc_79></location>Maksym Lysak [0000 - 0002 - 3723 - 6960] , Ahmed Nassar [0000 - 0002 - 9468 - 0822] , Nikolaos Livathinos [0000 - 0001 - 8513 - 3491] , Christoph Auer [0000 - 0001 - 5761 - 0422] , [0000 - 0002 - 8088 - 0823]</paragraph>
|
||||
<paragraph><location><page_1><loc_38><loc_74><loc_49><loc_75></location>and Peter Staar</paragraph>
|
||||
@ -21,7 +22,7 @@
|
||||
<subtitle-level-1><location><page_3><loc_22><loc_40><loc_39><loc_42></location>2 Related Work</subtitle-level-1>
|
||||
<paragraph><location><page_3><loc_22><loc_16><loc_79><loc_38></location>Approaches to formalize the logical structure and layout of tables in electronic documents date back more than two decades [16]. In the recent past, a wide variety of computer vision methods have been explored to tackle the problem of table structure recognition, i.e. the correct identification of columns, rows and spanning cells in a given table. Broadly speaking, the current deep-learning based approaches fall into three categories: object detection (OD) methods, Graph-Neural-Network (GNN) methods and Image-to-Markup-Sequence (Im2Seq) methods. Object-detection based methods [11,12,13,14,21] rely on table-structure annotation using (overlapping) bounding boxes for training, and produce bounding-box predictions to define table cells, rows, and columns on a table image. Graph Neural Network (GNN) based methods [3,6,17,18], as the name suggests, represent tables as graph structures. The graph nodes represent the content of each table cell, an embedding vector from the table image, or geometric coordinates of the table cell. The edges of the graph define the relationship between the nodes, e.g. if they belong to the same column, row, or table cell.</paragraph>
|
||||
<paragraph><location><page_4><loc_22><loc_67><loc_79><loc_85></location>Other work [20] aims at predicting a grid for each table and deciding which cells must be merged using an attention network. Im2Seq methods cast the problem as a sequence generation task [4,5,9,22], and therefore need an internal tablestructure representation language, which is often implemented with standard markup languages (e.g. HTML, LaTeX, Markdown). In theory, Im2Seq methods have a natural advantage over the OD and GNN methods by virtue of directly predicting the table-structure. As such, no post-processing or rules are needed in order to obtain the table-structure, which is necessary with OD and GNN approaches. In practice, this is not entirely true, because a predicted sequence of table-structure markup does not necessarily have to be syntactically correct. Hence, depending on the quality of the predicted sequence, some post-processing needs to be performed to ensure a syntactically valid (let alone correct) sequence.</paragraph>
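Since a predicted markup sequence is not guaranteed to be well-formed, a sanity check of the kind hinted at above might look as follows; this is a generic illustration, not code from any of the cited systems.

```python
# Minimal sketch of post-processing a predicted HTML table-tag sequence:
# verify that opening and closing tags are balanced before rendering or
# scoring the structure. Purely illustrative.
OPENING = {"<table>": "</table>", "<tr>": "</tr>", "<td>": "</td>"}

def is_balanced(tokens):
    """tokens: predicted tag sequence, e.g. ["<table>", "<tr>", "<td>", "</td>", ...]."""
    stack = []
    for tok in tokens:
        if tok in OPENING:
            stack.append(OPENING[tok])          # remember the required closing tag
        elif tok in OPENING.values():
            if not stack or stack.pop() != tok:
                return False                    # closing tag without matching opener
        # other tokens (cell content, span attributes, ...) are ignored here
    return not stack                            # every opened tag must be closed
```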
|
||||
<paragraph><location><page_4><loc_22><loc_39><loc_79><loc_67></location>Within the Im2Seq method, we find several popular models, namely the encoder-dual-decoder model (EDD) [22], TableFormer [9], Tabsplitter[2] and Ye et. al. [19]. EDD uses two consecutive long short-term memory (LSTM) decoders to predict a table in HTML representation. The tag decoder predicts a sequence of HTML tags. For each decoded table cell ( <td> ), the attention is passed to the cell decoder to predict the content with an embedded OCR approach. The latter makes it susceptible to transcription errors in the cell content of the table. TableFormer address this reliance on OCR and uses two transformer decoders for HTML structure and cell bounding box prediction in an end-to-end architecture. The predicted cell bounding box is then used to extract text tokens from an originating (digital) PDF page, circumventing any need for OCR. TabSplitter [2] proposes a compact double-matrix representation of table rows and columns to do error detection and error correction of HTML structure sequences based on predictions from [19]. This compact double-matrix representation can not be used directly by the Img2seq model training, so the model uses HTML as an intermediate form. Chi et. al. [4] introduce a data set and a baseline method using bidirectional LSTMs to predict LaTeX code. Kayal [5] introduces Gated ResNet transformers to predict LaTeX code, and a separate OCR module to extract content.</paragraph>
|
||||
<paragraph><location><page_4><loc_22><loc_39><loc_79><loc_67></location>Within the Im2Seq method, we find several popular models, namely the encoder-dual-decoder model (EDD) [22], TableFormer [9], Tabsplitter[2] and Ye et. al. [19]. EDD uses two consecutive long short-term memory (LSTM) decoders to predict a table in HTML representation. The tag decoder predicts a sequence of HTML tags. For each decoded table cell ( <td> ), the attention is passed to the cell decoder to predict the content with an embedded OCR approach. The latter makes it susceptible to transcription errors in the cell content of the table. TableFormer address this reliance on OCR and uses two transformer decoders for HTML structure and cell bounding box prediction in an end-to-end architecture. The predicted cell bounding box is then used to extract text tokens from an originating (digital) PDF page, circumventing any need for OCR. TabSplitter [2] proposes a compact double-matrix representation of table rows and columns to do error detection and error correction of HTML structure sequences based on predictions from [19]. This compact double-matrix representation can not be used directly by the Img2seq model training, so the model uses HTML as an intermediate form. Chi et. al. [4] introduce a data set and a baseline method using bidirectional LSTMs to predict LaTeX code. Kayal [5] introduces Gated ResNet transformers to predict LaTeX code, and a separate OCR module to extract content.</paragraph>
|
||||
<paragraph><location><page_4><loc_22><loc_26><loc_79><loc_38></location>Im2Seq approaches have shown to be well-suited for the TSR task and allow a full end-to-end network design that can output the final table structure without pre- or post-processing logic. Furthermore, Im2Seq models have demonstrated to deliver state-of-the-art prediction accuracy [9]. This motivated the authors to investigate if the performance (both in accuracy and inference time) can be further improved by optimising the table structure representation language. We believe this is a necessary step before further improving neural network architectures for this task.</paragraph>
|
||||
<subtitle-level-1><location><page_4><loc_22><loc_22><loc_44><loc_24></location>3 Problem Statement</subtitle-level-1>
|
||||
<paragraph><location><page_4><loc_22><loc_16><loc_79><loc_20></location>All known Im2Seq based models for TSR fundamentally work in similar ways. Given an image of a table, the Im2Seq model predicts the structure of the table by generating a sequence of tokens. These tokens originate from a finite vocab-</paragraph>
|
||||
|
File diff suppressed because one or more lines are too long
@ -1,3 +1,5 @@
|
||||
2023
|
||||
|
||||
## Optimized Table Tokenization for Table Structure Recognition
|
||||
|
||||
Maksym Lysak [0000 - 0002 - 3723 - 6960] , Ahmed Nassar [0000 - 0002 - 9468 - 0822] , Nikolaos Livathinos [0000 - 0001 - 8513 - 3491] , Christoph Auer [0000 - 0001 - 5761 - 0422] , [0000 - 0002 - 8088 - 0823]
|
||||
@ -37,7 +39,7 @@ Approaches to formalize the logical structure and layout of tables in electronic
|
||||
|
||||
Other work [20] aims at predicting a grid for each table and deciding which cells must be merged using an attention network. Im2Seq methods cast the problem as a sequence generation task [4,5,9,22], and therefore need an internal tablestructure representation language, which is often implemented with standard markup languages (e.g. HTML, LaTeX, Markdown). In theory, Im2Seq methods have a natural advantage over the OD and GNN methods by virtue of directly predicting the table-structure. As such, no post-processing or rules are needed in order to obtain the table-structure, which is necessary with OD and GNN approaches. In practice, this is not entirely true, because a predicted sequence of table-structure markup does not necessarily have to be syntactically correct. Hence, depending on the quality of the predicted sequence, some post-processing needs to be performed to ensure a syntactically valid (let alone correct) sequence.
|
||||
|
||||
Within the Im2Seq method, we find several popular models, namely the encoder-dual-decoder model (EDD) [22], TableFormer [9], Tabsplitter[2] and Ye et. al. [19]. EDD uses two consecutive long short-term memory (LSTM) decoders to predict a table in HTML representation. The tag decoder predicts a sequence of HTML tags. For each decoded table cell ( <td> ), the attention is passed to the cell decoder to predict the content with an embedded OCR approach. The latter makes it susceptible to transcription errors in the cell content of the table. TableFormer address this reliance on OCR and uses two transformer decoders for HTML structure and cell bounding box prediction in an end-to-end architecture. The predicted cell bounding box is then used to extract text tokens from an originating (digital) PDF page, circumventing any need for OCR. TabSplitter [2] proposes a compact double-matrix representation of table rows and columns to do error detection and error correction of HTML structure sequences based on predictions from [19]. This compact double-matrix representation can not be used directly by the Img2seq model training, so the model uses HTML as an intermediate form. Chi et. al. [4] introduce a data set and a baseline method using bidirectional LSTMs to predict LaTeX code. Kayal [5] introduces Gated ResNet transformers to predict LaTeX code, and a separate OCR module to extract content.
|
||||
Within the Im2Seq method, we find several popular models, namely the encoder-dual-decoder model (EDD) [22], TableFormer [9], Tabsplitter[2] and Ye et. al. [19]. EDD uses two consecutive long short-term memory (LSTM) decoders to predict a table in HTML representation. The tag decoder predicts a sequence of HTML tags. For each decoded table cell ( <td> ), the attention is passed to the cell decoder to predict the content with an embedded OCR approach. The latter makes it susceptible to transcription errors in the cell content of the table. TableFormer address this reliance on OCR and uses two transformer decoders for HTML structure and cell bounding box prediction in an end-to-end architecture. The predicted cell bounding box is then used to extract text tokens from an originating (digital) PDF page, circumventing any need for OCR. TabSplitter [2] proposes a compact double-matrix representation of table rows and columns to do error detection and error correction of HTML structure sequences based on predictions from [19]. This compact double-matrix representation can not be used directly by the Img2seq model training, so the model uses HTML as an intermediate form. Chi et. al. [4] introduce a data set and a baseline method using bidirectional LSTMs to predict LaTeX code. Kayal [5] introduces Gated ResNet transformers to predict LaTeX code, and a separate OCR module to extract content.
|
||||
|
||||
Im2Seq approaches have shown to be well-suited for the TSR task and allow a full end-to-end network design that can output the final table structure without pre- or post-processing logic. Furthermore, Im2Seq models have demonstrated to deliver state-of-the-art prediction accuracy [9]. This motivated the authors to investigate if the performance (both in accuracy and inference time) can be further improved by optimising the table structure representation language. We believe this is a necessary step before further improving neural network architectures for this task.
|
||||
|
||||
|
File diff suppressed because one or more lines are too long
@ -1,21 +1,21 @@
|
||||
<document>
|
||||
<paragraph><location><page_1><loc_12><loc_88><loc_52><loc_94></location>pulleys, provided the inner race of the bearing is clamped to the supporting structure by the nut and bolt. Plates must be attached to the structure in a positive manner to eliminate rotation or misalignment when tightening the bolts or screws.</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_77><loc_52><loc_86></location>The two general types of self-locking nuts currently in use are the all-metal type and the fiber lock type. For the sake of simplicity, only three typical kinds of self-locking nuts are considered in this handbook: the Boots self-locking and the stainless steel self-locking nuts, representing the all-metal types; and the elastic stop nut, representing the fiber insert type.</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_77><loc_52><loc_86></location>The two general types of self-locking nuts currently in use are the all-metal type and the fiber lock type. For the sake of simplicity, only three typical kinds of self-locking nuts are considered in this handbook: the Boots self-locking and the stainless steel self-locking nuts, representing the all-metal types; and the elastic stop nut, representing the fiber insert type.</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_12><loc_74><loc_28><loc_75></location>Boots Self-Locking Nut</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_12><loc_64><loc_52><loc_73></location>nut is of one piece, all-metal The Boots self-locking construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_64><loc_52><loc_73></location>The Boots self-locking nut is of one piece, all-metal construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_52><loc_52><loc_61></location>The spring keeps the locking and load-carrying sections such a distance apart that the two sets of threads are out of phase or spaced so that a bolt, which has been screwed through the load-carrying section, must push the locking section outward against the force of the spring to engage the threads of the locking section properly.</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_38><loc_52><loc_50></location>The spring, through the medium of the locking section, exerts a constant locking force on the bolt in the same direction as a force that would tighten the nut. In this nut, the load-carrying section has the thread strength of a standard nut of comparable size, while the locking section presses against the threads of the bolt and locks the nut firmly in position. Only a wrench applied to the nut loosens it. The nut can be removed and reused without impairing its efficiency.</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_38><loc_52><loc_50></location>The spring, through the medium of the locking section, exerts a constant locking force on the bolt in the same direction as a force that would tighten the nut. In this nut, the load-carrying section has the thread strength of a standard nut of comparable size, while the locking section presses against the threads of the bolt and locks the nut firmly in position. Only a wrench applied to the nut loosens it. The nut can be removed and reused without impairing its efficiency.</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_33><loc_52><loc_36></location>Boots self-locking nuts are made with three different spring styles and in various shapes and sizes. The wing type that is</paragraph>
|
||||
<figure>
|
||||
<location><page_1><loc_12><loc_10><loc_52><loc_31></location>
|
||||
<caption>Figure 7-26. Self-locking nuts.</caption>
|
||||
</figure>
|
||||
<paragraph><location><page_1><loc_54><loc_85><loc_94><loc_94></location>the most common ranges in size for No. 6 up to 1 4 inch, the / Rol-top ranges from 1 4 inch to / 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.</paragraph>
|
||||
<paragraph><location><page_1><loc_54><loc_83><loc_55><loc_84></location>.</paragraph>
|
||||
<paragraph><location><page_1><loc_54><loc_85><loc_94><loc_94></location>the most common ranges in size for No. 6 up to 1 / 4 inch, the Rol-top ranges from 1 / 4 inch to 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.</paragraph>
|
||||
<paragraph><location><page_1><loc_54><loc_83><loc_54><loc_84></location>.</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_54><loc_82><loc_76><loc_83></location>Stainless Steel Self-Locking Nut</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_54><loc_54><loc_94><loc_81></location>The stainless steel self-locking nut may be spun on and off by hand as its locking action takes places only when the nut is seated against a solid surface and tightened. The nut consists of two parts: a case with a beveled locking shoulder and key and a thread insert with a locking shoulder and slotted keyway. Until the nut is tightened, it spins on the bolt easily, because the threaded insert is the proper size for the bolt. However, when the nut is seated against a solid surface and tightened, the locking shoulder of the insert is pulled downward and wedged against the locking shoulder of the case. This action compresses the threaded insert and causes it to clench the bolt tightly. The cross-sectional view in Figure 7-27 shows how the key of the case fits into the slotted keyway of the insert so that when the case is turned, the threaded insert is turned with it. Note that the slot is wider than the key. This permits the slot to be narrowed and the insert to be compressed when the nut is tightened.</paragraph>
<paragraph><location><page_1><loc_54><loc_54><loc_94><loc_81></location>The stainless steel self-locking nut may be spun on and off by hand as its locking action takes places only when the nut is seated against a solid surface and tightened. The nut consists of two parts: a case with a beveled locking shoulder and key and a thread insert with a locking shoulder and slotted keyway. Until the nut is tightened, it spins on the bolt easily, because the threaded insert is the proper size for the bolt. However, when the nut is seated against a solid surface and tightened, the locking shoulder of the insert is pulled downward and wedged against the locking shoulder of the case. This action compresses the threaded insert and causes it to clench the bolt tightly. The cross-sectional view in Figure 7-27 shows how the key of the case fits into the slotted keyway of the insert so that when the case is turned, the threaded insert is turned with it. Note that the slot is wider than the key. This permits the slot to be narrowed and the insert to be compressed when the nut is tightened.</paragraph>
<subtitle-level-1><location><page_1><loc_54><loc_51><loc_65><loc_52></location>Elastic Stop Nut</subtitle-level-1>
<paragraph><location><page_1><loc_54><loc_47><loc_94><loc_50></location>The elastic stop nut is a standard nut with the height increased to accommodate a fiber locking collar. This</paragraph>
<paragraph><location><page_1><loc_54><loc_47><loc_94><loc_50></location>The elastic stop nut is a standard nut with the height increased to accommodate a fiber locking collar. This</paragraph>
<figure>
<location><page_1><loc_54><loc_11><loc_94><loc_46></location>
<caption>Figure 7-27. Stainless steel self-locking nut.</caption>
File diff suppressed because one or more lines are too long
@@ -1,31 +1,31 @@
pulleys, provided the inner race of the bearing is clamped to the supporting structure by the nut and bolt. Plates must be attached to the structure in a positive manner to eliminate rotation or misalignment when tightening the bolts or screws.
The two general types of self-locking nuts currently in use are the all-metal type and the fiber lock type. For the sake of simplicity, only three typical kinds of self-locking nuts are considered in this handbook: the Boots self-locking and the stainless steel self-locking nuts, representing the all-metal types; and the elastic stop nut, representing the fiber insert type.
The two general types of self-locking nuts currently in use are the all-metal type and the fiber lock type. For the sake of simplicity, only three typical kinds of self-locking nuts are considered in this handbook: the Boots self-locking and the stainless steel self-locking nuts, representing the all-metal types; andthe elastic stop nut, representing the fiber insert type.
## Boots Self-Locking Nut
nut is of one piece, all-metal The Boots self-locking construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.
The Boots self-locking nut is of one piece, all-metal construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.
The spring keeps the locking and load-carrying sections such a distance apart that the two sets of threads are out of phase or spaced so that a bolt, which has been screwed through the load-carrying section, must push the locking section outward against the force of the spring to engage the threads of the locking section properly.
The spring, through the medium of the locking section, exerts a constant locking force on the bolt in the same direction as a force that would tighten the nut. In this nut, the load-carrying section has the thread strength of a standard nut of comparable size, while the locking section presses against the threads of the bolt and locks the nut firmly in position. Only a wrench applied to the nut loosens it. The nut can be removed and reused without impairing its efficiency.
The spring, through the mediumofthe locking section, exerts a constant locking force on the bolt in the same direction as a force that would tighten the nut. In this nut, the load-carrying section has the thread strength of a standard nut of comparable size, while the locking section presses against the threads of the bolt and locks the nut firmly in position. Only a wrench applied to the nut loosens it. The nut can be removed and reused without impairing its efficiency.
Boots self-locking nuts are made with three different spring styles and in various shapes and sizes. The wing type that is
Figure 7-26. Self-locking nuts.
<!-- image -->
the most common ranges in size for No. 6 up to 1 4 inch, the / Rol-top ranges from 1 4 inch to / 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.
the most common ranges in size for No. 6 up to 1 / 4 inch, the Rol-top ranges from 1 / 4 inch to 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.
.
## Stainless Steel Self-Locking Nut
The stainless steel self-locking nut may be spun on and off by hand as its locking action takes places only when the nut is seated against a solid surface and tightened. The nut consists of two parts: a case with a beveled locking shoulder and key and a thread insert with a locking shoulder and slotted keyway. Until the nut is tightened, it spins on the bolt easily, because the threaded insert is the proper size for the bolt. However, when the nut is seated against a solid surface and tightened, the locking shoulder of the insert is pulled downward and wedged against the locking shoulder of the case. This action compresses the threaded insert and causes it to clench the bolt tightly. The cross-sectional view in Figure 7-27 shows how the key of the case fits into the slotted keyway of the insert so that when the case is turned, the threaded insert is turned with it. Note that the slot is wider than the key. This permits the slot to be narrowed and the insert to be compressed when the nut is tightened.
The stainless steel self-locking nut may be spun on and off by hand as its locking action takes places only when the nut is seated against a solid surface and tightened. The nut consists of two parts: a case with a beveled locking shoulder and key and a thread insert with a locking shoulder and slotted keyway. Until the nut is tightened, it spins on the bolt easily, because the threaded insert is the proper size for the bolt. However, when the nut is seated against a solid surface and tightened, the locking shoulder of the insert is pulled downward and wedged against the locking shoulder of the case. This action compresses the threaded insert and causes it to clench the bolt tightly. The cross-sectional view in Figure 7-27 shows how the key of the case fits into the slotted keyway of the insert so that when the case is turned, the threaded insert is turned with it. Note that the slot is wider than the key. This permits the slot to be narrowed and the insert to be compressed when the nut is tightened.
## Elastic Stop Nut
The elastic stop nut is a standard nut with the height increased to accommodate a fiber locking collar. This
The elastic stop nut is a standard nut with the height increased to accommodate a fiber locking collar. This
Figure 7-27. Stainless steel self-locking nut.
<!-- image -->
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
@@ -3,7 +3,7 @@
|
||||
<figure>
|
||||
<location><page_1><loc_84><loc_93><loc_96><loc_97></location>
|
||||
</figure>
|
||||
<subtitle-level-1><location><page_1><loc_6><loc_79><loc_96><loc_89></location>Row and Column Access Control Support in IBM DB2 for i</subtitle-level-1>
|
||||
<subtitle-level-1><location><page_1><loc_6><loc_79><loc_94><loc_89></location>RowandColumnAccessControl Support in IBM DB2 for i</subtitle-level-1>
|
||||
<figure>
|
||||
<location><page_1><loc_5><loc_11><loc_96><loc_63></location>
|
||||
</figure>
|
||||
@@ -16,19 +16,19 @@
|
||||
<figure>
|
||||
<location><page_3><loc_23><loc_64><loc_29><loc_66></location>
|
||||
</figure>
|
||||
<subtitle-level-1><location><page_3><loc_24><loc_57><loc_31><loc_59></location>Highlights</subtitle-level-1>
|
||||
<subtitle-level-1><location><page_3><loc_24><loc_57><loc_30><loc_59></location>Highlights</subtitle-level-1>
|
||||
<paragraph><location><page_3><loc_24><loc_55><loc_40><loc_57></location>- /g115/g3 /g40/g81/g75/g68/g81/g70/g72/g3 /g87/g75/g72/g3 /g83/g72/g85/g73/g82/g85/g80/g68/g81/g70/g72/g3 /g82/g73/g3 /g92/g82/g88/g85/g3 /g71/g68/g87/g68/g69/g68/g86/g72/g3 /g82/g83/g72/g85/g68/g87/g76/g82/g81/g86</paragraph>
|
||||
<paragraph><location><page_3><loc_24><loc_51><loc_42><loc_54></location>- /g115/g3 /g40/g68/g85/g81/g3 /g74/g85/g72/g68/g87/g72/g85/g3 /g85 /g72/g87/g88/g85/g81/g3 /g82/g81/g3 /g44/g55/g3 /g83/g85 /g82/g77/g72/g70/g87/g86/g3 /g87/g75/g85 /g82/g88/g74/g75/g3 /g80/g82/g71/g72/g85/g81/g76/g93/g68/g87/g76/g82/g81/g3 /g82/g73/g3 /g71/g68/g87/g68/g69/g68/g86/g72/g3 /g68/g81/g71/g3 /g68/g83/g83/g79/g76/g70/g68/g87/g76/g82/g81/g86</paragraph>
|
||||
<paragraph><location><page_3><loc_24><loc_51><loc_42><loc_54></location>- /g115/g3 /g40/g68/g85/g81/g3 /g74/g85/g72/g68/g87/g72/g85/g3 /g85 /g72/g87/g88/g85/g81/g3 /g82/g81/g3 /g44/g55/g3 /g83/g85/g82/g77/g72/g70/g87/g86/g3 /g87/g75/g85 /g82/g88/g74/g75/g3 /g80/g82/g71/g72/g85/g81/g76/g93/g68/g87/g76/g82/g81/g3 /g82/g73/g3 /g71/g68/g87/g68/g69/g68/g86/g72/g3 /g68/g81/g71/g3 /g68/g83/g83/g79/g76/g70/g68/g87/g76/g82/g81/g86</paragraph>
|
||||
<paragraph><location><page_3><loc_24><loc_48><loc_41><loc_50></location>- /g115/g3 /g53/g72/g79/g92/g3 /g82/g81/g3 /g44/g37/g48/g3 /g72/g91/g83/g72/g85/g87/g3 /g70/g82/g81/g86/g88/g79/g87/g76/g81/g74/g15/g3 /g86/g78/g76/g79/g79/g86/g3 /g86/g75/g68/g85/g76/g81/g74/g3 /g68/g81/g71/g3 /g85/g72/g81/g82/g90/g81/g3 /g86/g72/g85/g89/g76/g70/g72/g86</paragraph>
|
||||
<paragraph><location><page_3><loc_24><loc_45><loc_38><loc_47></location>- /g115/g3 /g55 /g68/g78/g72/g3 /g68/g71/g89/g68/g81/g87/g68/g74/g72/g3 /g82/g73/g3 /g68/g70/g70/g72/g86/g86/g3 /g87/g82/g3 /g68/g3 /g90/g82/g85/g79/g71/g90/g76/g71/g72/g3 /g86/g82/g88/g85/g70/g72/g3 /g82/g73/g3 /g72/g91/g83/g72/g85/g87/g76/g86/g72</paragraph>
|
||||
<figure>
|
||||
<location><page_3><loc_10><loc_13><loc_42><loc_24></location>
|
||||
</figure>
|
||||
<paragraph><location><page_3><loc_75><loc_82><loc_83><loc_83></location>Power Services</paragraph>
|
||||
<subtitle-level-1><location><page_3><loc_46><loc_65><loc_76><loc_71></location>DB2 for i Center of Excellence</subtitle-level-1>
|
||||
<subtitle-level-1><location><page_3><loc_46><loc_65><loc_75><loc_71></location>DB2 for i Center of Excellence</subtitle-level-1>
|
||||
<paragraph><location><page_3><loc_46><loc_64><loc_79><loc_65></location>Expert help to achieve your business requirements</paragraph>
|
||||
<subtitle-level-1><location><page_3><loc_46><loc_59><loc_72><loc_60></location>We build confident, satisfied clients</subtitle-level-1>
|
||||
<paragraph><location><page_3><loc_46><loc_56><loc_80><loc_59></location>No one else has the vast consulting experiences, skills sharing and renown service offerings to do what we can do for you.</paragraph>
|
||||
<paragraph><location><page_3><loc_46><loc_56><loc_79><loc_59></location>No one else has the vast consulting experiences, skills sharing and renown service offerings to do what we can do for you.</paragraph>
|
||||
<paragraph><location><page_3><loc_46><loc_54><loc_60><loc_55></location>Because no one else is IBM.</paragraph>
|
||||
<paragraph><location><page_3><loc_46><loc_46><loc_82><loc_52></location>With combined experiences and direct access to development groups, we're the experts in IBM DB2® for i. The DB2 for i Center of Excellence (CoE) can help you achieve-perhaps reexamine and exceed-your business requirements and gain more confidence and satisfaction in IBM product data management products and solutions.</paragraph>
|
||||
<subtitle-level-1><location><page_3><loc_46><loc_44><loc_71><loc_45></location>Who we are, some of what we do</subtitle-level-1>
|
||||
@@ -36,7 +36,7 @@
|
||||
<paragraph><location><page_3><loc_46><loc_40><loc_66><loc_41></location>- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> Database performance and scalability</paragraph>
|
||||
<paragraph><location><page_3><loc_46><loc_39><loc_69><loc_39></location>- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> Advanced SQL knowledge and skills transfer</paragraph>
|
||||
<paragraph><location><page_3><loc_46><loc_37><loc_64><loc_38></location>- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> Business intelligence and analytics</paragraph>
|
||||
<paragraph><location><page_3><loc_46><loc_36><loc_56><loc_37></location>- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> DB2 Web Query</paragraph>
|
||||
<paragraph><location><page_3><loc_46><loc_36><loc_56><loc_37></location>- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> DB2Web Query</paragraph>
|
||||
<paragraph><location><page_3><loc_46><loc_35><loc_82><loc_36></location>- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> Query/400 modernization for better reporting and analysis capabilities</paragraph>
|
||||
<paragraph><location><page_3><loc_46><loc_33><loc_69><loc_34></location>- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> Database modernization and re-engineering</paragraph>
|
||||
<paragraph><location><page_3><loc_46><loc_32><loc_65><loc_33></location>- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> Data-centric architecture and design</paragraph>
|
||||
@@ -49,7 +49,7 @@
|
||||
<figure>
|
||||
<location><page_4><loc_23><loc_36><loc_41><loc_53></location>
|
||||
</figure>
|
||||
<paragraph><location><page_4><loc_43><loc_35><loc_89><loc_53></location>Jim Bainbridge is a senior DB2 consultant on the DB2 for i Center of Excellence team in the IBM Lab Services and Training organization. His primary role is training and implementation services for IBM DB2 Web Query for i and business analytics. Jim began his career with IBM 30 years ago in the IBM Rochester Development Lab, where he developed cooperative processing products that paired IBM PCs with IBM S/36 and AS/.400 systems. In the years since, Jim has held numerous technical roles, including independent software vendors technical support on a broad range of IBM technologies and products, and supporting customers in the IBM Executive Briefing Center and IBM Project Office.</paragraph>
|
||||
<paragraph><location><page_4><loc_43><loc_35><loc_88><loc_53></location>Jim Bainbridge is a senior DB2 consultant on the DB2 for i Center of Excellence team in the IBM Lab Services and Training organization. His primary role is training and implementation services for IBM DB2 Web Query for i and business analytics. Jim began his career with IBM 30 years ago in the IBM Rochester Development Lab, where he developed cooperative processing products that paired IBM PCs with IBM S/36 and AS/.400 systems. In the years since, Jim has held numerous technical roles, including independent software vendors technical support on a broad range of IBM technologies and products, and supporting customers in the IBM Executive Briefing Center and IBM Project Office.</paragraph>
|
||||
<figure>
|
||||
<location><page_4><loc_24><loc_20><loc_41><loc_33></location>
|
||||
</figure>
|
||||
@@ -60,77 +60,81 @@
|
||||
</figure>
|
||||
<paragraph><location><page_5><loc_82><loc_84><loc_85><loc_88></location>1</paragraph>
|
||||
<paragraph><location><page_5><loc_13><loc_65><loc_19><loc_66></location>Chapter 1.</paragraph>
|
||||
<subtitle-level-1><location><page_5><loc_22><loc_61><loc_90><loc_68></location>Securing and protecting IBM DB2 data</subtitle-level-1>
|
||||
<paragraph><location><page_5><loc_22><loc_46><loc_89><loc_56></location>Recent news headlines are filled with reports of data breaches and cyber-attacks impacting global businesses of all sizes. The Identity Theft Resource Center 1 reports that almost 5000 data breaches have occurred since 2005, exposing over 600 million records of data. The financial cost of these data breaches is skyrocketing. Studies from the Ponemon Institute 2 revealed that the average cost of a data breach increased in 2013 by 15% globally and resulted in a brand equity loss of $9.4 million per attack. The average cost that is incurred for each lost record containing sensitive information increased more than 9% to $145 per record.</paragraph>
|
||||
<subtitle-level-1><location><page_5><loc_22><loc_61><loc_89><loc_68></location>Securing and protecting IBM DB2 data</subtitle-level-1>
|
||||
<paragraph><location><page_5><loc_22><loc_46><loc_89><loc_56></location>Recent news headlines are filled with reports of data breaches and cyber-attacks impacting global businesses of all sizes. The Identity Theft Resource Center 1 reports that almost 5000 data breaches have occurred since 2005, exposing over 600 million records of data. The financial cost of these data breaches is skyrocketing. Studies from the Ponemon Institute 2 revealed that the average cost of a data breach increased in 2013 by 15% globally and resulted in a brand equity loss of $9.4 million per attack. The average cost that is incurred for each lost record containing sensitive information increased more than 9% to $145 per record.</paragraph>
|
||||
<paragraph><location><page_5><loc_22><loc_38><loc_86><loc_44></location>Businesses must make a serious effort to secure their data and recognize that securing information assets is a cost of doing business. In many parts of the world and in many industries, securing the data is required by law and subject to audits. Data security is no longer an option; it is a requirement.</paragraph>
|
||||
<paragraph><location><page_5><loc_22><loc_34><loc_89><loc_37></location>This chapter describes how you can secure and protect data in DB2 for i. The following topics are covered in this chapter:</paragraph>
|
||||
<paragraph><location><page_5><loc_22><loc_32><loc_41><loc_33></location>- /SM590000 Security fundamentals</paragraph>
|
||||
<paragraph><location><page_5><loc_22><loc_30><loc_46><loc_32></location>- /SM590000 Current state of IBM i security</paragraph>
|
||||
<paragraph><location><page_5><loc_22><loc_29><loc_43><loc_30></location>- /SM590000 DB2 for i security controls</paragraph>
|
||||
<subtitle-level-1><location><page_6><loc_11><loc_89><loc_44><loc_91></location>1.1 Security fundamentals</subtitle-level-1>
|
||||
<subtitle-level-1><location><page_6><loc_11><loc_89><loc_44><loc_91></location>1.1 Security fundamentals</subtitle-level-1>
|
||||
<paragraph><location><page_6><loc_22><loc_84><loc_89><loc_87></location>Before reviewing database security techniques, there are two fundamental steps in securing information assets that must be described:</paragraph>
|
||||
<paragraph><location><page_6><loc_22><loc_77><loc_89><loc_83></location>- /SM590000 First, and most important, is the definition of a company's security policy . Without a security policy, there is no definition of what are acceptable practices for using, accessing, and storing information by who, what, when, where, and how. A security policy should minimally address three things: confidentiality, integrity, and availability.</paragraph>
|
||||
<paragraph><location><page_6><loc_25><loc_66><loc_89><loc_76></location>- The monitoring and assessment of adherence to the security policy determines whether your security strategy is working. Often, IBM security consultants are asked to perform security assessments for companies without regard to the security policy. Although these assessments can be useful for observing how the system is defined and how data is being accessed, they cannot determine the level of security without a security policy. Without a security policy, it really is not an assessment as much as it is a baseline for monitoring the changes in the security settings that are captured.</paragraph>
|
||||
<paragraph><location><page_6><loc_25><loc_64><loc_89><loc_65></location>A security policy is what defines whether the system and its settings are secure (or not).</paragraph>
|
||||
<paragraph><location><page_6><loc_25><loc_64><loc_88><loc_65></location>A security policy is what defines whether the system and its settings are secure (or not).</paragraph>
|
||||
<paragraph><location><page_6><loc_22><loc_53><loc_89><loc_63></location>- /SM590000 The second fundamental in securing data assets is the use of resource security . If implemented properly, resource security prevents data breaches from both internal and external intrusions. Resource security controls are closely tied to the part of the security policy that defines who should have access to what information resources. A hacker might be good enough to get through your company firewalls and sift his way through to your system, but if they do not have explicit access to your database, the hacker cannot compromise your information assets.</paragraph>
|
||||
<paragraph><location><page_6><loc_22><loc_48><loc_87><loc_51></location>With your eyes now open to the importance of securing information assets, the rest of this chapter reviews the methods that are available for securing database resources on IBM i.</paragraph>
|
||||
<subtitle-level-1><location><page_6><loc_11><loc_43><loc_53><loc_45></location>1.2 Current state of IBM i security</subtitle-level-1>
|
||||
<subtitle-level-1><location><page_6><loc_11><loc_43><loc_53><loc_45></location>1.2 Current state of IBM i security</subtitle-level-1>
|
||||
<paragraph><location><page_6><loc_22><loc_35><loc_89><loc_41></location>Because of the inherently secure nature of IBM i, many clients rely on the default system settings to protect their business data that is stored in DB2 for i. In most cases, this means no data protection because the default setting for the Create default public authority (QCRTAUT) system value is *CHANGE.</paragraph>
|
||||
<paragraph><location><page_6><loc_22><loc_26><loc_89><loc_33></location>Even more disturbing is that many IBM i clients remain in this state, despite the news headlines and the significant costs that are involved with databases being compromised. This default security configuration makes it quite challenging to implement basic security policies. A tighter implementation is required if you really want to protect one of your company's most valuable assets, which is the data.</paragraph>
|
||||
<paragraph><location><page_6><loc_22><loc_14><loc_89><loc_24></location>Traditionally, IBM i applications have employed menu-based security to counteract this default configuration that gives all users access to the data. The theory is that data is protected by the menu options controlling what database operations that the user can perform. This approach is ineffective, even if the user profile is restricted from running interactive commands. The reason is that in today's connected world there are a multitude of interfaces into the system, from web browsers to PC clients, that bypass application menus. If there are no object-level controls, users of these newer interfaces have an open door to your data.</paragraph>
|
||||
<paragraph><location><page_7><loc_22><loc_81><loc_89><loc_91></location>Many businesses are trying to limit data access to a need-to-know basis. This security goal means that users should be given access only to the minimum set of data that is required to perform their job. Often, users with object-level access are given access to row and column values that are beyond what their business task requires because that object-level security provides an all-or-nothing solution. For example, object-level controls allow a manager to access data about all employees. Most security policies limit a manager to accessing data only for the employees that they manage.</paragraph>
|
||||
<subtitle-level-1><location><page_7><loc_11><loc_77><loc_49><loc_78></location>1.3.1 Existing row and column control</subtitle-level-1>
|
||||
<paragraph><location><page_7><loc_22><loc_81><loc_88><loc_91></location>Many businesses are trying to limit data access to a need-to-know basis. This security goal means that users should be given access only to the minimum set of data that is required to perform their job. Often, users with object-level access are given access to row and column values that are beyond what their business task requires because that object-level security provides an all-or-nothing solution. For example, object-level controls allow a manager to access data about all employees. Most security policies limit a manager to accessing data only for the employees that they manage.</paragraph>
|
||||
<subtitle-level-1><location><page_7><loc_11><loc_77><loc_49><loc_78></location>1.3.1 Existing row and column control</subtitle-level-1>
|
||||
<paragraph><location><page_7><loc_22><loc_68><loc_88><loc_75></location>Some IBM i clients have tried augmenting the all-or-nothing object-level security with SQL views (or logical files) and application logic, as shown in Figure 1-2. However, application-based logic is easy to bypass with all of the different data access interfaces that are provided by the IBM i operating system, such as Open Database Connectivity (ODBC) and System i Navigator.</paragraph>
|
||||
<paragraph><location><page_7><loc_22><loc_60><loc_89><loc_66></location>Using SQL views to limit access to a subset of the data in a table also has its own set of challenges. First, there is the complexity of managing all of the SQL view objects that are used for securing data access. Second, scaling a view-based security solution can be difficult as the amount of data grows and the number of users increases.</paragraph>
|
||||
<paragraph><location><page_7><loc_22><loc_54><loc_89><loc_59></location>Even if you are willing to live with these performance and management issues, a user with *ALLOBJ access still can directly access all of the data in the underlying DB2 table and easily bypass the security controls that are built into an SQL view.</paragraph>
|
||||
<figure>
|
||||
<location><page_7><loc_22><loc_13><loc_89><loc_53></location>
|
||||
<caption>Figure 1-2 Existing row and column controls</caption>
|
||||
<caption>Figure 1-2 Existing row and column controls</caption>
|
||||
</figure>
|
||||
<subtitle-level-1><location><page_8><loc_11><loc_89><loc_55><loc_91></location>2.1.6 Change Function Usage CL command</subtitle-level-1>
|
||||
<subtitle-level-1><location><page_8><loc_11><loc_89><loc_55><loc_91></location>2.1.6 Change Function Usage CL command</subtitle-level-1>
|
||||
<paragraph><location><page_8><loc_22><loc_87><loc_89><loc_88></location>The following CL commands can be used to work with, display, or change function usage IDs:</paragraph>
|
||||
<paragraph><location><page_8><loc_22><loc_84><loc_49><loc_86></location>- /SM590000 Work Function Usage ( WRKFCNUSG )</paragraph>
|
||||
<paragraph><location><page_8><loc_22><loc_83><loc_51><loc_84></location>- /SM590000 Change Function Usage ( CHGFCNUSG )</paragraph>
|
||||
<paragraph><location><page_8><loc_22><loc_81><loc_51><loc_83></location>- /SM590000 Display Function Usage ( DSPFCNUSG )</paragraph>
|
||||
<paragraph><location><page_8><loc_22><loc_77><loc_84><loc_80></location>For example, the following CHGFCNUSG command shows granting authorization to user HBEDOYA to administer and manage RCAC rules:</paragraph>
|
||||
<paragraph><location><page_8><loc_22><loc_77><loc_83><loc_80></location>For example, the following CHGFCNUSG command shows granting authorization to user HBEDOYA to administer and manage RCAC rules:</paragraph>
|
||||
<paragraph><location><page_8><loc_22><loc_75><loc_72><loc_76></location>CHGFCNUSG FCNID(QIBM_DB_SECADM) USER(HBEDOYA) USAGE(*ALLOWED)</paragraph>
|
||||
<subtitle-level-1><location><page_8><loc_11><loc_71><loc_89><loc_72></location>2.1.7 Verifying function usage IDs for RCAC with the FUNCTION_USAGE view</subtitle-level-1>
|
||||
<paragraph><location><page_8><loc_22><loc_66><loc_85><loc_69></location>The FUNCTION_USAGE view contains function usage configuration details. Table 2-1 describes the columns in the FUNCTION_USAGE view.</paragraph>
|
||||
<subtitle-level-1><location><page_8><loc_11><loc_71><loc_89><loc_72></location>2.1.7 Verifying function usage IDs for RCAC with the FUNCTION_USAGE view</subtitle-level-1>
|
||||
<paragraph><location><page_8><loc_22><loc_66><loc_84><loc_69></location>The FUNCTION_USAGE view contains function usage configuration details. Table 2-1 describes the columns in the FUNCTION_USAGE view.</paragraph>
|
||||
<table>
|
||||
<location><page_8><loc_22><loc_44><loc_89><loc_63></location>
|
||||
<caption>Table 2-1 FUNCTION_USAGE view</caption>
|
||||
<caption>Table 2-1 FUNCTION_USAGE view</caption>
|
||||
<row_0><col_0><col_header>Column name</col_0><col_1><col_header>Data type</col_1><col_2><col_header>Description</col_2></row_0>
|
||||
<row_1><col_0><body>FUNCTION_ID</col_0><col_1><body>VARCHAR(30)</col_1><col_2><body>ID of the function.</col_2></row_1>
|
||||
<row_2><col_0><body>USER_NAME</col_0><col_1><body>VARCHAR(10)</col_1><col_2><body>Name of the user profile that has a usage setting for this function.</col_2></row_2>
|
||||
<row_2><col_0><body>USER_NAME</col_0><col_1><body>VARCHAR(10)</col_1><col_2><body>Name of the user profile that has a usage setting for this function.</col_2></row_2>
|
||||
<row_3><col_0><body>USAGE</col_0><col_1><body>VARCHAR(7)</col_1><col_2><body>Usage setting: /SM590000 ALLOWED: The user profile is allowed to use the function. /SM590000 DENIED: The user profile is not allowed to use the function.</col_2></row_3>
|
||||
<row_4><col_0><body>USER_TYPE</col_0><col_1><body>VARCHAR(5)</col_1><col_2><body>Type of user profile: /SM590000 USER: The user profile is a user. /SM590000 GROUP: The user profile is a group.</col_2></row_4>
|
||||
</table>
|
||||
<caption><location><page_8><loc_22><loc_64><loc_46><loc_65></location>Table 2-1 FUNCTION_USAGE view</caption>
|
||||
<caption><location><page_8><loc_22><loc_64><loc_46><loc_65></location>Table 2-1 FUNCTION_USAGE view</caption>
|
||||
<paragraph><location><page_8><loc_22><loc_40><loc_89><loc_43></location>To discover who has authorization to define and manage RCAC, you can use the query that is shown in Example 2-1.</paragraph>
|
||||
<caption><location><page_8><loc_22><loc_38><loc_76><loc_39></location>Example 2-1 Query to determine who has authority to define and manage RCAC</caption>
|
||||
<paragraph><location><page_8><loc_22><loc_35><loc_41><loc_36></location>SELECT function_id,</paragraph>
|
||||
<paragraph><location><page_8><loc_22><loc_34><loc_39><loc_35></location>user_name,</paragraph>
|
||||
<paragraph><location><page_8><loc_22><loc_32><loc_36><loc_33></location>usage,</paragraph>
|
||||
<paragraph><location><page_8><loc_22><loc_31><loc_39><loc_32></location>user_type</paragraph>
|
||||
<paragraph><location><page_8><loc_22><loc_29><loc_43><loc_30></location>FROM function_usage</paragraph>
|
||||
<paragraph><location><page_8><loc_22><loc_28><loc_54><loc_29></location>WHERE function_id='QIBM_DB_SECADM'</paragraph>
|
||||
<paragraph><location><page_8><loc_22><loc_26><loc_39><loc_27></location>ORDER BY user_name;</paragraph>
|
||||
<subtitle-level-1><location><page_8><loc_11><loc_20><loc_41><loc_22></location>2.2 Separation of duties</subtitle-level-1>
|
||||
<caption><location><page_8><loc_22><loc_38><loc_76><loc_39></location>Example 2-1 Query to determine who has authority to define and manage RCAC</caption>
|
||||
<paragraph><location><page_8><loc_22><loc_35><loc_27><loc_36></location>SELECT</paragraph>
|
||||
<paragraph><location><page_8><loc_31><loc_35><loc_41><loc_36></location>function_id,</paragraph>
|
||||
<paragraph><location><page_8><loc_31><loc_34><loc_39><loc_35></location>user_name,</paragraph>
|
||||
<paragraph><location><page_8><loc_31><loc_32><loc_36><loc_33></location>usage,</paragraph>
|
||||
<paragraph><location><page_8><loc_31><loc_31><loc_39><loc_32></location>user_type</paragraph>
|
||||
<paragraph><location><page_8><loc_22><loc_29><loc_26><loc_30></location>FROM</paragraph>
|
||||
<paragraph><location><page_8><loc_31><loc_29><loc_43><loc_30></location>function_usage</paragraph>
|
||||
<paragraph><location><page_8><loc_22><loc_28><loc_26><loc_29></location>WHERE</paragraph>
|
||||
<paragraph><location><page_8><loc_31><loc_28><loc_54><loc_29></location>function_id='QIBM_DB_SECADM'</paragraph>
|
||||
<paragraph><location><page_8><loc_22><loc_26><loc_29><loc_27></location>ORDER BY</paragraph>
|
||||
<paragraph><location><page_8><loc_31><loc_26><loc_39><loc_27></location>user_name;</paragraph>
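The Example 2-1 query, split across the extracted paragraphs above, assembles into a single statement:

```sql
-- Example 2-1 (reassembled from the fragments above):
-- who has authority to define and manage RCAC
SELECT function_id,
       user_name,
       usage,
       user_type
  FROM function_usage
 WHERE function_id = 'QIBM_DB_SECADM'
 ORDER BY user_name;
```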
<subtitle-level-1><location><page_8><loc_11><loc_20><loc_41><loc_22></location>2.2 Separation of duties</subtitle-level-1>
|
||||
<paragraph><location><page_8><loc_22><loc_10><loc_89><loc_18></location>Separation of duties helps businesses comply with industry regulations or organizational requirements and simplifies the management of authorities. Separation of duties is commonly used to prevent fraudulent activities or errors by a single person. It provides the ability for administrative functions to be divided across individuals without overlapping responsibilities, so that one user does not possess unlimited authority, such as with the *ALLOBJ authority.</paragraph>
|
||||
<paragraph><location><page_9><loc_22><loc_82><loc_89><loc_91></location>For example, assume that a business has assigned the duty to manage security on IBM i to Theresa. Before release IBM i 7.2, to grant privileges, Theresa had to have the same privileges Theresa was granting to others. Therefore, to grant *USE privileges to the PAYROLL table, Theresa had to have *OBJMGT and *USE authority (or a higher level of authority, such as *ALLOBJ). This requirement allowed Theresa to access the data in the PAYROLL table even though Theresa's job description was only to manage its security.</paragraph>
|
||||
<paragraph><location><page_9><loc_22><loc_82><loc_88><loc_91></location>For example, assume that a business has assigned the duty to manage security on IBM i to Theresa. Before release IBM i 7.2, to grant privileges, Theresa had to have the same privileges Theresa was granting to others. Therefore, to grant *USE privileges to the PAYROLL table, Theresa had to have *OBJMGT and *USE authority (or a higher level of authority, such as *ALLOBJ). This requirement allowed Theresa to access the data in the PAYROLL table even though Theresa's job description was only to manage its security.</paragraph>
|
||||
<paragraph><location><page_9><loc_22><loc_75><loc_89><loc_81></location>In IBM i 7.2, the QIBM_DB_SECADM function usage grants authorities, revokes authorities, changes ownership, or changes the primary group without giving access to the object or, in the case of a database table, to the data that is in the table or allowing other operations on the table.</paragraph>
|
||||
<paragraph><location><page_9><loc_22><loc_71><loc_88><loc_73></location>QIBM_DB_SECADM function usage can be granted only by a user with *SECADM special authority and can be given to a user or a group.</paragraph>
|
||||
<paragraph><location><page_9><loc_22><loc_65><loc_89><loc_69></location>QIBM_DB_SECADM also is responsible for administering RCAC, which restricts which rows a user is allowed to access in a table and whether a user is allowed to see information in certain columns of a table.</paragraph>
|
||||
<paragraph><location><page_9><loc_22><loc_57><loc_88><loc_63></location>A preferred practice is that the RCAC administrator has the QIBM_DB_SECADM function usage ID, but absolutely no other data privileges. The result is that the RCAC administrator can deploy and maintain the RCAC constructs, but cannot grant themselves unauthorized access to data itself.</paragraph>
|
||||
<paragraph><location><page_9><loc_22><loc_53><loc_89><loc_56></location>Table 2-2 shows a comparison of the different function usage IDs and *JOBCTL authority to the different CL commands and DB2 for i tools.</paragraph>
|
||||
<paragraph><location><page_9><loc_22><loc_53><loc_88><loc_56></location>Table 2-2 shows a comparison of the different function usage IDs and *JOBCTL authority to the different CL commands and DB2 for i tools.</paragraph>
|
||||
<table>
|
||||
<location><page_9><loc_11><loc_9><loc_89><loc_50></location>
|
||||
<caption>Table 2-2 Comparison of the different function usage IDs and *JOBCTL authority</caption>
|
||||
<caption>Table 2-2 Comparison of the different function usage IDs and *JOBCTL authority</caption>
|
||||
<row_0><col_0><body>User action</col_0><col_1><col_header>*JOBCTL</col_1><col_2><col_header>QIBM_DB_SECADM</col_2><col_3><col_header>QIBM_DB_SQLADM</col_3><col_4><col_header>QIBM_DB_SYSMON</col_4><col_5><col_header>No Authority</col_5></row_0>
|
||||
<row_1><col_0><row_header>SET CURRENT DEGREE (SQL statement)</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_1>
|
||||
<row_2><col_0><row_header>CHGQRYA command targeting a different user's job</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_2>
|
||||
<row_3><col_0><row_header>STRDBMON or ENDDBMON commands targeting a different user's job</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_3>
|
||||
<row_4><col_0><row_header>STRDBMON or ENDDBMON commands targeting a job that matches the current user</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body>X</col_4><col_5><body>X</col_5></row_4>
|
||||
<row_1><col_0><row_header>SET CURRENT DEGREE (SQL statement)</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_1>
|
||||
<row_2><col_0><row_header>CHGQRYA command targeting a different user's job</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_2>
|
||||
<row_3><col_0><row_header>STRDBMON or ENDDBMON commands targeting a different user's job</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_3>
|
||||
<row_4><col_0><row_header>STRDBMON or ENDDBMON commands targeting a job that matches the current user</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body>X</col_4><col_5><body>X</col_5></row_4>
|
||||
<row_5><col_0><row_header>QUSRJOBI() API format 900 or System i Navigator's SQL Details for Job</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body>X</col_4><col_5><body></col_5></row_5>
|
||||
<row_6><col_0><row_header>Visual Explain within Run SQL scripts</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body>X</col_4><col_5><body>X</col_5></row_6>
|
||||
<row_7><col_0><row_header>Visual Explain outside of Run SQL scripts</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_7>
|
||||
@@ -140,24 +144,24 @@
|
||||
<row_11><col_0><row_header>MODIFY PLAN CACHE PROPERTIES procedure (currently does not check authority)</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_11>
|
||||
<row_12><col_0><row_header>CHANGE PLAN CACHE SIZE procedure (currently does not check authority)</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_12>
|
||||
</table>
|
||||
<caption><location><page_9><loc_11><loc_51><loc_64><loc_52></location>Table 2-2 Comparison of the different function usage IDs and *JOBCTL authority</caption>
|
||||
<caption><location><page_9><loc_11><loc_51><loc_64><loc_52></location>Table 2-2 Comparison of the different function usage IDs and *JOBCTL authority</caption>
|
||||
<caption><location><page_10><loc_22><loc_88><loc_86><loc_91></location>The SQL CREATE PERMISSION statement that is shown in Figure 3-1 is used to define and initially enable or disable the row access rules.</caption>
|
||||
<figure>
|
||||
<location><page_10><loc_22><loc_48><loc_89><loc_86></location>
|
||||
<caption>Figure 3-1 CREATE PERMISSION SQL statement</caption>
|
||||
<caption>Figure 3-1 CREATE PERMISSION SQL statement</caption>
|
||||
</figure>
|
||||
<subtitle-level-1><location><page_10><loc_22><loc_43><loc_35><loc_44></location>Column mask</subtitle-level-1>
|
||||
<paragraph><location><page_10><loc_22><loc_37><loc_89><loc_43></location>A column mask is a database object that manifests a column value access control rule for a specific column in a specific table. It uses a CASE expression that describes what you see when you access the column. For example, a teller can see only the last four digits of a tax identification number.</paragraph>
|
||||
<paragraph><location><page_10><loc_22><loc_37><loc_88><loc_43></location>A column mask is a database object that manifests a column value access control rule for a specific column in a specific table. It uses a CASE expression that describes what you see when you access the column. For example, a teller can see only the last four digits of a tax identification number.</paragraph>
|
||||
<caption><location><page_11><loc_22><loc_90><loc_67><loc_91></location>Table 3-1 summarizes these special registers and their values.</caption>
|
||||
<table>
|
||||
<location><page_11><loc_22><loc_74><loc_89><loc_87></location>
|
||||
<caption>Table 3-1 Special registers and their corresponding values</caption>
|
||||
<caption>Table 3-1 Special registers and their corresponding values</caption>
|
||||
<row_0><col_0><col_header>Special register</col_0><col_1><col_header>Corresponding value</col_1></row_0>
|
||||
<row_1><col_0><body>USER or SESSION_USER</col_0><col_1><body>The effective user of the thread excluding adopted authority.</col_1></row_1>
|
||||
<row_2><col_0><body>CURRENT_USER</col_0><col_1><body>The effective user of the thread including adopted authority. When no adopted authority is present, this has the same value as USER.</col_1></row_2>
|
||||
<row_2><col_0><body>CURRENT_USER</col_0><col_1><body>The effective user of the thread including adopted authority. When no adopted authority is present, this has the same value as USER.</col_1></row_2>
|
||||
<row_3><col_0><body>SYSTEM_USER</col_0><col_1><body>The authorization ID that initiated the connection.</col_1></row_3>
|
||||
</table>
|
||||
<caption><location><page_11><loc_22><loc_87><loc_61><loc_88></location>Table 3-1 Special registers and their corresponding values</caption>
|
||||
<caption><location><page_11><loc_22><loc_87><loc_61><loc_88></location>Table 3-1 Special registers and their corresponding values</caption>
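A quick way to observe the Table 3-1 registers for the current connection is to select them directly. This is a minimal sketch that assumes the standard SYSIBM.SYSDUMMY1 one-row table; CURRENT USER can be added in the same way on releases that support it:

```sql
-- Minimal sketch: inspect the special registers for this connection
-- (SYSIBM.SYSDUMMY1 is assumed as a convenient one-row table)
SELECT USER, SESSION_USER, SYSTEM_USER
  FROM SYSIBM.SYSDUMMY1;
```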
<paragraph><location><page_11><loc_22><loc_70><loc_88><loc_73></location>Figure 3-5 shows the difference in the special register values when an adopted authority is used:</paragraph>
|
||||
<paragraph><location><page_11><loc_22><loc_68><loc_67><loc_69></location>- /SM590000 A user connects to the server using the user profile ALICE.</paragraph>
|
||||
<paragraph><location><page_11><loc_22><loc_66><loc_74><loc_67></location>- /SM590000 USER and CURRENT USER initially have the same value of ALICE.</paragraph>
|
||||
@@ -166,15 +170,15 @@
|
||||
<paragraph><location><page_11><loc_22><loc_53><loc_89><loc_56></location>- /SM590000 When proc1 ends, the session reverts to its original state with both USER and CURRENT USER having the value of ALICE.</paragraph>
|
||||
<figure>
|
||||
<location><page_11><loc_22><loc_25><loc_49><loc_51></location>
|
||||
<caption>Figure 3-5 Special registers and adopted authority</caption>
|
||||
<caption>Figure 3-5 Special registers and adopted authority</caption>
|
||||
</figure>
|
||||
<subtitle-level-1><location><page_11><loc_11><loc_20><loc_40><loc_21></location>3.2.2 Built-in global variables</subtitle-level-1>
|
||||
<paragraph><location><page_11><loc_22><loc_15><loc_85><loc_18></location>Built-in global variables are provided with the database manager and are used in SQL statements to retrieve scalar values that are associated with the variables.</paragraph>
|
||||
<subtitle-level-1><location><page_11><loc_11><loc_20><loc_40><loc_21></location>3.2.2 Built-in global variables</subtitle-level-1>
|
||||
<paragraph><location><page_11><loc_22><loc_15><loc_84><loc_18></location>Built-in global variables are provided with the database manager and are used in SQL statements to retrieve scalar values that are associated with the variables.</paragraph>
|
||||
<paragraph><location><page_11><loc_22><loc_9><loc_87><loc_13></location>IBM DB2 for i supports nine different built-in global variables that are read only and maintained by the system. These global variables can be used to identify attributes of the database connection and used as part of the RCAC logic.</paragraph>
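As a hedged sketch of how one of these variables could feed RCAC logic, the following hypothetical row permission keeps rows visible only to clients connecting from a 10.x.x.x address. The permission name and table are illustrative, and schema qualification of CLIENT_IPADDR (commonly SYSIBM) may be required:

```sql
-- Hypothetical sketch: a row permission driven by a built-in global variable
CREATE PERMISSION BANK_SCHEMA.NET_ROW_ACCESS
   ON BANK_SCHEMA.CUSTOMERS
   FOR ROWS WHERE CLIENT_IPADDR LIKE '10.%'   -- built-in global variable
   ENFORCED FOR ALL ACCESS
   ENABLE;
```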
<paragraph><location><page_12><loc_22><loc_90><loc_56><loc_91></location>Table 3-2 lists the nine built-in global variables.</paragraph>
|
||||
<table>
|
||||
<location><page_12><loc_10><loc_63><loc_90><loc_87></location>
|
||||
<caption>Table 3-2 Built-in global variables</caption>
|
||||
<caption>Table 3-2 Built-in global variables</caption>
|
||||
<row_0><col_0><col_header>Global variable</col_0><col_1><col_header>Type</col_1><col_2><col_header>Description</col_2></row_0>
|
||||
<row_1><col_0><body>CLIENT_HOST</col_0><col_1><body>VARCHAR(255)</col_1><col_2><body>Host name of the current client as returned by the system</col_2></row_1>
|
||||
<row_2><col_0><body>CLIENT_IPADDR</col_0><col_1><body>VARCHAR(128)</col_1><col_2><body>IP address of the current client as returned by the system</col_2></row_2>
|
||||
@@ -186,63 +190,70 @@
|
||||
<row_8><col_0><body>ROUTINE_SPECIFIC_NAME</col_0><col_1><body>VARCHAR(128)</col_1><col_2><body>Name of the currently running routine</col_2></row_8>
|
||||
<row_9><col_0><body>ROUTINE_TYPE</col_0><col_1><body>CHAR(1)</col_1><col_2><body>Type of the currently running routine</col_2></row_9>
|
||||
</table>
|
||||
<caption><location><page_12><loc_11><loc_87><loc_33><loc_88></location>Table 3-2 Built-in global variables</caption>
|
||||
<subtitle-level-1><location><page_12><loc_11><loc_57><loc_63><loc_59></location>3.3 VERIFY_GROUP_FOR_USER function</subtitle-level-1>
|
||||
<caption><location><page_12><loc_11><loc_87><loc_33><loc_88></location>Table 3-2 Built-in global variables</caption>
|
||||
<subtitle-level-1><location><page_12><loc_11><loc_57><loc_63><loc_59></location>3.3 VERIFY_GROUP_FOR_USER function</subtitle-level-1>
|
||||
<paragraph><location><page_12><loc_22><loc_45><loc_89><loc_55></location>The VERIFY_GROUP_FOR_USER function was added in IBM i 7.2. Although it is primarily intended for use with RCAC permissions and masks, it can be used in other SQL statements. The first parameter must be one of these three special registers: SESSION_USER, USER, or CURRENT_USER. The second and subsequent parameters are a list of user or group profiles. Each of these values must be 1 - 10 characters in length. These values are not validated for their existence, which means that you can specify the names of user profiles that do not exist without receiving any kind of error.</paragraph>
|
||||
<paragraph><location><page_12><loc_22><loc_39><loc_89><loc_43></location>If a special register value is in the list of user profiles or it is a member of a group profile included in the list, the function returns a long integer value of 1. Otherwise, it returns a value of 0. It never returns the null value.</paragraph>
|
||||
<paragraph><location><page_12><loc_22><loc_36><loc_75><loc_38></location>Here is an example of using the VERIFY_GROUP_FOR_USER function:</paragraph>
|
||||
<paragraph><location><page_12><loc_22><loc_34><loc_66><loc_35></location>- 1. There are user profiles for MGR, JANE, JUDY, and TONY.</paragraph>
|
||||
<paragraph><location><page_12><loc_22><loc_32><loc_65><loc_33></location>- 2. The user profile JANE specifies a group profile of MGR.</paragraph>
|
||||
<paragraph><location><page_12><loc_22><loc_28><loc_88><loc_31></location>- 3. If a user is connected to the server using user profile JANE, all of the following function invocations return a value of 1:</paragraph>
|
||||
<paragraph><location><page_12><loc_25><loc_19><loc_74><loc_27></location>VERIFY_GROUP_FOR_USER (CURRENT_USER, 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR', 'STEVE') The following function invocation returns a value of 0: VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JUDY', 'TONY')</paragraph>
|
||||
<paragraph><location><page_12><loc_22><loc_28><loc_87><loc_31></location>- 3. If a user is connected to the server using user profile JANE, all of the following function invocations return a value of 1:</paragraph>
|
||||
<paragraph><location><page_12><loc_25><loc_19><loc_67><loc_27></location>VERIFY_GROUP_FOR_USER (CURRENT_USER, 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR', The following function invocation returns a value of 0: VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JUDY', 'TONY')</paragraph>
|
||||
<paragraph><location><page_12><loc_67><loc_23><loc_74><loc_24></location>'STEVE')</paragraph>
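Because VERIFY_GROUP_FOR_USER can appear in any SQL statement, it can also gate an ordinary query. The following is a hypothetical sketch that reuses the HR_SCHEMA.EMPLOYEES table and USER_ID column from the later mask examples:

```sql
-- Hypothetical sketch: return rows only when the session user
-- belongs to the HR or MGR group profiles
SELECT e.USER_ID
  FROM HR_SCHEMA.EMPLOYEES e
 WHERE QSYS2.VERIFY_GROUP_FOR_USER(SESSION_USER, 'HR', 'MGR') = 1;
```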
<paragraph><location><page_13><loc_22><loc_90><loc_27><loc_91></location>RETURN</paragraph>
|
||||
<paragraph><location><page_13><loc_22><loc_88><loc_26><loc_89></location>CASE</paragraph>
|
||||
<paragraph><location><page_13><loc_22><loc_67><loc_85><loc_88></location>WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR', 'EMP' ) = 1 THEN EMPLOYEES . DATE_OF_BIRTH WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER = EMPLOYEES . USER_ID THEN EMPLOYEES . DATE_OF_BIRTH WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER <> EMPLOYEES . USER_ID THEN ( 9999 || '-' || MONTH ( EMPLOYEES . DATE_OF_BIRTH ) || '-' || DAY (EMPLOYEES.DATE_OF_BIRTH )) ELSE NULL END ENABLE ;</paragraph>
|
||||
<paragraph><location><page_13><loc_23><loc_67><loc_85><loc_88></location>WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR', 'EMP' ) = 1 THEN EMPLOYEES . DATE_OF_BIRTH WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER = EMPLOYEES . USER_ID THEN EMPLOYEES . DATE_OF_BIRTH WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER <> EMPLOYEES . USER_ID THEN ( 9999 || '-' || MONTH ( EMPLOYEES . DATE_OF_BIRTH ) || '-' || DAY (EMPLOYEES.DATE_OF_BIRTH )) ELSE NULL END ENABLE ;</paragraph>
|
||||
<paragraph><location><page_13><loc_22><loc_63><loc_89><loc_65></location>- 2. The other column to mask in this example is the TAX_ID information. In this example, the rules to enforce include the following ones:</paragraph>
|
||||
<paragraph><location><page_13><loc_25><loc_60><loc_77><loc_62></location>- -Human Resources can see the unmasked TAX_ID of the employees.</paragraph>
|
||||
<paragraph><location><page_13><loc_25><loc_58><loc_66><loc_59></location>- -Employees can see only their own unmasked TAX_ID.</paragraph>
|
||||
<paragraph><location><page_13><loc_25><loc_55><loc_89><loc_57></location>- -Managers see a masked version of TAX_ID with the first five characters replaced with the X character (for example, XXX-XX-1234).</paragraph>
|
||||
<paragraph><location><page_13><loc_25><loc_52><loc_87><loc_54></location>- -Any other person sees the entire TAX_ID as masked, for example, XXX-XX-XXXX.</paragraph>
|
||||
<paragraph><location><page_13><loc_25><loc_50><loc_87><loc_51></location>- To implement this column mask, run the SQL statement that is shown in Example 3-9.</paragraph>
|
||||
<paragraph><location><page_13><loc_22><loc_14><loc_86><loc_46></location>CREATE MASK HR_SCHEMA.MASK_TAX_ID_ON_EMPLOYEES ON HR_SCHEMA.EMPLOYEES AS EMPLOYEES FOR COLUMN TAX_ID RETURN CASE WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR' ) = 1 THEN EMPLOYEES . TAX_ID WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER = EMPLOYEES . USER_ID THEN EMPLOYEES . TAX_ID WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER <> EMPLOYEES . USER_ID THEN ( 'XXX-XX-' CONCAT QSYS2 . SUBSTR ( EMPLOYEES . TAX_ID , 8 , 4 ) ) WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'EMP' ) = 1 THEN EMPLOYEES . TAX_ID ELSE 'XXX-XX-XXXX' END ENABLE ;</paragraph>
|
||||
<caption><location><page_13><loc_22><loc_48><loc_58><loc_49></location>Example 3-9 Creating a mask on the TAX_ID column</caption>
|
||||
<paragraph><location><page_13><loc_22><loc_14><loc_86><loc_46></location>CREATE MASK HR_SCHEMA.MASK_TAX_ID_ON_EMPLOYEES ON HR_SCHEMA.EMPLOYEES AS EMPLOYEES FOR COLUMN TAX_ID RETURN CASE WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR' ) = 1 THEN EMPLOYEES . TAX_ID WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER = EMPLOYEES . USER_ID THEN EMPLOYEES . TAX_ID WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER <> EMPLOYEES . USER_ID THEN ( 'XXX-XX-' CONCAT QSYS2 . SUBSTR ( EMPLOYEES . TAX_ID , 8 , 4 ) ) WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'EMP' ) = 1 THEN EMPLOYEES . TAX_ID ELSE 'XXX-XX-XXXX' END ENABLE ;</paragraph>
|
||||
<caption><location><page_13><loc_22><loc_48><loc_58><loc_49></location>Example 3-9 Creating a mask on the TAX_ID column</caption>
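Reflowed for readability, the single-line Example 3-9 statement above reads:

```sql
-- Example 3-9 (reflowed): mask on the TAX_ID column
CREATE MASK HR_SCHEMA.MASK_TAX_ID_ON_EMPLOYEES
   ON HR_SCHEMA.EMPLOYEES AS EMPLOYEES
   FOR COLUMN TAX_ID
   RETURN CASE
      WHEN VERIFY_GROUP_FOR_USER(SESSION_USER, 'HR') = 1
         THEN EMPLOYEES.TAX_ID
      WHEN VERIFY_GROUP_FOR_USER(SESSION_USER, 'MGR') = 1
           AND SESSION_USER = EMPLOYEES.USER_ID
         THEN EMPLOYEES.TAX_ID
      WHEN VERIFY_GROUP_FOR_USER(SESSION_USER, 'MGR') = 1
           AND SESSION_USER <> EMPLOYEES.USER_ID
         THEN ('XXX-XX-' CONCAT QSYS2.SUBSTR(EMPLOYEES.TAX_ID, 8, 4))
      WHEN VERIFY_GROUP_FOR_USER(SESSION_USER, 'EMP') = 1
         THEN EMPLOYEES.TAX_ID
      ELSE 'XXX-XX-XXXX'
   END
   ENABLE;
```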
<paragraph><location><page_14><loc_22><loc_90><loc_74><loc_91></location>- 3. Figure 3-10 shows the masks that are created in the HR_SCHEMA.</paragraph>
|
||||
<figure>
|
||||
<location><page_14><loc_10><loc_79><loc_89><loc_88></location>
|
||||
<caption>Figure 3-10 Column masks shown in System i Navigator</caption>
|
||||
<caption>Figure 3-10 Column masks shown in System i Navigator</caption>
|
||||
</figure>
|
||||
<subtitle-level-1><location><page_14><loc_11><loc_73><loc_33><loc_74></location>3.6.6 Activating RCAC</subtitle-level-1>
<paragraph><location><page_14><loc_22><loc_67><loc_89><loc_71></location>Now that you have created the row permission and the two column masks, RCAC must be activated. The row permission and the two column masks are enabled (last clause in the scripts), but now you must activate RCAC on the table. To do so, complete the following steps:</paragraph>
<paragraph><location><page_14><loc_22><loc_65><loc_67><loc_66></location>- 1. Run the SQL statements that are shown in Example 3-10.</paragraph>
<subtitle-level-1><location><page_14><loc_22><loc_62><loc_61><loc_63></location>Example 3-10 Activating RCAC on the EMPLOYEES table</subtitle-level-1>
<paragraph><location><page_14><loc_22><loc_60><loc_62><loc_61></location>- /* Active Row Access Control (permissions) */</paragraph>
<paragraph><location><page_14><loc_22><loc_58><loc_62><loc_59></location>- /* Active Column Access Control (masks) */</paragraph>
<paragraph><location><page_14><loc_22><loc_57><loc_48><loc_58></location>ALTER TABLE HR_SCHEMA.EMPLOYEES</paragraph>
<paragraph><location><page_14><loc_22><loc_55><loc_44><loc_56></location>ACTIVATE ROW ACCESS CONTROL</paragraph>
<paragraph><location><page_14><loc_22><loc_54><loc_48><loc_55></location>ACTIVATE COLUMN ACCESS CONTROL;</paragraph>
<paragraph><location><page_14><loc_22><loc_48><loc_88><loc_52></location>- 2. Look at the definition of the EMPLOYEE table, as shown in Figure 3-11. To do this, from the main navigation pane of System i Navigator, click Schemas HR_SCHEMA Tables , right-click the EMPLOYEES table, and click Definition .</paragraph>
<figure>
<location><page_14><loc_10><loc_18><loc_87><loc_46></location>
<caption>Figure 3-11 Selecting the EMPLOYEES table from System i Navigator</caption>
</figure>
<paragraph><location><page_15><loc_22><loc_87><loc_84><loc_91></location>- 2. Figure 4-68 shows the Visual Explain of the same SQL statement, but with RCAC enabled. It is clear that the implementation of the SQL statement is more complex because the row permission rule becomes part of the WHERE clause.</paragraph>
|
||||
<paragraph><location><page_15><loc_22><loc_32><loc_89><loc_36></location>- 3. Compare the advised indexes that are provided by the Optimizer without RCAC and with RCAC enabled. Figure 4-69 shows the index advice for the SQL statement without RCAC enabled. The index being advised is for the ORDER BY clause.</paragraph>
<figure>
<location><page_15><loc_22><loc_40><loc_89><loc_85></location>
<caption>Figure 4-68 Visual Explain with RCAC enabled</caption>
</figure>
<figure>
<location><page_15><loc_11><loc_16><loc_83><loc_30></location>
<caption>Figure 4-69 Index advice with no RCAC</caption>
</figure>
<paragraph><location><page_16><loc_11><loc_11><loc_82><loc_91></location>THEN C . CUSTOMER_TAX_ID WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'TELLER' ) = 1 THEN ( 'XXX-XX-' CONCAT QSYS2 . SUBSTR ( C . CUSTOMER_TAX_ID , 8 , 4 ) ) WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_TAX_ID ELSE 'XXX-XX-XXXX' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_DRIVERS_LICENSE_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_DRIVERS_LICENSE_NUMBER RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'TELLER' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER ELSE '*************' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_LOGIN_ID_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_LOGIN_ID RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_LOGIN_ID WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_LOGIN_ID ELSE '*****' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_SECURITY_QUESTION_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_SECURITY_QUESTION RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION ELSE '*****' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_SECURITY_QUESTION_ANSWER_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_SECURITY_QUESTION_ANSWER RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION_ANSWER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION_ANSWER ELSE '*****' END ENABLE ; ALTER TABLE BANK_SCHEMA.CUSTOMERS ACTIVATE ROW ACCESS CONTROL ACTIVATE COLUMN ACCESS CONTROL ;</paragraph>
<paragraph><location><page_18><loc_47><loc_94><loc_68><loc_96></location>Back cover</paragraph>
|
||||
<subtitle-level-1><location><page_18><loc_4><loc_82><loc_73><loc_91></location>Row and Column Access Control Support in IBM DB2 for i</subtitle-level-1>
<paragraph><location><page_18><loc_4><loc_66><loc_21><loc_69></location>Implement roles and separation of duties</paragraph>
<paragraph><location><page_18><loc_4><loc_59><loc_20><loc_64></location>Leverage row permissions on the database</paragraph>
<paragraph><location><page_18><loc_25><loc_59><loc_68><loc_69></location>This IBM Redpaper publication provides information about the IBM i 7.2 feature of IBM DB2 for i Row and Column Access Control (RCAC). It offers a broad description of the function and advantages of controlling access to data in a comprehensive and transparent way. This publication helps you understand the capabilities of RCAC and provides examples of defining, creating, and implementing the row permissions and column masks in a relational database environment.</paragraph>
<paragraph><location><page_18><loc_4><loc_52><loc_20><loc_57></location>Protect columns by defining column masks</paragraph>
<paragraph><location><page_18><loc_25><loc_51><loc_68><loc_58></location>This paper is intended for database engineers, data-centric application developers, and security officers who want to design and implement RCAC as a part of their data control and governance policy. A solid background in IBM i object level security, DB2 for i relational database concepts, and SQL is assumed.</paragraph>
<figure>
<location><page_18><loc_79><loc_93><loc_93><loc_97></location>
</figure>
File diff suppressed because one or more lines are too long
@ -2,7 +2,7 @@ Front cover
|
||||
|
||||
<!-- image -->
|
||||
|
||||
## Row and Column Access Control Support in IBM DB2 for i
|
||||
## RowandColumnAccessControl Support in IBM DB2 for i
|
||||
|
||||
<!-- image -->
|
||||
|
||||
@ -20,7 +20,7 @@ Solution Brief IBM Systems Lab Services and Training
|
||||
|
||||
- /g115/g3 /g40/g81/g75/g68/g81/g70/g72/g3 /g87/g75/g72/g3 /g83/g72/g85/g73/g82/g85/g80/g68/g81/g70/g72/g3 /g82/g73/g3 /g92/g82/g88/g85/g3 /g71/g68/g87/g68/g69/g68/g86/g72/g3 /g82/g83/g72/g85/g68/g87/g76/g82/g81/g86
|
||||
|
||||
- /g115/g3 /g40/g68/g85/g81/g3 /g74/g85/g72/g68/g87/g72/g85/g3 /g85 /g72/g87/g88/g85/g81/g3 /g82/g81/g3 /g44/g55/g3 /g83/g85 /g82/g77/g72/g70/g87/g86/g3 /g87/g75/g85 /g82/g88/g74/g75/g3 /g80/g82/g71/g72/g85/g81/g76/g93/g68/g87/g76/g82/g81/g3 /g82/g73/g3 /g71/g68/g87/g68/g69/g68/g86/g72/g3 /g68/g81/g71/g3 /g68/g83/g83/g79/g76/g70/g68/g87/g76/g82/g81/g86
|
||||
- /g115/g3 /g40/g68/g85/g81/g3 /g74/g85/g72/g68/g87/g72/g85/g3 /g85 /g72/g87/g88/g85/g81/g3 /g82/g81/g3 /g44/g55/g3 /g83/g85/g82/g77/g72/g70/g87/g86/g3 /g87/g75/g85 /g82/g88/g74/g75/g3 /g80/g82/g71/g72/g85/g81/g76/g93/g68/g87/g76/g82/g81/g3 /g82/g73/g3 /g71/g68/g87/g68/g69/g68/g86/g72/g3 /g68/g81/g71/g3 /g68/g83/g83/g79/g76/g70/g68/g87/g76/g82/g81/g86
|
||||
|
||||
- /g115/g3 /g53/g72/g79/g92/g3 /g82/g81/g3 /g44/g37/g48/g3 /g72/g91/g83/g72/g85/g87/g3 /g70/g82/g81/g86/g88/g79/g87/g76/g81/g74/g15/g3 /g86/g78/g76/g79/g79/g86/g3 /g86/g75/g68/g85/g76/g81/g74/g3 /g68/g81/g71/g3 /g85/g72/g81/g82/g90/g81/g3 /g86/g72/g85/g89/g76/g70/g72/g86
|
||||
|
||||
@ -52,7 +52,7 @@ Global CoE engagements cover topics including:
|
||||
|
||||
- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> Business intelligence and analytics
|
||||
|
||||
- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> DB2 Web Query
|
||||
- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> DB2Web Query
|
||||
|
||||
- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> Query/400 modernization for better reporting and analysis capabilities
|
||||
|
||||
@ -90,7 +90,7 @@ Chapter 1.
|
||||
|
||||
## Securing and protecting IBM DB2 data
|
||||
|
||||
Recent news headlines are filled with reports of data breaches and cyber-attacks impacting global businesses of all sizes. The Identity Theft Resource Center 1 reports that almost 5000 data breaches have occurred since 2005, exposing over 600 million records of data. The financial cost of these data breaches is skyrocketing. Studies from the Ponemon Institute 2 revealed that the average cost of a data breach increased in 2013 by 15% globally and resulted in a brand equity loss of $9.4 million per attack. The average cost that is incurred for each lost record containing sensitive information increased more than 9% to $145 per record.
|
||||
|
||||
Businesses must make a serious effort to secure their data and recognize that securing information assets is a cost of doing business. In many parts of the world and in many industries, securing the data is required by law and subject to audits. Data security is no longer an option; it is a requirement.
|
||||
|
||||
@ -102,7 +102,7 @@ This chapter describes how you can secure and protect data in DB2 for i. The fol
|
||||
|
||||
- /SM590000 DB2 for i security controls
|
||||
|
||||
## 1.1 Security fundamentals
|
||||
|
||||
Before reviewing database security techniques, there are two fundamental steps in securing information assets that must be described:
|
||||
|
||||
@ -116,7 +116,7 @@ A security policy is what defines whether the system and its settings are secure
|
||||
|
||||
With your eyes now open to the importance of securing information assets, the rest of this chapter reviews the methods that are available for securing database resources on IBM i.
|
||||
|
||||
## 1.2 Current state of IBM i security
|
||||
|
||||
Because of the inherently secure nature of IBM i, many clients rely on the default system settings to protect their business data that is stored in DB2 for i. In most cases, this means no data protection because the default setting for the Create default public authority (QCRTAUT) system value is *CHANGE.
|
||||
|
||||
@ -126,7 +126,7 @@ Traditionally, IBM i applications have employed menu-based security to counterac
|
||||
|
||||
Many businesses are trying to limit data access to a need-to-know basis. This security goal means that users should be given access only to the minimum set of data that is required to perform their job. Often, users with object-level access are given access to row and column values that are beyond what their business task requires because that object-level security provides an all-or-nothing solution. For example, object-level controls allow a manager to access data about all employees. Most security policies limit a manager to accessing data only for the employees that they manage.
|
||||
|
||||
## 1.3.1 Existing row and column control
|
||||
|
||||
Some IBM i clients have tried augmenting the all-or-nothing object-level security with SQL views (or logical files) and application logic, as shown in Figure 1-2. However, application-based logic is easy to bypass with all of the different data access interfaces that are provided by the IBM i operating system, such as Open Database Connectivity (ODBC) and System i Navigator.
|
||||
|
||||
@ -134,10 +134,10 @@ Using SQL views to limit access to a subset of the data in a table also has its
|
||||
|
||||
Even if you are willing to live with these performance and management issues, a user with *ALLOBJ access still can directly access all of the data in the underlying DB2 table and easily bypass the security controls that are built into an SQL view.
|
||||
|
||||
Figure 1-2 Existing row and column controls
|
||||
<!-- image -->
|
||||
|
||||
## 2.1.6 Change Function Usage CL command
|
||||
|
||||
The following CL commands can be used to work with, display, or change function usage IDs:
|
||||
|
||||
@ -151,24 +151,26 @@ For example, the following CHGFCNUSG command shows granting authorization to use
|
||||
|
||||
CHGFCNUSG FCNID(QIBM_DB_SECADM) USER(HBEDOYA) USAGE(*ALLOWED)
|
||||
|
||||
## 2.1.7 Verifying function usage IDs for RCAC with the FUNCTION_USAGE view
|
||||
|
||||
The FUNCTION_USAGE view contains function usage configuration details. Table 2-1 describes the columns in the FUNCTION_USAGE view.
|
||||
|
||||
Table 2-1 FUNCTION_USAGE view
|
||||
|
||||
| Column name | Data type | Description |
|
||||
|---------------|-------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| FUNCTION_ID | VARCHAR(30) | ID of the function. |
|
||||
| USER_NAME | VARCHAR(10) | Name of the user profile that has a usage setting for this function. |
|
||||
| USAGE | VARCHAR(7) | Usage setting: /SM590000 ALLOWED: The user profile is allowed to use the function. /SM590000 DENIED: The user profile is not allowed to use the function. |
|
||||
| USER_TYPE | VARCHAR(5) | Type of user profile: /SM590000 USER: The user profile is a user. /SM590000 GROUP: The user profile is a group. |
|
||||
|
||||
To discover who has authorization to define and manage RCAC, you can use the query that is shown in Example 2-1.
|
||||
|
||||
Example 2-1 Query to determine who has authority to define and manage RCAC
|
||||
|
||||
SELECT function_id,

user_name,
@ -176,13 +178,19 @@ usage,
user_type

FROM function_usage

WHERE function_id='QIBM_DB_SECADM'

ORDER BY user_name;

## 2.2 Separation of duties
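A minimal follow-up sketch (assuming the view is also reachable as QSYS2.FUNCTION_USAGE, and reusing the HBEDOYA profile from the CHGFCNUSG example above) filters the same view for a single profile to confirm whether it holds the security administrator function:

SELECT function_id, user_name, usage
FROM qsys2.function_usage
WHERE function_id = 'QIBM_DB_SECADM'
AND user_name = 'HBEDOYA';

An empty result means that the profile has no explicit usage setting for QIBM_DB_SECADM.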
|
||||
Separation of duties helps businesses comply with industry regulations or organizational requirements and simplifies the management of authorities. Separation of duties is commonly used to prevent fraudulent activities or errors by a single person. It provides the ability for administrative functions to be divided across individuals without overlapping responsibilities, so that one user does not possess unlimited authority, such as with the *ALLOBJ authority.
|
||||
|
||||
@ -198,26 +206,26 @@ A preferred practice is that the RCAC administrator has the QIBM_DB_SECADM funct
|
||||
|
||||
Table 2-2 shows a comparison of the different function usage IDs and *JOBCTL authority to the different CL commands and DB2 for i tools.
|
||||
|
||||
Table 2-2 Comparison of the different function usage IDs and *JOBCTL authority
|
||||
|
||||
| User action | *JOBCTL | QIBM_DB_SECADM | QIBM_DB_SQLADM | QIBM_DB_SYSMON | No Authority |
|
||||
|--------------------------------------------------------------------------------|-----------|------------------|------------------|------------------|----------------|
|
||||
| SET CURRENT DEGREE (SQL statement) | X | | X | | |
|
||||
| CHGQRYA command targeting a different user's job | X | | X | | |
|
||||
| STRDBMON or ENDDBMON commands targeting a different user's job | X | | X | | |
|
||||
| STRDBMON or ENDDBMON commands targeting a job that matches the current user | X | | X | X | X |
|
||||
| QUSRJOBI() API format 900 or System i Navigator's SQL Details for Job | X | | X | X | |
|
||||
| Visual Explain within Run SQL scripts | X | | X | X | X |
|
||||
| Visual Explain outside of Run SQL scripts | X | | X | | |
|
||||
| ANALYZE PLAN CACHE procedure | X | | X | | |
|
||||
| DUMP PLAN CACHE procedure | X | | X | | |
|
||||
| MODIFY PLAN CACHE procedure | X | | X | | |
|
||||
| MODIFY PLAN CACHE PROPERTIES procedure (currently does not check authority) | X | | X | | |
|
||||
| CHANGE PLAN CACHE SIZE procedure (currently does not check authority) | X | | X | | |
|
||||
|
||||
The SQL CREATE PERMISSION statement that is shown in Figure 3-1 is used to define and initially enable or disable the row access rules.
|
||||
|
||||
Figure 3-1 CREATE PERMISSION SQL statement
|
||||
<!-- image -->
|
||||
|
||||
## Column mask
|
||||
@ -226,13 +234,13 @@ A column mask is a database object that manifests a column value access control
|
||||
|
||||
Table 3-1 summarizes these special registers and their values.
|
||||
|
||||
Table 3-1 Special registers and their corresponding values
|
||||
|
||||
| Special register | Corresponding value |
|
||||
|----------------------|---------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| USER or SESSION_USER | The effective user of the thread excluding adopted authority. |
|
||||
| CURRENT_USER | The effective user of the thread including adopted authority. When no adopted authority is present, this has the same value as USER. |
|
||||
| SYSTEM_USER | The authorization ID that initiated the connection. |
|
||||
|
||||
Figure 3-5 shows the difference in the special register values when an adopted authority is used:
|
||||
|
||||
@ -246,10 +254,10 @@ Figure 3-5 shows the difference in the special register values when an adopted a
|
||||
|
||||
- /SM590000 When proc1 ends, the session reverts to its original state with both USER and CURRENT USER having the value of ALICE.
|
||||
|
||||
Figure 3-5 Special registers and adopted authority
|
||||
<!-- image -->
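As a quick illustration of the registers themselves (a sketch only; the values returned depend on how the job was started and whether adopted authority is currently in effect), all three can be compared with a single statement:

VALUES (SESSION_USER, CURRENT_USER, SYSTEM_USER);

Issued from inside proc1 in the scenario above, SESSION_USER would still report ALICE, while CURRENT_USER reflects the profile whose authority is adopted.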
|
||||
|
||||
## 3.2.2 Built-in global variables
|
||||
|
||||
Built-in global variables are provided with the database manager and are used in SQL statements to retrieve scalar values that are associated with the variables.
|
||||
|
||||
@ -257,7 +265,7 @@ IBM DB2 for i supports nine different built-in global variables that are read on
|
||||
|
||||
Table 3-2 lists the nine built-in global variables.
|
||||
|
||||
Table 3-2 Built-in global variables
|
||||
|
||||
| Global variable | Type | Description |
|
||||
|-----------------------|--------------|----------------------------------------------------------------|
|
||||
@ -271,7 +279,7 @@ Table 3-2 Built-in global variables
|
||||
| ROUTINE_SPECIFIC_NAME | VARCHAR(128) | Name of the currently running routine |
|
||||
| ROUTINE_TYPE | CHAR(1) | Type of the currently running routine |
|
||||
|
||||
## 3.3 VERIFY_GROUP_FOR_USER function
|
||||
|
||||
The VERIFY_GROUP_FOR_USER function was added in IBM i 7.2. Although it is primarily intended for use with RCAC permissions and masks, it can be used in other SQL statements. The first parameter must be one of these three special registers: SESSION_USER, USER, or CURRENT_USER. The second and subsequent parameters are a list of user or group profiles. Each of these values must be 1 - 10 characters in length. These values are not validated for their existence, which means that you can specify the names of user profiles that do not exist without receiving any kind of error.
|
||||
|
||||
@ -285,13 +293,15 @@ Here is an example of using the VERIFY_GROUP_FOR_USER function:
|
||||
|
||||
- 3. If a user is connected to the server using user profile JANE, all of the following function invocations return a value of 1:
|
||||
|
||||
VERIFY_GROUP_FOR_USER (CURRENT_USER, 'MGR')
VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR')
VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR', 'STEVE')

The following function invocation returns a value of 0:

VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JUDY', 'TONY')
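Because the function is not limited to mask and permission bodies, it can also be called from an ordinary query. A minimal sketch (assuming an interactive SQL session; SYSIBM.SYSDUMMY1 is the standard one-row table):

SELECT CASE
WHEN VERIFY_GROUP_FOR_USER(SESSION_USER, 'HR', 'MGR') = 1 THEN 'member of HR or MGR'
ELSE 'not a member'
END AS group_check
FROM SYSIBM.SYSDUMMY1;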
|
||||
|
||||
RETURN
|
||||
|
||||
CASE
|
||||
|
||||
WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR', 'EMP' ) = 1 THEN EMPLOYEES . DATE_OF_BIRTH WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER = EMPLOYEES . USER_ID THEN EMPLOYEES . DATE_OF_BIRTH WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER <> EMPLOYEES . USER_ID THEN ( 9999 || '-' || MONTH ( EMPLOYEES . DATE_OF_BIRTH ) || '-' || DAY (EMPLOYEES.DATE_OF_BIRTH )) ELSE NULL END ENABLE ;
|
||||
|
||||
- 2. The other column to mask in this example is the TAX_ID information. In this example, the rules to enforce include the following ones:
|
||||
|
||||
@ -305,53 +315,65 @@ WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR', 'EMP' ) = 1 THEN EMPLOYEES . D
|
||||
|
||||
- To implement this column mask, run the SQL statement that is shown in Example 3-9.
|
||||
|
||||
CREATE MASK HR_SCHEMA.MASK_TAX_ID_ON_EMPLOYEES ON HR_SCHEMA.EMPLOYEES AS EMPLOYEES FOR COLUMN TAX_ID RETURN CASE WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR' ) = 1 THEN EMPLOYEES . TAX_ID WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER = EMPLOYEES . USER_ID THEN EMPLOYEES . TAX_ID WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER <> EMPLOYEES . USER_ID THEN ( 'XXX-XX-' CONCAT QSYS2 . SUBSTR ( EMPLOYEES . TAX_ID , 8 , 4 ) ) WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'EMP' ) = 1 THEN EMPLOYEES . TAX_ID ELSE 'XXX-XX-XXXX' END ENABLE ;
Example 3-9 Creating a mask on the TAX_ID column
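To see the effect of this mask (illustrative only, and assuming RCAC has been activated on the table as described in the next section), a manager can select the column directly; USER_ID and TAX_ID are both columns referenced by the mask definition above:

SELECT user_id, tax_id FROM hr_schema.employees;

For rows where SESSION_USER does not match USER_ID, the manager sees values of the form 'XXX-XX-nnnn' (only the last four characters of TAX_ID), while HR users see the full value.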
|
||||
- 3. Figure 3-10 shows the masks that are created in the HR_SCHEMA.
|
||||
|
||||
Figure 3-10 Column masks shown in System i Navigator
|
||||
<!-- image -->
|
||||
|
||||
## 3.6.6 Activating RCAC
|
||||
|
||||
Now that you have created the row permission and the two column masks, RCAC must be activated. The row permission and the two column masks are enabled (last clause in the scripts), but now you must activate RCAC on the table. To do so, complete the following steps:
|
||||
|
||||
- 1. Run the SQL statements that are shown in Example 3-10.
|
||||
|
||||
## Example 3-10 Activating RCAC on the EMPLOYEES table

- /* Active Row Access Control (permissions) */
- /* Active Column Access Control (masks) */

ALTER TABLE HR_SCHEMA.EMPLOYEES

ACTIVATE ROW ACCESS CONTROL

ACTIVATE COLUMN ACCESS CONTROL;
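After the ALTER TABLE completes, the controls can be confirmed from the catalog. This is a sketch under the assumption that the QSYS2.SYSCONTROLS catalog view is available on this release and can be filtered by the table's schema:

SELECT * FROM qsys2.syscontrols WHERE table_schema = 'HR_SCHEMA';

Deactivation mirrors the activation syntax: ALTER TABLE HR_SCHEMA.EMPLOYEES DEACTIVATE ROW ACCESS CONTROL DEACTIVATE COLUMN ACCESS CONTROL;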
- 2. Look at the definition of the EMPLOYEE table, as shown in Figure 3-11. To do this, from the main navigation pane of System i Navigator, click Schemas HR_SCHEMA Tables , right-click the EMPLOYEES table, and click Definition .
|
||||
|
||||
Figure 3-11 Selecting the EMPLOYEES table from System i Navigator
|
||||
<!-- image -->
|
||||
|
|
||||
- 2. Figure 4-68 shows the Visual Explain of the same SQL statement, but with RCAC enabled. It is clear that the implementation of the SQL statement is more complex because the row permission rule becomes part of the WHERE clause.
|
||||
|
||||
- 3. Compare the advised indexes that are provided by the Optimizer without RCAC and with RCAC enabled. Figure 4-69 shows the index advice for the SQL statement without RCAC enabled. The index being advised is for the ORDER BY clause.
|
||||
|
||||
Figure 4-68 Visual Explain with RCAC enabled
|
||||
<!-- image -->
|
||||
|
||||
Figure 4-69 Index advice with no RCAC
|
||||
<!-- image -->
|
||||
|
||||
THEN C . CUSTOMER_TAX_ID WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'TELLER' ) = 1 THEN ( 'XXX-XX-' CONCAT QSYS2 . SUBSTR ( C . CUSTOMER_TAX_ID , 8 , 4 ) ) WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_TAX_ID ELSE 'XXX-XX-XXXX' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_DRIVERS_LICENSE_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_DRIVERS_LICENSE_NUMBER RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'TELLER' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER ELSE '*************' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_LOGIN_ID_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_LOGIN_ID RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_LOGIN_ID WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_LOGIN_ID ELSE '*****' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_SECURITY_QUESTION_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_SECURITY_QUESTION RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION ELSE '*****' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_SECURITY_QUESTION_ANSWER_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_SECURITY_QUESTION_ANSWER RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION_ANSWER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION_ANSWER ELSE '*****' END ENABLE ; ALTER TABLE BANK_SCHEMA.CUSTOMERS ACTIVATE ROW ACCESS CONTROL ACTIVATE COLUMN ACCESS CONTROL ;
|
||||
Back cover
|
||||
|
||||
## Row and Column Access Control Support in IBM DB2 for i
|
||||
## RowandColumnAccessControl Support in IBM DB2 for i
|
||||
|
||||
Implement roles and separation of duties
|
||||
|
||||
|
File diff suppressed because one or more lines are too long
@ -1,7 +1,7 @@
|
||||
<document>
|
||||
<subtitle-level-1><location><page_1><loc_37><loc_89><loc_85><loc_90></location>تحسين الإنتاجية وحل المشكلات من خلال البرمجة بلغة R و Python</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_15><loc_80><loc_85><loc_87></location>تعتبر البرمجة بلغة R و Python من الأدوات القوية التي يمكن أن تعزز الإنتاجية وتساعد في إيجاد حلول فعالة للمشكلات. يمتلك كل من R و Python ميزات فريدة تجعلها مثالية لتحليل البيانات، مما يسهل على المحللين والعلماء إجراء تحليلات معقدة بطريقة سريعة وفعالة. إذا كان لديك عقلية تحليلية، فإن استخدام هذه اللغات يمكن أن يسهم بشكل كبير في تحسين نتائج العمل .</paragraph>
|
||||
<paragraph><location><page_1><loc_16><loc_72><loc_85><loc_78></location>عندما يجتمع التفكير التحليلي مع مهارات البرمجة، يصبح من الممكن معالجة كميات هائلة من البيانات واستخراج الأنماط والتوجهات منها. يمكن للمبرمجين استخدام R و Python لتنفيذ عمليات تحليلية متقدمة، مثل النمذجة الإحصائية وتحليل البيانات الكبيرة. هذا ليس فقط يوفر الوقت، بل يمكن أن يؤدي أيضًا إلى اتخاذ قرارات أكثر دقة بناء ً على استنتاجات قائمة على البيانات .</paragraph>
|
||||
<paragraph><location><page_1><loc_15><loc_63><loc_85><loc_69></location>علاوة على ذلك، توفر كل من R و Python مكتبات وأدوات غنية تدعم مجموعة واسعة م ن التطبيقات، من التحليل البياني إلى التعلم الآلي. يمكن للمستخدمين الاستفادة من هذه المكتبات لتطوير حلول مبتكرة للمشكلات المختلفة. على سبيل المثال، يمكن استخدام مكتبة pandas في Python لإدارة البيانات بكفاءة، بينما توفر R أدوات قوية للرسم البياني والتحليل الإ حصائي، مما يجعلها مثالية للباحثين والمحللين .</paragraph>
|
||||
<paragraph><location><page_1><loc_17><loc_72><loc_85><loc_78></location>عندما يجتمع التفكير التحليلي مع مهارات البرمجة، يصبح من الممكن معالجة كميات هائلة من البيانات واستخراج الأنماط والتوجهات منها. يمكن للمبرمجين استخدام R و Python لتنفيذ عمليات تحليلية متقدمة، مثل النمذجة الإحصائية وتحليل البيانات الكبيرة. هذا ليس فقط يوفر الوقت، بل يمكن أن يؤدي أيضًا إلى اتخاذ قرارات أكثر دقة بناء ً على استنتاجات قائمة على البيانات .</paragraph>
|
||||
<paragraph><location><page_1><loc_16><loc_63><loc_85><loc_69></location>علاوة على ذلك، توفر كل من R و Python مكتبات وأدوات غنية تدعم مجموعة واسعة م ن التطبيقات، من التحليل البياني إلى التعلم الآلي. يمكن للمستخدمين الاستفادة من هذه المكتبات لتطوير حلول مبتكرة للمشكلات المختلفة. على سبيل المثال، يمكن استخدام مكتبة pandas في Python لإدارة البيانات بكفاءة، بينما توفر R أدوات قوية للرسم الإ البياني والتحليل حصائي، مما يجعلها مثالية للباحثين والمحللين .</paragraph>
|
||||
<paragraph><location><page_1><loc_16><loc_56><loc_85><loc_61></location>في النهاية، يمكن أن تؤدي البرمجة بلغة R و Python مع عقلية تحليلية إلى تحسين الإنتاجية وتوفير حلول مبتكرة للمشكلات المعقدة. إن القدرة على تحليل البيانات بشكل فعال وتطبيق الأساليب البرمجية المناسبة يمكن أن تكون له ا تأثيرات إيجابية بعيدة المدى على الأداء الشخصي والمهني .</paragraph>
|
||||
</document>
|
File diff suppressed because one or more lines are too long
@ -2,8 +2,8 @@
|
||||
|
||||
تعتبر البرمجة بلغة R و Python من الأدوات القوية التي يمكن أن تعزز الإنتاجية وتساعد في إيجاد حلول فعالة للمشكلات. يمتلك كل من R و Python ميزات فريدة تجعلها مثالية لتحليل البيانات، مما يسهل على المحللين والعلماء إجراء تحليلات معقدة بطريقة سريعة وفعالة. إذا كان لديك عقلية تحليلية، فإن استخدام هذه اللغات يمكن أن يسهم بشكل كبير في تحسين نتائج العمل .
|
||||
|
||||
عندما يجتمع التفكير التحليلي مع مهارات البرمجة، يصبح من الممكن معالجة كميات هائلة من البيانات واستخراج الأنماط والتوجهات منها. يمكن للمبرمجين استخدام R و Python لتنفيذ عمليات تحليلية متقدمة، مثل النمذجة الإحصائية وتحليل البيانات الكبيرة. هذا ليس فقط يوفر الوقت، بل يمكن أن يؤدي أيضًا إلى اتخاذ قرارات أكثر دقة بناء ً على استنتاجات قائمة على البيانات .
|
||||
عندما يجتمع التفكير التحليلي مع مهارات البرمجة، يصبح من الممكن معالجة كميات هائلة من البيانات واستخراج الأنماط والتوجهات منها. يمكن للمبرمجين استخدام R و Python لتنفيذ عمليات تحليلية متقدمة، مثل النمذجة الإحصائية وتحليل البيانات الكبيرة. هذا ليس فقط يوفر الوقت، بل يمكن أن يؤدي أيضًا إلى اتخاذ قرارات أكثر دقة بناء ً على استنتاجات قائمة على البيانات .
|
||||
|
||||
علاوة على ذلك، توفر كل من R و Python مكتبات وأدوات غنية تدعم مجموعة واسعة م ن التطبيقات، من التحليل البياني إلى التعلم الآلي. يمكن للمستخدمين الاستفادة من هذه المكتبات لتطوير حلول مبتكرة للمشكلات المختلفة. على سبيل المثال، يمكن استخدام مكتبة pandas في Python لإدارة البيانات بكفاءة، بينما توفر R أدوات قوية للرسم البياني والتحليل الإ حصائي، مما يجعلها مثالية للباحثين والمحللين .
|
||||
علاوة على ذلك، توفر كل من R و Python مكتبات وأدوات غنية تدعم مجموعة واسعة م ن التطبيقات، من التحليل البياني إلى التعلم الآلي. يمكن للمستخدمين الاستفادة من هذه المكتبات لتطوير حلول مبتكرة للمشكلات المختلفة. على سبيل المثال، يمكن استخدام مكتبة pandas في Python لإدارة البيانات بكفاءة، بينما توفر R أدوات قوية للرسم الإ البياني والتحليل حصائي، مما يجعلها مثالية للباحثين والمحللين .
|
||||
|
||||
في النهاية، يمكن أن تؤدي البرمجة بلغة R و Python مع عقلية تحليلية إلى تحسين الإنتاجية وتوفير حلول مبتكرة للمشكلات المعقدة. إن القدرة على تحليل البيانات بشكل فعال وتطبيق الأساليب البرمجية المناسبة يمكن أن تكون له ا تأثيرات إيجابية بعيدة المدى على الأداء الشخصي والمهني .
|
File diff suppressed because one or more lines are too long
@ -1,9 +1,9 @@
|
||||
<document>
|
||||
<paragraph><location><page_1><loc_8><loc_3><loc_10><loc_4></location>11</paragraph>
|
||||
<paragraph><location><page_1><loc_11><loc_50><loc_73><loc_75></location>وعليه، فإن الحكومة المصرية تضع صوو عييهاوخ لوال المر اوة الم باوة تكايف السيد رئيس الجماورية لاخ بخلعمل ياى تح يق يدد من الأ هودا ياى رعساخ : وضع ماف بهخء الإنسخن المصري ياى رعس قخئموة الأولويو خت، لخصووة فووا مجووخحت الصووحة واللعاوويل، العموول ياووى تح يووق معوودحت نمووو قويوووة ومسووولدامة و وووخماة فوووا عذاوووف ال لخيوووخت، و ووو ا الح وووخ ياوووى محوددات الأمون ال ووما المصوري فوا ضووء اللحوديخت الإقايميوة والدوليو ة، ومواصوواة واووود تلوووير الماووخر ة السيخسووية، واسوولمرار ملخبعووة ما و و خت الأمووون واحسووول رار ومكخفحوووة الإرهوووخ ، تلووووير ما وووخت ال خفوووة والوووويا الوووو،ها، والبلوووخ الوووديها المعلووودل ياوووى الهحوووو الووو ي يرسووو م وووخهيل الموا،هة والسام المجلمعا .</paragraph>
|
||||
<paragraph><location><page_1><loc_13><loc_45><loc_74><loc_48></location>ووف ًخ لمخ سبق، يسولاد برنوخما الحكوموة المصورية لوال ال لور ( 2024 -2026 ) تح يق عربعة عهدا اسلراتيجية رئيسة، وها ياى الهحو الآتا :</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_37><loc_73><loc_40></location>مايــــــــة اممــــــــن القومي المصـر بنــــاء ا نســــا المصــــــــــــــــــــر بنـــــاء ا تصـــــاع تنابســــــــــــــــــــــي تحقيق اظستق رار السياســــــــــــــــــــــــي</paragraph>
|
||||
<paragraph><location><page_1><loc_11><loc_23><loc_73><loc_31></location>تجدر الإ خر إلى عنه قد تل تحديد مسولادفخت البرنوخما بخحسولهخ د باوكل رئووويس ياوووى مسووولادفخت ر يوووة مصووور 2023 ، وتوصووويخت واسوووخت الحووووار الوو،ها، ومسولادفخت الوو ارات، والبرنوخما الوو،ها ليصوا خت الايك ايوة، ومبلاف احسلراتيجيخت الو،هية .</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_50><loc_73><loc_75></location>وعليه، اوة الم عييهاوخ لوال المر فإن الحكومة المصرية تضع صوو باوة يق يدد من الأ تكايف السيد رئيس الجماورية لاخ بخلعمل ياى تح هودا ياى رعساخ : وضع ماف بهخء الإنسخن المصري ياى رعس قخئموة الأولويو خت، يووق معوودحت نمووو لخصووة فووا مجووخحت الصووحة واللعاوويل، العموول ياووى تح ياوووى وووخ ا الح ووو لخيوووخت، و وووخماة فوووا عذاوووف ال قويوووة ومسووولدامة و ووما المصوري فوا ضووء اللحوديخت الإقايميوة والدوليو محوددات الأمون ال ة، و و ة السيخسووية، واسوولمرار ملخبعووة ما ومواصوواة واووود تلوووير الماووخر خت خفوووة والوووويا وووخت ال ، تلووووير ما رار ومكخفحوووة الإرهوووخ الأمووون واحسووول وووخهيل م ي يرسووو الووو الوووديها المعلووودل ياوووى الهحوووو الوووو،ها، والبلوووخ الموا،هة والسام المجلمعا .</paragraph>
|
||||
<paragraph><location><page_1><loc_13><loc_45><loc_74><loc_48></location>لور برنوخما الحكوموة المصورية لوال ال ًخ لمخ سبق، يسولاد ووف ( 2024 -2026 ) اسلراتيجية رئيسة، وها ياى الهحو الآتا يق عربعة عهدا تح :</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_37><loc_73><loc_40></location>مايــــــــة اممــــــــن القومي المصـر نسـ ـــا بنــــاء ا المصـ ـــــــــــــــــــر تصـــــاع بنـــــاء ا تنابســــــــــــــــــــــي تحقيق اظستق رار السياســــــــــــــــــــــــي</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_23><loc_73><loc_31></location>إلى عنه قد تل تحديد مسولادفخت البرنوخما بخحسولهخ خر تجدر الإ د باوكل يوووة مصووور رئووويس ياوووى مسووولادفخت ر 2023 ، وتوصووويخت واسوووخت الحووووار خت الايك الوو،ها، ومسولادفخت الوو ارات، والبرنوخما الوو،ها ليصوا ايوة، ومبلاف احسلراتيجيخت الو،هية .</paragraph>
|
||||
<figure>
|
||||
<location><page_1><loc_75><loc_23><loc_100><loc_76></location>
|
||||
</figure>
|
||||
|
File diff suppressed because one or more lines are too long
@ -1,11 +1,11 @@
|
||||
11
|
||||
|
||||
وعليه، فإن الحكومة المصرية تضع صوو عييهاوخ لوال المر اوة الم باوة تكايف السيد رئيس الجماورية لاخ بخلعمل ياى تح يق يدد من الأ هودا ياى رعساخ : وضع ماف بهخء الإنسخن المصري ياى رعس قخئموة الأولويو خت، لخصووة فووا مجووخحت الصووحة واللعاوويل، العموول ياووى تح يووق معوودحت نمووو قويوووة ومسووولدامة و وووخماة فوووا عذاوووف ال لخيوووخت، و ووو ا الح وووخ ياوووى محوددات الأمون ال ووما المصوري فوا ضووء اللحوديخت الإقايميوة والدوليو ة، ومواصوواة واووود تلوووير الماووخر ة السيخسووية، واسوولمرار ملخبعووة ما و و خت الأمووون واحسووول رار ومكخفحوووة الإرهوووخ ، تلووووير ما وووخت ال خفوووة والوووويا الوووو،ها، والبلوووخ الوووديها المعلووودل ياوووى الهحوووو الووو ي يرسووو م وووخهيل الموا،هة والسام المجلمعا .
|
||||
وعليه، اوة الم عييهاوخ لوال المر فإن الحكومة المصرية تضع صوو باوة يق يدد من الأ تكايف السيد رئيس الجماورية لاخ بخلعمل ياى تح هودا ياى رعساخ : وضع ماف بهخء الإنسخن المصري ياى رعس قخئموة الأولويو خت، يووق معوودحت نمووو لخصووة فووا مجووخحت الصووحة واللعاوويل، العموول ياووى تح ياوووى وووخ ا الح ووو لخيوووخت، و وووخماة فوووا عذاوووف ال قويوووة ومسووولدامة و ووما المصوري فوا ضووء اللحوديخت الإقايميوة والدوليو محوددات الأمون ال ة، و و ة السيخسووية، واسوولمرار ملخبعووة ما ومواصوواة واووود تلوووير الماووخر خت خفوووة والوووويا وووخت ال ، تلووووير ما رار ومكخفحوووة الإرهوووخ الأمووون واحسووول وووخهيل م ي يرسووو الووو الوووديها المعلووودل ياوووى الهحوووو الوووو،ها، والبلوووخ الموا،هة والسام المجلمعا .
|
||||
|
||||
ووف ًخ لمخ سبق، يسولاد برنوخما الحكوموة المصورية لوال ال لور ( 2024 -2026 ) تح يق عربعة عهدا اسلراتيجية رئيسة، وها ياى الهحو الآتا :
|
||||
لور برنوخما الحكوموة المصورية لوال ال ًخ لمخ سبق، يسولاد ووف ( 2024 -2026 ) اسلراتيجية رئيسة، وها ياى الهحو الآتا يق عربعة عهدا تح :
|
||||
|
||||
مايــــــــة اممــــــــن القومي المصـر بنــــاء ا نســــا المصــــــــــــــــــــر بنـــــاء ا تصـــــاع تنابســــــــــــــــــــــي تحقيق اظستق رار السياســــــــــــــــــــــــي
|
||||
مايــــــــة اممــــــــن القومي المصـر نسـ ـــا بنــــاء ا المصـ ـــــــــــــــــــر تصـــــاع بنـــــاء ا تنابســــــــــــــــــــــي تحقيق اظستق رار السياســــــــــــــــــــــــي
|
||||
|
||||
تجدر الإ خر إلى عنه قد تل تحديد مسولادفخت البرنوخما بخحسولهخ د باوكل رئووويس ياوووى مسووولادفخت ر يوووة مصووور 2023 ، وتوصووويخت واسوووخت الحووووار الوو،ها، ومسولادفخت الوو ارات، والبرنوخما الوو،ها ليصوا خت الايك ايوة، ومبلاف احسلراتيجيخت الو،هية .
|
||||
إلى عنه قد تل تحديد مسولادفخت البرنوخما بخحسولهخ خر تجدر الإ د باوكل يوووة مصووور رئووويس ياوووى مسووولادفخت ر 2023 ، وتوصووويخت واسوووخت الحووووار خت الايك الوو،ها، ومسولادفخت الوو ارات، والبرنوخما الوو،ها ليصوا ايوة، ومبلاف احسلراتيجيخت الو،هية .
|
||||
|
||||
<!-- image -->
|
File diff suppressed because one or more lines are too long
@ -5,28 +5,28 @@
|
||||
</figure>
|
||||
<subtitle-level-1><location><page_1><loc_63><loc_81><loc_81><loc_84></location>2-5 -استاندارد ک الا</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_77><loc_79><loc_87><loc_81></location>نام استاندارد</paragraph>
|
||||
<paragraph><location><page_1><loc_11><loc_75><loc_44><loc_81></location>شمشه و شمشال توليد شده به روش ريخته گری پيوسته مورد مصرف در فولادهای سازه ای - مطابق آناليز پيوست</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_75><loc_44><loc_81></location>ريخته گری به روش شده توليد و شمشال شمشه پيوسته مورد مصرف سازه ای فولادهای در - مطابق آناليز پيوست</paragraph>
|
||||
<paragraph><location><page_1><loc_71><loc_72><loc_87><loc_74></location>شماره استاندارد ملی</paragraph>
|
||||
<paragraph><location><page_1><loc_40><loc_73><loc_45><loc_74></location>20300</paragraph>
|
||||
<paragraph><location><page_1><loc_68><loc_70><loc_87><loc_72></location>استاندارد اجباری است؟</paragraph>
|
||||
<paragraph><location><page_1><loc_65><loc_67><loc_87><loc_69></location>مرجع صادرکننده استاندارد</paragraph>
|
||||
<paragraph><location><page_1><loc_28><loc_67><loc_44><loc_69></location>سازمان ملی استاندارد ايران</paragraph>
|
||||
<paragraph><location><page_1><loc_49><loc_62><loc_87><loc_66></location>آيا توليدکننده محصول، استاندارد مذکور را اخذ نموده است؟</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_62><loc_87><loc_66></location>آيا توليدکننده محصول، استاندارد مذکور را اخذ نموده است؟</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_69><loc_56><loc_85><loc_58></location>3 -پذيرش در بورس</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_68><loc_54><loc_83><loc_56></location>تاريخ ارائه مدارک</paragraph>
|
||||
<paragraph><location><page_1><loc_69><loc_54><loc_83><loc_56></location>تاريخ ارائه مدارک</paragraph>
|
||||
<paragraph><location><page_1><loc_23><loc_54><loc_32><loc_56></location>19 / 09 / 1403</paragraph>
|
||||
<paragraph><location><page_1><loc_72><loc_51><loc_83><loc_53></location>تاريخ پذيرش</paragraph>
|
||||
<paragraph><location><page_1><loc_23><loc_51><loc_32><loc_53></location>04 / 10 / 1403</paragraph>
|
||||
<paragraph><location><page_1><loc_62><loc_48><loc_83><loc_50></location>شماره جلسه کميته عرضه</paragraph>
|
||||
<paragraph><location><page_1><loc_26><loc_49><loc_29><loc_50></location>436</paragraph>
|
||||
<paragraph><location><page_1><loc_67><loc_45><loc_83><loc_47></location>تاريخ درج اميدنامه</paragraph>
|
||||
<paragraph><location><page_1><loc_68><loc_45><loc_83><loc_47></location>تاريخ درج اميدنامه</paragraph>
|
||||
<paragraph><location><page_1><loc_23><loc_46><loc_32><loc_48></location>05 / 10 / 1403</paragraph>
|
||||
<paragraph><location><page_1><loc_71><loc_43><loc_83><loc_45></location>مشاور پذيرش</paragraph>
|
||||
<paragraph><location><page_1><loc_72><loc_43><loc_83><loc_45></location>مشاور پذيرش</paragraph>
|
||||
<paragraph><location><page_1><loc_21><loc_43><loc_34><loc_45></location>کارگزاری آ رمون بورس</paragraph>
|
||||
<paragraph><location><page_1><loc_47><loc_37><loc_83><loc_42></location>نحوة تعيين قيمت پايه پس از پذيرش کالا در بورس</paragraph>
|
||||
<paragraph><location><page_1><loc_18><loc_40><loc_36><loc_42></location>بر اساس قيمت های جهانی</paragraph>
|
||||
<paragraph><location><page_1><loc_48><loc_37><loc_83><loc_42></location>نحوة تعيين قيمت پايهپس از پذيرش کالا در بورس</paragraph>
|
||||
<paragraph><location><page_1><loc_19><loc_40><loc_36><loc_42></location>بر اساس قيمت های جهانی</paragraph>
|
||||
<paragraph><location><page_1><loc_45><loc_32><loc_83><loc_37></location>حداقل درصد عرضه از توليد / کل فروش / فروش داخلی</paragraph>
|
||||
<paragraph><location><page_1><loc_14><loc_35><loc_40><loc_37></location>حداقل 50 % از توليد ساليانه يا 47.500 تن</paragraph>
|
||||
<paragraph><location><page_1><loc_14><loc_35><loc_40><loc_37></location>حداقل 50 % يا از توليد ساليانه 47.500 تن</paragraph>
|
||||
<paragraph><location><page_1><loc_68><loc_29><loc_83><loc_31></location>خطای مجاز تحويل</paragraph>
|
||||
<paragraph><location><page_1><loc_18><loc_30><loc_37><loc_31></location>5% آخرين محموله قابل تحويل</paragraph>
|
||||
</document>
|
File diff suppressed because one or more lines are too long
@ -6,7 +6,7 @@
|
||||
|
||||
نام استاندارد
|
||||
|
||||
شمشه و شمشال توليد شده به روش ريخته گری پيوسته مورد مصرف در فولادهای سازه ای - مطابق آناليز پيوست
|
||||
ريخته گری به روش شده توليد و شمشال شمشه پيوسته مورد مصرف سازه ای فولادهای در - مطابق آناليز پيوست
|
||||
|
||||
شماره استاندارد ملی
|
||||
|
||||
@ -42,13 +42,13 @@
|
||||
|
||||
کارگزاری آ رمون بورس
|
||||
|
||||
نحوة تعيين قيمت پايه پس از پذيرش کالا در بورس
|
||||
نحوة تعيين قيمت پايهپس از پذيرش کالا در بورس
|
||||
|
||||
بر اساس قيمت های جهانی
|
||||
|
||||
حداقل درصد عرضه از توليد / کل فروش / فروش داخلی
|
||||
|
||||
حداقل 50 % از توليد ساليانه يا 47.500 تن
|
||||
حداقل 50 % يا از توليد ساليانه 47.500 تن
|
||||
|
||||
خطای مجاز تحويل
|
||||
|
||||
|
File diff suppressed because one or more lines are too long
@ -1,4 +1,5 @@
|
||||
<doctag><page_header><loc_15><loc_101><loc_30><loc_354>arXiv:2203.01017v2 [cs.CV] 11 Mar 2022</page_header>
|
||||
<doctag><page_header><loc_15><loc_133><loc_30><loc_354>arXiv:2203.01017v2 [cs.CV] 11 Mar</page_header>
|
||||
<text><loc_15><loc_101><loc_30><loc_126>2022</text>
|
||||
<section_header_level_1><loc_79><loc_68><loc_408><loc_76>TableFormer: Table Structure Understanding with Transformers.</section_header_level_1>
|
||||
<section_header_level_1><loc_116><loc_93><loc_370><loc_108>Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar IBM Research</section_header_level_1>
|
||||
<text><loc_170><loc_111><loc_309><loc_116>{ ahn,nli,mly,taa @zurich.ibm.com }</text>
|
||||
@ -30,7 +31,7 @@
|
||||
</unordered_list>
|
||||
<text><loc_41><loc_411><loc_234><loc_439>The paper is structured as follows. In Sec. 2, we give a brief overview of the current state-of-the-art. In Sec. 3, we describe the datasets on which we train. In Sec. 4, we introduce the TableFormer model-architecture and describe</text>
|
||||
<footnote><loc_50><loc_445><loc_150><loc_450>1 https://github.com/IBM/SynthTabNet</footnote>
|
||||
<text><loc_252><loc_48><loc_445><loc_68>its results & performance in Sec. 5. As a conclusion, we describe how this new model-architecture can be re-purposed for other tasks in the computer-vision community.</text>
|
||||
<text><loc_252><loc_48><loc_445><loc_68>its results &performance in Sec. 5. As a conclusion, we describe how this new model-architecture can be re-purposed for other tasks in the computer-vision community.</text>
|
||||
<section_header_level_1><loc_252><loc_77><loc_407><loc_84>2. Previous work and State of the Art</section_header_level_1>
|
||||
<text><loc_252><loc_90><loc_445><loc_209>Identifying the structure of a table has been an outstanding problem in the document-parsing community, that motivates many organised public challenges [6, 4, 14]. The difficulty of the problem can be attributed to a number of factors. First, there is a large variety in the shapes and sizes of tables. Such large variety requires a flexible method. This is especially true for complex column- and row headers, which can be extremely intricate and demanding. A second factor of complexity is the lack of data with regard to table-structure. Until the publication of PubTabNet [37], there were no large datasets (i.e. > 100 K tables) that provided structure information. This happens primarily due to the fact that tables are notoriously time-consuming to annotate by hand. However, this has definitely changed in recent years with the deliverance of PubTabNet [37], FinTabNet [36], TableBank [17] etc.</text>
|
||||
<text><loc_252><loc_211><loc_445><loc_284>Before the rising popularity of deep neural networks, the community relied heavily on heuristic and/or statistical methods to do table structure identification [3, 7, 11, 5, 13, 28]. Although such methods work well on constrained tables [12], a more data-driven approach can be applied due to the advent of convolutional neural networks (CNNs) and the availability of large datasets. To the best-of-our knowledge, there are currently two different types of network architecture that are being pursued for state-of-the-art tablestructure identification.</text>
|
||||
@ -45,7 +46,7 @@
|
||||
<text><loc_41><loc_415><loc_234><loc_450>We rely on large-scale datasets such as PubTabNet [37], FinTabNet [36], and TableBank [17] datasets to train and evaluate our models. These datasets span over various appearance styles and content. We also introduce our own synthetically generated SynthTabNet dataset to fix an im-</text>
|
||||
<picture><loc_255><loc_50><loc_450><loc_158><caption><loc_252><loc_169><loc_445><loc_182>Figure 2: Distribution of the tables across different table dimensions in PubTabNet + FinTabNet datasets</caption></picture>
|
||||
<text><loc_252><loc_201><loc_357><loc_206>balance in the previous datasets.</text>
|
||||
<text><loc_252><loc_209><loc_445><loc_396>The PubTabNet dataset contains 509k tables delivered as annotated PNG images. The annotations consist of the table structure represented in HTML format, the tokenized text and its bounding boxes per table cell. Fig. 1 shows the appearance style of PubTabNet. Depending on its complexity, a table is characterized as 'simple' when it does not contain row spans or column spans, otherwise it is 'complex'. The dataset is divided into Train and Val splits (roughly 98% and 2%). The Train split consists of 54% simple and 46% complex tables and the Val split of 51% and 49% respectively. The FinTabNet dataset contains 112k tables delivered as single-page PDF documents with mixed table structures and text content. Similarly to the PubTabNet, the annotations of FinTabNet include the table structure in HTML, the tokenized text and the bounding boxes on a table cell basis. The dataset is divided into Train, Test and Val splits (81%, 9.5%, 9.5%), and each one is almost equally divided into simple and complex tables (Train: 48% simple, 52% complex, Test: 48% simple, 52% complex, Test: 53% simple, 47% complex). Finally the TableBank dataset consists of 145k tables provided as JPEG images. The latter has annotations for the table structure, but only few with bounding boxes of the table cells. The entire dataset consists of simple tables and it is divided into 90% Train, 3% Test and 7% Val splits.</text>
|
||||
<text><loc_252><loc_209><loc_445><loc_396>The PubTabNet dataset contains 509k tables delivered as annotated PNGimages. The annotations consist of the table structure represented in HTML format, the tokenized text and its bounding boxes per table cell. Fig. 1 shows the appearance style of PubTabNet. Depending on its complexity, a table is characterized as 'simple' when it does not contain row spans or column spans, otherwise it is 'complex'. The dataset is divided into Train and Val splits (roughly 98%and 2%). The Train split consists of 54% simple and 46% complex tables and the Val split of 51% and 49% respectively. The FinTabNet dataset contains 112k tables delivered as single-page PDFdocuments with mixed table structures and text content. Similarly to the PubTabNet, the annotations of FinTabNet include the table structure in HTML, the tokenized text and the bounding boxes on a table cell basis. The dataset is divided into Train, Test and Val splits (81%, 9.5%, 9.5%), and each one is almost equally divided into simple and complex tables (Train: 48% simple, 52% complex, Test: 48% simple, 52% complex, Test: 53% simple, 47% complex). Finally the TableBank dataset consists of 145k tables provided as JPEG images. The latter has annotations for the table structure, but only few with bounding boxes of the table cells. The entire dataset consists of simple tables and it is divided into 90% Train, 3% Test and 7% Val splits.</text>
|
||||
<text><loc_252><loc_399><loc_445><loc_450>Due to the heterogeneity across the dataset formats, it was necessary to combine all available data into one homogenized dataset before we could train our models for practical purposes. Given the size of PubTabNet, we adopted its annotation format and we extracted and converted all tables as PNG images with a resolution of 72 dpi. Additionally, we have filtered out tables with extreme sizes due to small</text>
|
||||
<page_footer><loc_241><loc_464><loc_245><loc_469>3</page_footer>
|
||||
<page_break>
|
||||
@ -69,7 +70,7 @@
|
||||
<text><loc_252><loc_158><loc_445><loc_186>forming classification, and adding an adaptive pooling layer of size 28*28. ResNet by default downsamples the image resolution by 32 and then the encoded image is provided to both the Structure Decoder , and Cell BBox Decoder .</text>
|
||||
<text><loc_252><loc_188><loc_445><loc_261>Structure Decoder. The transformer architecture of this component is based on the work proposed in [31]. After extensive experimentation, the Structure Decoder is modeled as a transformer encoder with two encoder layers and a transformer decoder made from a stack of 4 decoder layers that comprise mainly of multi-head attention and feed forward layers. This configuration uses fewer layers and heads in comparison to networks applied to other problems (e.g. 'Scene Understanding', 'Image Captioning'), something which we relate to the simplicity of table images.</text>
|
||||
<text><loc_252><loc_263><loc_445><loc_344>The transformer encoder receives an encoded image from the CNN Backbone Network and refines it through a multi-head dot-product attention layer, followed by a Feed Forward Network. During training, the transformer decoder receives as input the output feature produced by the transformer encoder, and the tokenized input of the HTML ground-truth tags. Using a stack of multi-head attention layers, different aspects of the tag sequence could be inferred. This is achieved by each attention head on a layer operating in a different subspace, and then combining altogether their attention score.</text>
|
||||
<text><loc_252><loc_346><loc_445><loc_412>Cell BBox Decoder. Our architecture allows to simultaneously predict HTML tags and bounding boxes for each table cell without the need of a separate object detector end to end. This approach is inspired by DETR [1] which employs a Transformer Encoder, and Decoder that looks for a specific number of object queries (potential object detections). As our model utilizes a transformer architecture, the hidden state of the < td > ' and ' < ' HTML structure tags become the object query.</text>
|
||||
<text><loc_252><loc_346><loc_445><loc_412>Cell BBox Decoder. Our architecture allows to simultaneously predict HTML tags and bounding boxes for each table cell without the need of a separate object detector end to end. This approach is inspired by DETR [1] which employs a Transformer Encoder, and Decoder that looks for a specific number of object queries (potential object detections). As our model utilizes a transformer architecture, the hidden state of the < td > ' and ' < ' HTMLstructure tags become the object query.</text>
|
||||
<text><loc_252><loc_414><loc_445><loc_450>The encoding generated by the CNN Backbone Network along with the features acquired for every data cell from the Transformer Decoder are then passed to the attention network. The attention network takes both inputs and learns to provide an attention weighted encoding. This weighted at-</text>
|
||||
<page_footer><loc_241><loc_464><loc_245><loc_469>5</page_footer>
|
||||
<page_break>
|
||||
@ -119,7 +120,7 @@
|
||||
<picture><loc_177><loc_240><loc_307><loc_280><caption><loc_41><loc_203><loc_445><loc_231>Figure 5: One of the benefits of TableFormer is that it is language agnostic, as an example, the left part of the illustration demonstrates TableFormer predictions on previously unseen language (Japanese). Additionally, we see that TableFormer is robust to variability in style and content, right side of the illustration shows the example of the TableFormer prediction from the FinTabNet dataset.</caption></picture>
|
||||
<picture><loc_313><loc_241><loc_443><loc_280></picture>
|
||||
<section_header_level_1><loc_41><loc_310><loc_134><loc_316>5.5. Qualitative Analysis</section_header_level_1>
|
||||
<section_header_level_1><loc_252><loc_310><loc_377><loc_317>6. Future Work & Conclusion</section_header_level_1>
|
||||
<section_header_level_1><loc_252><loc_310><loc_377><loc_317>6. Future Work & Conclusion</section_header_level_1>
|
||||
<text><loc_41><loc_339><loc_234><loc_450>We showcase several visualizations for the different components of our network on various 'complex' tables within datasets presented in this work in Fig. 5 and Fig. 6 As it is shown, our model is able to predict bounding boxes for all table cells, even for the empty ones. Additionally, our post-processing techniques can extract the cell content by matching the predicted bounding boxes to the PDF cells based on their overlap and spatial proximity. The left part of Fig. 5 demonstrates also the adaptability of our method to any language, as it can successfully extract Japanese text, although the training set contains only English content. We provide more visualizations including the intermediate steps in the supplementary material. Overall these illustrations justify the versatility of our method across a diverse range of table appearances and content type.</text>
|
||||
<text><loc_252><loc_324><loc_445><loc_412>In this paper, we presented TableFormer an end-to-end transformer based approach to predict table structures and bounding boxes of cells from an image. This approach enables us to recreate the table structure, and extract the cell content from PDF or OCR by using bounding boxes. Additionally, it provides the versatility required in real-world scenarios when dealing with various types of PDF documents, and languages. Furthermore, our method outperforms all state-of-the-arts with a wide margin. Finally, we introduce 'SynthTabNet' a challenging synthetically generated dataset that reinforces missing characteristics from other datasets.</text>
|
||||
<section_header_level_1><loc_252><loc_424><loc_298><loc_431>References</section_header_level_1>
|
||||
@ -130,25 +131,25 @@
|
||||
<list_item><loc_57><loc_48><loc_234><loc_74>end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 , pages 213-229, Cham, 2020. Springer International Publishing. 5</list_item>
|
||||
<list_item><loc_45><loc_76><loc_234><loc_95>[2] Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling Mao. Complicated table structure recognition. arXiv preprint arXiv:1908.04729 , 2019. 3</list_item>
|
||||
<list_item><loc_45><loc_97><loc_234><loc_116>[3] Bertrand Couasnon and Aurelie Lemaitre. Recognition of Tables and Forms , pages 647-677. Springer London, London, 2014. 2</list_item>
|
||||
<list_item><loc_45><loc_118><loc_234><loc_143>[4] Herv´ e D´ ejean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2</list_item>
|
||||
<list_item><loc_45><loc_118><loc_234><loc_143>[4] Hervé Déjean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2</list_item>
|
||||
<list_item><loc_45><loc_146><loc_234><loc_171>[5] Basilios Gatos, Dimitrios Danatsas, Ioannis Pratikakis, and Stavros J Perantonis. Automatic table detection in document images. In International Conference on Pattern Recognition and Image Analysis , pages 609-618. Springer, 2005. 2</list_item>
|
||||
<list_item><loc_45><loc_174><loc_234><loc_199>[6] MaxG¨ obel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2</list_item>
|
||||
<list_item><loc_45><loc_174><loc_234><loc_199>[6] Max Göbel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2</list_item>
|
||||
<list_item><loc_45><loc_201><loc_234><loc_220>[7] EA Green and M Krishnamoorthy. Recognition of tables using table grammars. procs. In Symposium on Document Analysis and Recognition (SDAIR'95) , pages 261-277. 2</list_item>
|
||||
<list_item><loc_45><loc_222><loc_234><loc_255>[8] Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal. Castabdetectors: Cascade network for table detection in document images with recursive feature pyramid and switchable atrous convolution. Journal of Imaging , 7(10), 2021. 1</list_item>
|
||||
<list_item><loc_45><loc_257><loc_234><loc_276>[9] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) , Oct 2017. 1</list_item>
|
||||
<list_item><loc_41><loc_278><loc_234><loc_304>[10] Yelin He, X. Qi, Jiaquan Ye, Peng Gao, Yihao Chen, Bingcong Li, Xin Tang, and Rong Xiao. Pingan-vcgroup's solution for icdar 2021 competition on scientific table image recognition to latex. ArXiv , abs/2105.01846, 2021. 2</list_item>
|
||||
<list_item><loc_41><loc_306><loc_234><loc_339>[11] Jianying Hu, Ramanujan S Kashi, Daniel P Lopresti, and Gordon Wilfong. Medium-independent table detection. In Document Recognition and Retrieval VII , volume 3967, pages 291-302. International Society for Optics and Photonics, 1999. 2</list_item>
|
||||
<list_item><loc_41><loc_341><loc_234><loc_373>[12] Matthew Hurst. A constraint-based approach to table structure derivation. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2 , ICDAR '03, page 911, USA, 2003. IEEE Computer Society. 2</list_item>
|
||||
<list_item><loc_41><loc_376><loc_234><loc_408>[13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ ement Chatelain, and Thierry Paquet. Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2</list_item>
|
||||
<list_item><loc_41><loc_376><loc_234><loc_408>[13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Clément Chatelain, and Thierry Paquet. Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2</list_item>
|
||||
<list_item><loc_41><loc_410><loc_234><loc_429>[14] Pratik Kayal, Mrinal Anand, Harsh Desai, and Mayank Singh. Icdar 2021 competition on scientific table image recognition to latex, 2021. 2</list_item>
|
||||
<list_item><loc_41><loc_431><loc_234><loc_450>[15] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6</list_item>
|
||||
<list_item><loc_41><loc_431><loc_234><loc_450>[15] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6</list_item>
|
||||
<list_item><loc_252><loc_48><loc_445><loc_88>[16] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence , 35(12):2891-2903, 2013. 4</list_item>
|
||||
<list_item><loc_252><loc_90><loc_445><loc_109>[17] Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. Tablebank: A benchmark dataset for table detection and recognition, 2019. 2, 3</list_item>
|
||||
<list_item><loc_252><loc_111><loc_445><loc_164>[18] Yiren Li, Zheng Huang, Junchi Yan, Yi Zhou, Fan Ye, and Xianhui Liu. Gfte: Graph-based financial table extraction. In Alberto Del Bimbo, Rita Cucchiara, Stan Sclaroff, Giovanni Maria Farinella, Tao Mei, Marco Bertini, Hugo Jair Escalante, and Roberto Vezzani, editors, Pattern Recognition. ICPR International Workshops and Challenges , pages 644-658, Cham, 2021. Springer International Publishing. 2, 3</list_item>
|
||||
<list_item><loc_252><loc_167><loc_445><loc_206>[19] Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, and Peter Staar. Robust pdf document conversion using recurrent neural networks. Proceedings of the AAAI Conference on Artificial Intelligence , 35(17):15137-15145, May 2021. 1</list_item>
|
||||
<list_item><loc_252><loc_208><loc_445><loc_234>[20] Rujiao Long, Wen Wang, Nan Xue, Feiyu Gao, Zhibo Yang, Yongpan Wang, and Gui-Song Xia. Parsing table structures in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 944-952, 2021. 2</list_item>
|
||||
<list_item><loc_252><loc_236><loc_445><loc_276>[21] Shubham Singh Paliwal, D Vishwanath, Rohit Rahul, Monika Sharma, and Lovekesh Vig. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 128-133. IEEE, 2019. 1</list_item>
|
||||
<list_item><loc_252><loc_278><loc_445><loc_352>[22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6</list_item>
|
||||
<list_item><loc_252><loc_278><loc_445><loc_352>[22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6</list_item>
|
||||
<list_item><loc_252><loc_355><loc_445><loc_394>[23] Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. Cascadetabnet: An approach for end to end table detection and structure recognition from image-based documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages 572-573, 2020. 1</list_item>
|
||||
<list_item><loc_252><loc_396><loc_445><loc_422>[24] Shah Rukh Qasim, Hassan Mahmood, and Faisal Shafait. Rethinking table recognition using graph neural networks. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 142-147. IEEE, 2019. 3</list_item>
|
||||
<list_item><loc_252><loc_424><loc_445><loc_450>[25] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on</list_item>
|
||||
@ -160,7 +161,7 @@
|
||||
<list_item><loc_41><loc_104><loc_234><loc_143>[27] Sebastian Schreiber, Stefan Agne, Ivo Wolf, Andreas Dengel, and Sheraz Ahmed. Deepdesrt: Deep learning for detection and structure recognition of tables in document images. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR) , volume 1, pages 1162-1167. IEEE, 2017. 3</list_item>
|
||||
<list_item><loc_41><loc_146><loc_234><loc_171>[28] Faisal Shafait and Ray Smith. Table detection in heterogeneous documents. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems , pages 65-72, 2010. 2</list_item>
|
||||
<list_item><loc_41><loc_174><loc_234><loc_206>[29] Shoaib Ahmed Siddiqui, Imran Ali Fateh, Syed Tahseen Raza Rizvi, Andreas Dengel, and Sheraz Ahmed. Deeptabstr: Deep learning based table structure recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 1403-1409. IEEE, 2019. 3</list_item>
|
||||
<list_item><loc_41><loc_208><loc_234><loc_241>[30] Peter W J Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD , KDD '18, pages 774-782, New York, NY, USA, 2018. ACM. 1</list_item>
|
||||
<list_item><loc_41><loc_208><loc_234><loc_241>[30] Peter W J Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD , KDD '18, pages 774-782, New York, NY, USA, 2018. ACM. 1</list_item>
|
||||
<list_item><loc_41><loc_243><loc_234><loc_290>[31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30 , pages 5998-6008. Curran Associates, Inc., 2017. 5</list_item>
|
||||
<list_item><loc_41><loc_292><loc_234><loc_317>[32] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , June 2015. 2</list_item>
|
||||
<list_item><loc_41><loc_320><loc_234><loc_345>[33] Wenyuan Xue, Qingyong Li, and Dacheng Tao. Res2tim: reconstruct syntactic structures from table images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 749-755. IEEE, 2019. 3</list_item>
|
||||
@ -176,7 +177,7 @@
|
||||
<section_header_level_1><loc_109><loc_70><loc_380><loc_86>TableFormer: Table Structure Understanding with Transformers Supplementary Material</section_header_level_1>
|
||||
<section_header_level_1><loc_41><loc_102><loc_144><loc_109>1. Details on the datasets</section_header_level_1>
|
||||
<section_header_level_1><loc_41><loc_114><loc_123><loc_120>1.1. Data preparation</section_header_level_1>
|
||||
<text><loc_41><loc_126><loc_234><loc_245>As a first step of our data preparation process, we have calculated statistics over the datasets across the following dimensions: (1) table size measured in the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes). A table is considered to be simple if it does not contain row spans or column spans. Additionally, a table has a strict HTML structure if every row has the same number of columns after taking into account any row or column spans. Therefore a strict HTML structure looks always rectangular. However, HTML is a lenient encoding format, i.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. These implicit rules leave room for ambiguity, which we want to avoid. As such, we prefer to have 'strict' tables, i.e. tables where every row has exactly the same length.</text>
|
||||
<text><loc_41><loc_126><loc_234><loc_245>As a first step of our data preparation process, we have calculated statistics over the datasets across the following dimensions: (1) table size measured in the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes). Atable is considered to be simple if it does not contain row spans or column spans. Additionally, a table has a strict HTMLstructure if every row has the same number of columns after taking into account any row or column spans. Therefore a strict HTML structure looks always rectangular. However, HTML is a lenient encoding format, i.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. These implicit rules leave room for ambiguity, which we want to avoid. As such, we prefer to have 'strict' tables, i.e. tables where every row has exactly the same length.</text>
|
||||
<text><loc_41><loc_247><loc_234><loc_396>We have developed a technique that tries to derive a missing bounding box out of its neighbors. As a first step, we use the annotation data to generate the most fine-grained grid that covers the table structure. In case of strict HTML tables, all grid squares are associated with some table cell and in the presence of table spans a cell extends across multiple grid squares. When enough bounding boxes are known for a rectangular table, it is possible to compute the geometrical border lines between the grid rows and columns. Eventually this information is used to generate the missing bounding boxes. Additionally, the existence of unused grid squares indicates that the table rows have unequal number of columns and the overall structure is non-strict. The generation of missing bounding boxes for non-strict HTML tables is ambiguous and therefore quite challenging. Thus, we have decided to simply discard those tables. In case of PubTabNet we have computed missing bounding boxes for 48% of the simple and 69% of the complex tables. Regarding FinTabNet, 68% of the simple and 98% of the complex tables require the generation of bounding boxes.</text>
|
||||
<text><loc_41><loc_398><loc_234><loc_411>Figure 7 illustrates the distribution of the tables across different dimensions per dataset.</text>
|
||||
<section_header_level_1><loc_41><loc_418><loc_125><loc_424>1.2. Synthetic datasets</section_header_level_1>
|
||||
|
File diff suppressed because one or more lines are too long
@ -1,3 +1,5 @@
|
||||
2022
|
||||
|
||||
## TableFormer: Table Structure Understanding with Transformers.
|
||||
|
||||
## Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar IBM Research
|
||||
@ -51,7 +53,7 @@ To meet the design criteria listed above, we developed a new model called TableF
|
||||
|
||||
The paper is structured as follows. In Sec. 2, we give a brief overview of the current state-of-the-art. In Sec. 3, we describe the datasets on which we train. In Sec. 4, we introduce the TableFormer model-architecture and describe
|
||||
|
||||
its results & performance in Sec. 5. As a conclusion, we describe how this new model-architecture can be re-purposed for other tasks in the computer-vision community.
|
||||
its results & performance in Sec. 5. As a conclusion, we describe how this new model-architecture can be re-purposed for other tasks in the computer-vision community.
|
||||
|
||||
## 2. Previous work and State of the Art
|
||||
|
||||
@ -79,7 +81,7 @@ Figure 2: Distribution of the tables across different table dimensions in PubTab
|
||||
|
||||
balance in the previous datasets.
|
||||
|
||||
The PubTabNet dataset contains 509k tables delivered as annotated PNG images. The annotations consist of the table structure represented in HTML format, the tokenized text and its bounding boxes per table cell. Fig. 1 shows the appearance style of PubTabNet. Depending on its complexity, a table is characterized as 'simple' when it does not contain row spans or column spans, otherwise it is 'complex'. The dataset is divided into Train and Val splits (roughly 98% and 2%). The Train split consists of 54% simple and 46% complex tables and the Val split of 51% and 49% respectively. The FinTabNet dataset contains 112k tables delivered as single-page PDF documents with mixed table structures and text content. Similarly to the PubTabNet, the annotations of FinTabNet include the table structure in HTML, the tokenized text and the bounding boxes on a table cell basis. The dataset is divided into Train, Test and Val splits (81%, 9.5%, 9.5%), and each one is almost equally divided into simple and complex tables (Train: 48% simple, 52% complex, Test: 48% simple, 52% complex, Test: 53% simple, 47% complex). Finally the TableBank dataset consists of 145k tables provided as JPEG images. The latter has annotations for the table structure, but only few with bounding boxes of the table cells. The entire dataset consists of simple tables and it is divided into 90% Train, 3% Test and 7% Val splits.
|
||||
The PubTabNet dataset contains 509k tables delivered as annotated PNGimages. The annotations consist of the table structure represented in HTML format, the tokenized text and its bounding boxes per table cell. Fig. 1 shows the appearance style of PubTabNet. Depending on its complexity, a table is characterized as 'simple' when it does not contain row spans or column spans, otherwise it is 'complex'. The dataset is divided into Train and Val splits (roughly 98%and 2%). The Train split consists of 54% simple and 46% complex tables and the Val split of 51% and 49% respectively. The FinTabNet dataset contains 112k tables delivered as single-page PDFdocuments with mixed table structures and text content. Similarly to the PubTabNet, the annotations of FinTabNet include the table structure in HTML, the tokenized text and the bounding boxes on a table cell basis. The dataset is divided into Train, Test and Val splits (81%, 9.5%, 9.5%), and each one is almost equally divided into simple and complex tables (Train: 48% simple, 52% complex, Test: 48% simple, 52% complex, Test: 53% simple, 47% complex). Finally the TableBank dataset consists of 145k tables provided as JPEG images. The latter has annotations for the table structure, but only few with bounding boxes of the table cells. The entire dataset consists of simple tables and it is divided into 90% Train, 3% Test and 7% Val splits.
|
||||
|
||||
Due to the heterogeneity across the dataset formats, it was necessary to combine all available data into one homogenized dataset before we could train our models for practical purposes. Given the size of PubTabNet, we adopted its annotation format and we extracted and converted all tables as PNG images with a resolution of 72 dpi. Additionally, we have filtered out tables with extreme sizes due to small
|
||||
|
||||
@ -132,7 +134,7 @@ Structure Decoder. The transformer architecture of this component is based on th
|
||||
|
||||
The transformer encoder receives an encoded image from the CNN Backbone Network and refines it through a multi-head dot-product attention layer, followed by a Feed Forward Network. During training, the transformer decoder receives as input the output feature produced by the transformer encoder, and the tokenized input of the HTML ground-truth tags. Using a stack of multi-head attention layers, different aspects of the tag sequence could be inferred. This is achieved by each attention head on a layer operating in a different subspace, and then combining altogether their attention score.
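As a rough illustration of the encoder/decoder wiring described in this paragraph, the sketch below builds a small PyTorch transformer with the two encoder layers and four decoder layers mentioned in the paper's structure-decoder description; the model width, head count and tag-vocabulary size are assumptions chosen only for the example, and positional encodings and the causal mask are omitted for brevity.

```python
# Hedged sketch of the structure decoder data flow, not the authors' implementation.
# Layer counts (2 encoder / 4 decoder layers) follow the paper; d_model, nhead and
# vocab_size are illustrative assumptions.
import torch.nn as nn

class StructureDecoderSketch(nn.Module):
    def __init__(self, d_model=256, nhead=4, vocab_size=32):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
        self.tag_embed = nn.Embedding(vocab_size, d_model)
        self.tag_head = nn.Linear(d_model, vocab_size)

    def forward(self, image_features, tag_tokens):
        # image_features: (batch, 28*28, d_model) flattened CNN feature map
        # tag_tokens:     (batch, seq_len) tokenized HTML ground-truth structure tags
        memory = self.encoder(image_features)   # refined image encoding
        tgt = self.tag_embed(tag_tokens)        # teacher-forced tag embeddings
        hidden = self.decoder(tgt, memory)      # causal mask omitted for brevity
        return self.tag_head(hidden)            # per-position logits over the tag vocabulary
```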
|
||||
|
||||
Cell BBox Decoder. Our architecture allows to simultaneously predict HTML tags and bounding boxes for each table cell without the need of a separate object detector end to end. This approach is inspired by DETR [1] which employs a Transformer Encoder, and Decoder that looks for a specific number of object queries (potential object detections). As our model utilizes a transformer architecture, the hidden state of the < td > ' and ' < ' HTML structure tags become the object query.
|
||||
Cell BBox Decoder. Our architecture allows to simultaneously predict HTML tags and bounding boxes for each table cell without the need of a separate object detector end to end. This approach is inspired by DETR [1] which employs a Transformer Encoder, and Decoder that looks for a specific number of object queries (potential object detections). As our model utilizes a transformer architecture, the hidden state of the < td > ' and ' < ' HTMLstructure tags become the object query.
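To make the object-query idea concrete, here is a hedged sketch (not the authors' implementation) of a DETR-style box head: the decoder hidden states at the positions of the cell-opening tags are gathered and mapped to normalized box coordinates. The tensor shapes and layer sizes are assumptions.

```python
# Illustrative only: decoder hidden states at cell-tag positions act as object
# queries and are regressed to normalized (cx, cy, w, h) boxes.
import torch
import torch.nn as nn

class CellBBoxHeadSketch(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4),
        )

    def forward(self, decoder_hidden, cell_positions):
        # decoder_hidden: (batch, seq_len, d_model) structure-decoder outputs
        # cell_positions: (batch, n_cells) indices of the cell-opening tags
        queries = torch.gather(
            decoder_hidden, 1,
            cell_positions.unsqueeze(-1).expand(-1, -1, decoder_hidden.size(-1)),
        )
        return self.mlp(queries).sigmoid()  # (batch, n_cells, 4), coordinates in [0, 1]
```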
|
||||
|
||||
The encoding generated by the CNN Backbone Network along with the features acquired for every data cell from the Transformer Decoder are then passed to the attention network. The attention network takes both inputs and learns to provide an attention weighted encoding. This weighted at-
|
||||
|
||||
@ -269,7 +271,7 @@ Figure 5: One of the benefits of TableFormer is that it is language agnostic, as
|
||||
|
||||
## 5.5. Qualitative Analysis
|
||||
|
||||
## 6. Future Work & Conclusion
|
||||
## 6. Future Work & Conclusion
|
||||
|
||||
We showcase several visualizations for the different components of our network on various 'complex' tables within datasets presented in this work in Fig. 5 and Fig. 6 As it is shown, our model is able to predict bounding boxes for all table cells, even for the empty ones. Additionally, our post-processing techniques can extract the cell content by matching the predicted bounding boxes to the PDF cells based on their overlap and spatial proximity. The left part of Fig. 5 demonstrates also the adaptability of our method to any language, as it can successfully extract Japanese text, although the training set contains only English content. We provide more visualizations including the intermediate steps in the supplementary material. Overall these illustrations justify the versatility of our method across a diverse range of table appearances and content type.
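The cell-content matching mentioned here is essentially an overlap assignment between predicted cell boxes and PDF text cells. A minimal sketch, assuming boxes in (x0, y0, x1, y1) form and a purely IoU-based criterion (the actual post-processing may use additional proximity rules):

```python
# Hypothetical sketch of overlap-based cell matching; threshold and data shapes
# are assumptions, not taken from the paper.
def iou(a, b):
    # boxes as (x0, y0, x1, y1)
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_cells(pred_boxes, pdf_cells, min_iou=0.1):
    """pdf_cells: list of (box, text). Returns one text string per predicted box."""
    matched = []
    for pb in pred_boxes:
        best = max(pdf_cells, key=lambda c: iou(pb, c[0]), default=None)
        if best is not None and iou(pb, best[0]) >= min_iou:
            matched.append(best[1])
        else:
            matched.append("")  # no PDF cell overlaps sufficiently: treat as empty cell
    return matched
```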
|
||||
|
||||
@ -282,25 +284,25 @@ In this paper, we presented TableFormer an end-to-end transformer based approach
|
||||
- end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 , pages 213-229, Cham, 2020. Springer International Publishing. 5
|
||||
- [2] Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling Mao. Complicated table structure recognition. arXiv preprint arXiv:1908.04729 , 2019. 3
|
||||
- [3] Bertrand Couasnon and Aurelie Lemaitre. Recognition of Tables and Forms , pages 647-677. Springer London, London, 2014. 2
|
||||
- [4] Herv´ e D´ ejean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2
|
||||
- [4] Hervé Déjean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2
|
||||
- [5] Basilios Gatos, Dimitrios Danatsas, Ioannis Pratikakis, and Stavros J Perantonis. Automatic table detection in document images. In International Conference on Pattern Recognition and Image Analysis , pages 609-618. Springer, 2005. 2
|
||||
- [6] MaxG¨ obel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2
|
||||
- [6] Max Göbel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2
|
||||
- [7] EA Green and M Krishnamoorthy. Recognition of tables using table grammars. procs. In Symposium on Document Analysis and Recognition (SDAIR'95) , pages 261-277. 2
|
||||
- [8] Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal. Castabdetectors: Cascade network for table detection in document images with recursive feature pyramid and switchable atrous convolution. Journal of Imaging , 7(10), 2021. 1
|
||||
- [9] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) , Oct 2017. 1
|
||||
- [10] Yelin He, X. Qi, Jiaquan Ye, Peng Gao, Yihao Chen, Bingcong Li, Xin Tang, and Rong Xiao. Pingan-vcgroup's solution for icdar 2021 competition on scientific table image recognition to latex. ArXiv , abs/2105.01846, 2021. 2
|
||||
- [11] Jianying Hu, Ramanujan S Kashi, Daniel P Lopresti, and Gordon Wilfong. Medium-independent table detection. In Document Recognition and Retrieval VII , volume 3967, pages 291-302. International Society for Optics and Photonics, 1999. 2
|
||||
- [12] Matthew Hurst. A constraint-based approach to table structure derivation. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2 , ICDAR '03, page 911, USA, 2003. IEEE Computer Society. 2
|
||||
- [13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ ement Chatelain, and Thierry Paquet. Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2
|
||||
- [13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Clément Chatelain, and Thierry Paquet. Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2
|
||||
- [14] Pratik Kayal, Mrinal Anand, Harsh Desai, and Mayank Singh. Icdar 2021 competition on scientific table image recognition to latex, 2021. 2
|
||||
- [15] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6
|
||||
- [15] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6
|
||||
- [16] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence , 35(12):2891-2903, 2013. 4
|
||||
- [17] Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. Tablebank: A benchmark dataset for table detection and recognition, 2019. 2, 3
|
||||
- [18] Yiren Li, Zheng Huang, Junchi Yan, Yi Zhou, Fan Ye, and Xianhui Liu. Gfte: Graph-based financial table extraction. In Alberto Del Bimbo, Rita Cucchiara, Stan Sclaroff, Giovanni Maria Farinella, Tao Mei, Marco Bertini, Hugo Jair Escalante, and Roberto Vezzani, editors, Pattern Recognition. ICPR International Workshops and Challenges , pages 644-658, Cham, 2021. Springer International Publishing. 2, 3
|
||||
- [19] Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, and Peter Staar. Robust pdf document conversion using recurrent neural networks. Proceedings of the AAAI Conference on Artificial Intelligence , 35(17):15137-15145, May 2021. 1
|
||||
- [20] Rujiao Long, Wen Wang, Nan Xue, Feiyu Gao, Zhibo Yang, Yongpan Wang, and Gui-Song Xia. Parsing table structures in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 944-952, 2021. 2
|
||||
- [21] Shubham Singh Paliwal, D Vishwanath, Rohit Rahul, Monika Sharma, and Lovekesh Vig. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 128-133. IEEE, 2019. 1
|
||||
- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6
|
||||
- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6
|
||||
- [23] Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. Cascadetabnet: An approach for end to end table detection and structure recognition from image-based documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages 572-573, 2020. 1
|
||||
- [24] Shah Rukh Qasim, Hassan Mahmood, and Faisal Shafait. Rethinking table recognition using graph neural networks. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 142-147. IEEE, 2019. 3
|
||||
- [25] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on
|
||||
@ -311,7 +313,7 @@ Computer Vision and Pattern Recognition , pages 658-666, 2019. 6
|
||||
- [27] Sebastian Schreiber, Stefan Agne, Ivo Wolf, Andreas Dengel, and Sheraz Ahmed. Deepdesrt: Deep learning for detection and structure recognition of tables in document images. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR) , volume 1, pages 1162-1167. IEEE, 2017. 3
|
||||
- [28] Faisal Shafait and Ray Smith. Table detection in heterogeneous documents. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems , pages 65-72, 2010. 2
|
||||
- [29] Shoaib Ahmed Siddiqui, Imran Ali Fateh, Syed Tahseen Raza Rizvi, Andreas Dengel, and Sheraz Ahmed. Deeptabstr: Deep learning based table structure recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 1403-1409. IEEE, 2019. 3
|
||||
- [30] Peter W J Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD , KDD '18, pages 774-782, New York, NY, USA, 2018. ACM. 1
|
||||
- [30] Peter W J Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD , KDD '18, pages 774-782, New York, NY, USA, 2018. ACM. 1
|
||||
- [31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30 , pages 5998-6008. Curran Associates, Inc., 2017. 5
|
||||
- [32] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , June 2015. 2
|
||||
- [33] Wenyuan Xue, Qingyong Li, and Dacheng Tao. Res2tim: reconstruct syntactic structures from table images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 749-755. IEEE, 2019. 3
|
||||
@ -328,7 +330,7 @@ Computer Vision and Pattern Recognition , pages 658-666, 2019. 6
|
||||
|
||||
## 1.1. Data preparation
|
||||
|
||||
As a first step of our data preparation process, we have calculated statistics over the datasets across the following dimensions: (1) table size measured in the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes). A table is considered to be simple if it does not contain row spans or column spans. Additionally, a table has a strict HTML structure if every row has the same number of columns after taking into account any row or column spans. Therefore a strict HTML structure looks always rectangular. However, HTML is a lenient encoding format, i.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. These implicit rules leave room for ambiguity, which we want to avoid. As such, we prefer to have 'strict' tables, i.e. tables where every row has exactly the same length.
|
||||
As a first step of our data preparation process, we have calculated statistics over the datasets across the following dimensions: (1) table size measured in the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes). Atable is considered to be simple if it does not contain row spans or column spans. Additionally, a table has a strict HTMLstructure if every row has the same number of columns after taking into account any row or column spans. Therefore a strict HTML structure looks always rectangular. However, HTML is a lenient encoding format, i.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. These implicit rules leave room for ambiguity, which we want to avoid. As such, we prefer to have 'strict' tables, i.e. tables where every row has exactly the same length.
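A minimal sketch of the 'strict' check described above, assuming each cell is given as a (rowspan, colspan) pair; spans are expanded onto a grid and the effective width of every row is compared:

```python
# Hedged sketch only: the cell representation (list of rows of (rowspan, colspan)
# tuples) is an assumption, not the paper's actual data structure.
def is_strict(table):
    """table: list of rows, each row a list of (rowspan, colspan) tuples."""
    occupied = {}  # (row, col) -> True for grid squares already covered by a span
    widths = []
    for r, row in enumerate(table):
        col = 0
        for rowspan, colspan in row:
            while occupied.get((r, col)):      # skip squares filled by rowspans from above
                col += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied[(r + dr, col + dc)] = True
            col += colspan
        while occupied.get((r, col)):          # count trailing squares from earlier rowspans
            col += 1
        widths.append(col)
    return len(set(widths)) <= 1               # strict: every row covers the same width
```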
|
||||
|
||||
We have developed a technique that tries to derive a missing bounding box out of its neighbors. As a first step, we use the annotation data to generate the most fine-grained grid that covers the table structure. In case of strict HTML tables, all grid squares are associated with some table cell and in the presence of table spans a cell extends across multiple grid squares. When enough bounding boxes are known for a rectangular table, it is possible to compute the geometrical border lines between the grid rows and columns. Eventually this information is used to generate the missing bounding boxes. Additionally, the existence of unused grid squares indicates that the table rows have unequal number of columns and the overall structure is non-strict. The generation of missing bounding boxes for non-strict HTML tables is ambiguous and therefore quite challenging. Thus, we have decided to simply discard those tables. In case of PubTabNet we have computed missing bounding boxes for 48% of the simple and 69% of the complex tables. Regarding FinTabNet, 68% of the simple and 98% of the complex tables require the generation of bounding boxes.
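The border-line reconstruction can be illustrated as follows; this is only a sketch under the assumption of a strict table whose known boxes are indexed by grid square, using medians of the observed edges as border lines (the paper does not specify the exact estimator):

```python
# Illustrative sketch of filling missing cell boxes from derived grid borders.
# Box format (x0, y0, x1, y1) and the median-based border estimate are assumptions.
from statistics import median

def grid_borders(known, n_rows, n_cols):
    """known: dict mapping (row, col) -> (x0, y0, x1, y1) for annotated cells."""
    xs = [[] for _ in range(n_cols + 1)]
    ys = [[] for _ in range(n_rows + 1)]
    for (r, c), (x0, y0, x1, y1) in known.items():
        xs[c].append(x0); xs[c + 1].append(x1)
        ys[r].append(y0); ys[r + 1].append(y1)
    # a border line can only be derived where at least one neighbouring box is known
    col_lines = [median(v) if v else None for v in xs]
    row_lines = [median(v) if v else None for v in ys]
    return col_lines, row_lines

def fill_missing(known, n_rows, n_cols):
    col_lines, row_lines = grid_borders(known, n_rows, n_cols)
    filled = dict(known)
    for r in range(n_rows):
        for c in range(n_cols):
            if (r, c) in filled:
                continue
            if None in (col_lines[c], col_lines[c + 1], row_lines[r], row_lines[r + 1]):
                continue  # not enough known neighbours: leave the box missing
            filled[(r, c)] = (col_lines[c], row_lines[r], col_lines[c + 1], row_lines[r + 1])
    return filled
```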
|
||||
|
||||
|
File diff suppressed because one or more lines are too long
@ -1,4 +1,5 @@
|
||||
<doctag><page_header><loc_15><loc_104><loc_30><loc_350>arXiv:2206.01062v1 [cs.CV] 2 Jun 2022</page_header>
|
||||
<doctag><page_header><loc_15><loc_136><loc_30><loc_350>arXiv:2206.01062v1 [cs.CV] 2 Jun</page_header>
|
||||
<text><loc_15><loc_104><loc_30><loc_129>2022</text>
|
||||
<section_header_level_1><loc_88><loc_53><loc_413><loc_75>DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis</section_header_level_1>
|
||||
<text><loc_74><loc_85><loc_158><loc_114>Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com</text>
|
||||
<text><loc_208><loc_85><loc_292><loc_114>Christoph Auer IBM Research Rueschlikon, Switzerland cau@zurich.ibm.com</text>
|
||||
@ -23,7 +24,7 @@
|
||||
<page_header><loc_44><loc_38><loc_456><loc_43>KDD '22, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar</page_header>
|
||||
<section_header_level_1><loc_44><loc_55><loc_128><loc_61>1 INTRODUCTION</section_header_level_1>
|
||||
<text><loc_44><loc_70><loc_248><loc_144>Despite the substantial improvements achieved with machine-learning (ML) approaches and deep neural networks in recent years, document conversion remains a challenging problem, as demonstrated by the numerous public competitions held on this topic [1-4]. The challenge originates from the huge variability in PDF documents regarding layout, language and formats (scanned, programmatic or a combination of both). Engineering a single ML model that can be applied on all types of documents and provides high-quality layout segmentation remains to this day extremely challenging [5]. To highlight the variability in document layouts, we show a few example documents from the DocLayNet dataset in Figure 1.</text>
|
||||
<text><loc_44><loc_146><loc_241><loc_317>Akeyproblem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or L A T E X sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank is very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.</text>
|
||||
<text><loc_44><loc_146><loc_241><loc_317>Akeyproblem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or LT E X A sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank is very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.</text>
|
||||
<text><loc_44><loc_319><loc_241><loc_366>In this paper, we present the DocLayNet dataset. It provides pageby-page layout annotation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique document pages, of which a fraction carry double- or triple-annotations. DocLayNet is similar in spirit to PubLayNet and DocBank and will likewise be made available to the public 1 in order to stimulate the document-layout analysis community. It distinguishes itself in the following aspects:</text>
|
||||
<unordered_list><list_item><loc_53><loc_369><loc_241><loc_388>(1) Human Annotation : In contrast to PubLayNet and DocBank, we relied on human annotation instead of automation approaches to generate the data set.</list_item>
|
||||
<list_item><loc_53><loc_390><loc_240><loc_402>(2) Large Layout Variability : We include diverse and complex layouts from a large variety of public sources.</list_item>
|
||||
@ -37,10 +38,10 @@
|
||||
<text><loc_259><loc_106><loc_457><loc_139>All aspects outlined above are detailed in Section 3. In Section 4, we will elaborate on how we designed and executed this large-scale human annotation campaign. We will also share key insights and lessons learned that might prove helpful for other parties planning to set up annotation campaigns.</text>
|
||||
<text><loc_260><loc_141><loc_457><loc_194>In Section 5, we will present baseline accuracy numbers for a variety of object detection methods (Faster R-CNN, Mask R-CNN and YOLOv5) trained on DocLayNet. We further show how the model performance is impacted by varying the DocLayNet dataset size, reducing the label set and modifying the train/test-split. Last but not least, we compare the performance of models trained on PubLayNet, DocBank and DocLayNet and demonstrate that a model trained on DocLayNet provides overall more robust layout recovery.</text>
|
||||
<section_header_level_1><loc_260><loc_203><loc_345><loc_209>2 RELATED WORK</section_header_level_1>
|
||||
<text><loc_259><loc_219><loc_457><loc_293>While early approaches in document-layout analysis used rulebased algorithms and heuristics [8], the problem is lately addressed with deep learning methods. The most common approach is to leverage object detection models [9-15]. In the last decade, the accuracy and speed of these models has increased dramatically. Furthermore, most state-of-the-art object detection methods can be trained and applied with very little work, thanks to a standardisation effort of the ground-truth data format [16] and common deep-learning frameworks [17]. Reference data sets such as PubLayNet [6] and DocBank provide their data in the commonly accepted COCO format [16].</text>
|
||||
<text><loc_259><loc_219><loc_457><loc_293>While early approaches in document-layout analysis used rulebased algorithms and heuristics [8], the problem is lately addressed with deep learning methods. The most commonapproach is to leverage object detection models [9-15]. In the last decade, the accuracy and speed of these models has increased dramatically. Furthermore, most state-of-the-art object detection methods can be trained and applied with very little work, thanks to a standardisation effort of the ground-truth data format [16] and common deep-learning frameworks [17]. Reference data sets such as PubLayNet [6] and DocBank provide their data in the commonly accepted COCO format [16].</text>
|
||||
<text><loc_260><loc_295><loc_457><loc_348>Lately, new types of ML models for document-layout analysis have emerged in the community [18-21]. These models do not approach the problem of layout analysis purely based on an image representation of the page, as computer vision methods do. Instead, they combine the text tokens and image representation of a page in order to obtain a segmentation. While the reported accuracies appear to be promising, a broadly accepted data format which links geometric and textual features has yet to establish.</text>
|
||||
<section_header_level_1><loc_260><loc_357><loc_390><loc_363>3 THE DOCLAYNET DATASET</section_header_level_1>
|
||||
<text><loc_260><loc_373><loc_457><loc_426>DocLayNet contains 80863 PDF pages. Among these, 7059 carry two instances of human annotations, and 1591 carry three. This amounts to 91104 total annotation instances. The annotations provide layout information in the shape of labeled, rectangular boundingboxes. We define 11 distinct labels for layout features, namely Caption , Footnote , Formula List-item , , Page-footer , Page-header , Picture , Section-header , Table , Text , and Title . Our reasoning for picking this particular label set is detailed in Section 4.</text>
|
||||
<text><loc_260><loc_373><loc_457><loc_426>DocLayNet contains 80863 PDF pages. Amongthese, 7059 carry two instances of humanannotations, and 1591 carry three. This amounts to 91104 total annotation instances. The annotations provide layout information in the shape of labeled, rectangular boundingboxes. We define 11 distinct labels for layout features, namely Caption , Footnote , Formula List-item , , Page-footer , Page-header , Picture , Section-header , Table , Text , and Title . Our reasoning for picking this particular label set is detailed in Section 4.</text>
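For reference, the 11 layout labels listed above can be captured as a simple constant, for example when building COCO-style category entries; the numeric ids below are an assumed, arbitrary ordering, not taken from the dataset.

```python
# Illustrative only: the 11 DocLayNet layout labels as COCO-style categories.
DOCLAYNET_LABELS = [
    "Caption", "Footnote", "Formula", "List-item", "Page-footer", "Page-header",
    "Picture", "Section-header", "Table", "Text", "Title",
]
CATEGORIES = [{"id": i + 1, "name": name} for i, name in enumerate(DOCLAYNET_LABELS)]
```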
|
||||
<text><loc_260><loc_428><loc_456><loc_447>In addition to open intellectual property constraints for the source documents, we required that the documents in DocLayNet adhere to a few conditions. Firstly, we kept scanned documents</text>
|
||||
<page_break>
|
||||
<page_header><loc_44><loc_38><loc_284><loc_43>DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis</page_header>
|
||||
@ -58,12 +59,12 @@
|
||||
<text><loc_260><loc_399><loc_457><loc_446>The annotation campaign was carried out in four phases. In phase one, we identified and prepared the data sources for annotation. In phase two, we determined the class labels and how annotations should be done on the documents in order to obtain maximum consistency. The latter was guided by a detailed requirement analysis and exhaustive experiments. In phase three, we trained the annotation staff and performed exams for quality assurance. In phase four,</text>
|
||||
<page_break>
|
||||
<page_header><loc_44><loc_38><loc_456><loc_43>KDD '22, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar</page_header>
|
||||
<otsl><loc_81><loc_87><loc_419><loc_186><ecel><ecel><ched>% of Total<lcel><lcel><ched>triple inter-annotator mAP @ 0.5-0.95 (%)<lcel><lcel><lcel><lcel><lcel><lcel><nl><ched>class label<ched>Count<ched>Train<ched>Test<ched>Val<ched>All<ched>Fin<ched>Man<ched>Sci<ched>Law<ched>Pat<ched>Ten<nl><rhed>Caption<fcel>22524<fcel>2.04<fcel>1.77<fcel>2.32<fcel>84-89<fcel>40-61<fcel>86-92<fcel>94-99<fcel>95-99<fcel>69-78<fcel>n/a<nl><rhed>Footnote<fcel>6318<fcel>0.60<fcel>0.31<fcel>0.58<fcel>83-91<fcel>n/a<fcel>100<fcel>62-88<fcel>85-94<fcel>n/a<fcel>82-97<nl><rhed>Formula<fcel>25027<fcel>2.25<fcel>1.90<fcel>2.96<fcel>83-85<fcel>n/a<fcel>n/a<fcel>84-87<fcel>86-96<fcel>n/a<fcel>n/a<nl><rhed>List-item<fcel>185660<fcel>17.19<fcel>13.34<fcel>15.82<fcel>87-88<fcel>74-83<fcel>90-92<fcel>97-97<fcel>81-85<fcel>75-88<fcel>93-95<nl><rhed>Page-footer<fcel>70878<fcel>6.51<fcel>5.58<fcel>6.00<fcel>93-94<fcel>88-90<fcel>95-96<fcel>100<fcel>92-97<fcel>100<fcel>96-98<nl><rhed>Page-header<fcel>58022<fcel>5.10<fcel>6.70<fcel>5.06<fcel>85-89<fcel>66-76<fcel>90-94<fcel>98-100<fcel>91-92<fcel>97-99<fcel>81-86<nl><rhed>Picture<fcel>45976<fcel>4.21<fcel>2.78<fcel>5.31<fcel>69-71<fcel>56-59<fcel>82-86<fcel>69-82<fcel>80-95<fcel>66-71<fcel>59-76<nl><rhed>Section-header<fcel>142884<fcel>12.60<fcel>15.77<fcel>12.85<fcel>83-84<fcel>76-81<fcel>90-92<fcel>94-95<fcel>87-94<fcel>69-73<fcel>78-86<nl><rhed>Table<fcel>34733<fcel>3.20<fcel>2.27<fcel>3.60<fcel>77-81<fcel>75-80<fcel>83-86<fcel>98-99<fcel>58-80<fcel>79-84<fcel>70-85<nl><rhed>Text<fcel>510377<fcel>45.82<fcel>49.28<fcel>45.00<fcel>84-86<fcel>81-86<fcel>88-93<fcel>89-93<fcel>87-92<fcel>71-79<fcel>87-95<nl><rhed>Title<fcel>5071<fcel>0.47<fcel>0.30<fcel>0.50<fcel>60-72<fcel>24-63<fcel>50-63<fcel>94-100<fcel>82-96<fcel>68-79<fcel>24-56<nl><rhed>Total<fcel>1107470<fcel>941123<fcel>99816<fcel>66531<fcel>82-83<fcel>71-74<fcel>79-81<fcel>89-94<fcel>86-91<fcel>71-76<fcel>68-85<nl><caption><loc_44><loc_54><loc_456><loc_73>Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row 'Total') in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.</caption></otsl>
|
||||
<otsl><loc_81><loc_87><loc_419><loc_186><ecel><ecel><ched>% of Total<lcel><lcel><ched>triple inter-annotator mAP @0.5-0.95 (%)<lcel><lcel><lcel><lcel><lcel><lcel><nl><ched>class label<ched>Count<ched>Train<ched>Test<ched>Val<ched>All<ched>Fin<ched>Man<ched>Sci<ched>Law<ched>Pat<ched>Ten<nl><rhed>Caption<fcel>22524<fcel>2.04<fcel>1.77<fcel>2.32<fcel>84-89<fcel>40-61<fcel>86-92<fcel>94-99<fcel>95-99<fcel>69-78<fcel>n/a<nl><rhed>Footnote<fcel>6318<fcel>0.60<fcel>0.31<fcel>0.58<fcel>83-91<fcel>n/a<fcel>100<fcel>62-88<fcel>85-94<fcel>n/a<fcel>82-97<nl><rhed>Formula<fcel>25027<fcel>2.25<fcel>1.90<fcel>2.96<fcel>83-85<fcel>n/a<fcel>n/a<fcel>84-87<fcel>86-96<fcel>n/a<fcel>n/a<nl><rhed>List-item<fcel>185660<fcel>17.19<fcel>13.34<fcel>15.82<fcel>87-88<fcel>74-83<fcel>90-92<fcel>97-97<fcel>81-85<fcel>75-88<fcel>93-95<nl><rhed>Page-footer<fcel>70878<fcel>6.51<fcel>5.58<fcel>6.00<fcel>93-94<fcel>88-90<fcel>95-96<fcel>100<fcel>92-97<fcel>100<fcel>96-98<nl><rhed>Page-header<fcel>58022<fcel>5.10<fcel>6.70<fcel>5.06<fcel>85-89<fcel>66-76<fcel>90-94<fcel>98-100<fcel>91-92<fcel>97-99<fcel>81-86<nl><rhed>Picture<fcel>45976<fcel>4.21<fcel>2.78<fcel>5.31<fcel>69-71<fcel>56-59<fcel>82-86<fcel>69-82<fcel>80-95<fcel>66-71<fcel>59-76<nl><rhed>Section-header<fcel>142884<fcel>12.60<fcel>15.77<fcel>12.85<fcel>83-84<fcel>76-81<fcel>90-92<fcel>94-95<fcel>87-94<fcel>69-73<fcel>78-86<nl><rhed>Table<fcel>34733<fcel>3.20<fcel>2.27<fcel>3.60<fcel>77-81<fcel>75-80<fcel>83-86<fcel>98-99<fcel>58-80<fcel>79-84<fcel>70-85<nl><rhed>Text<fcel>510377<fcel>45.82<fcel>49.28<fcel>45.00<fcel>84-86<fcel>81-86<fcel>88-93<fcel>89-93<fcel>87-92<fcel>71-79<fcel>87-95<nl><rhed>Title<fcel>5071<fcel>0.47<fcel>0.30<fcel>0.50<fcel>60-72<fcel>24-63<fcel>50-63<fcel>94-100<fcel>82-96<fcel>68-79<fcel>24-56<nl><rhed>Total<fcel>1107470<fcel>941123<fcel>99816<fcel>66531<fcel>82-83<fcel>71-74<fcel>79-81<fcel>89-94<fcel>86-91<fcel>71-76<fcel>68-85<nl><caption><loc_44><loc_54><loc_456><loc_73>Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row 'Total') in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.</caption></otsl>
|
||||
<picture><loc_43><loc_196><loc_242><loc_341><caption><loc_44><loc_350><loc_242><loc_383>Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.</caption></picture>
|
||||
<text><loc_44><loc_401><loc_240><loc_426>we distributed the annotation workload and performed continuous quality controls. Phase one and two required a small team of experts only. For phases three and four, a group of 40 dedicated annotators were assembled and supervised.</text>
|
||||
<text><loc_44><loc_428><loc_241><loc_447><loc_44><loc_428><loc_241><loc_447>Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. A large effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv 3 , government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.</text>
|
||||
<text><loc_44><loc_428><loc_241><loc_447><loc_44><loc_428><loc_241><loc_447>Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. A large effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv 3 , government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.</text>
|
||||
<text><loc_260><loc_239><loc_457><loc_320>Preparation work included uploading and parsing the sourced PDF documents in the Corpus Conversion Service (CCS) [22], a cloud-native platform which provides a visual annotation interface and allows for dataset inspection and analysis. The annotation interface of CCS is shown in Figure 3. The desired balance of pages between the different document categories was achieved by selective subsampling of pages with certain desired properties. For example, we made sure to include the title page of each document and bias the remaining page selection to those with figures or tables. The latter was achieved by leveraging pre-trained object detection models from PubLayNet, which helped us estimate how many figures and tables a given page contains.</text>
|
||||
<text><loc_259><loc_322><loc_457><loc_437>Phase 2: Label selection and guideline. We reviewed the collected documents and identified the most common structural features they exhibit. This was achieved by identifying recurrent layout elements and lead us to the definition of 11 distinct class labels. These 11 class labels are Caption , Footnote , Formula List-item , , Pagefooter , Page-header , Picture , Section-header , Table , Text , and Title . Critical factors that were considered for the choice of these class labels were (1) the overall occurrence of the label, (2) the specificity of the label, (3) recognisability on a single page (i.e. no need for context from previous or next page) and (4) overall coverage of the page. Specificity ensures that the choice of label is not ambiguous, while coverage ensures that all meaningful items on a page can be annotated. We refrained from class labels that are very specific to a document category, such as Abstract in the Scientific Articles category. We also avoided class labels that are tightly linked to the semantics of the text. Labels such as Author and Affiliation , as seen in DocBank, are often only distinguishable by discriminating on</text>
|
||||
<text><loc_259><loc_322><loc_457><loc_437>Phase 2: Label selection and guideline. We reviewed the collected documents and identified the most common structural features they exhibit. This was achieved by identifying recurrent layout elements and led us to the definition of 11 distinct class labels. These 11 class labels are Caption , Footnote , Formula , List-item , Page-footer , Page-header , Picture , Section-header , Table , Text , and Title . Critical factors that were considered for the choice of these class labels were (1) the overall occurrence of the label, (2) the specificity of the label, (3) recognisability on a single page (i.e. no need for context from previous or next page) and (4) overall coverage of the page. Specificity ensures that the choice of label is not ambiguous, while coverage ensures that all meaningful items on a page can be annotated. We refrained from class labels that are very specific to a document category, such as Abstract in the Scientific Articles category. We also avoided class labels that are tightly linked to the semantics of the text. Labels such as Author and Affiliation , as seen in DocBank, are often only distinguishable by discriminating on</text>
|
||||
<footnote><loc_260><loc_443><loc_302><loc_447>3 https://arxiv.org/</footnote>
|
||||
<page_break>
|
||||
<page_header><loc_44><loc_38><loc_284><loc_43>DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis</page_header>
|
||||
@ -84,15 +85,15 @@
|
||||
<text><loc_327><loc_290><loc_389><loc_291>05237a14f2524e3f53c8454b074409d05078038a6a36b770fcc8ec7e540deae0</text>
|
||||
<caption><loc_260><loc_299><loc_457><loc_318>Figure 4: Examples of plausible annotation alternatives for the same page. Criteria in our annotation guideline can resolve cases A to C, while the case D remains ambiguous.</caption>
|
||||
<text><loc_259><loc_332><loc_456><loc_344>were carried out over a timeframe of 12 weeks, after which 8 of the 40 initially allocated annotators did not pass the bar.</text>
|
||||
<text><loc_259><loc_346><loc_457><loc_448>Phase 4: Production annotation. The previously selected 80K pages were annotated with the defined 11 class labels by 32 annotators. This production phase took around three months to complete. All annotations were created online through CCS, which visualises the programmatic PDF text-cells as an overlay on the page. The page annotation are obtained by drawing rectangular bounding-boxes, as shown in Figure 3. With regard to the annotation practices, we implemented a few constraints and capabilities on the tooling level. First, we only allow non-overlapping, vertically oriented, rectangular boxes. For the large majority of documents, this constraint was sufficient and it speeds up the annotation considerably in comparison with arbitrary segmentation shapes. Second, annotator staff were not able to see each other's annotations. This was enforced by design to avoid any bias in the annotation, which could skew the numbers of the inter-annotator agreement (see Table 1). We wanted</text>
|
||||
<text><loc_259><loc_346><loc_457><loc_448>Phase 4: Production annotation. The previously selected 80K pages were annotated with the defined 11 class labels by 32 annotators. This production phase took around three months to complete. All annotations were created online through CCS, which visualises the programmatic PDF text-cells as an overlay on the page. The page annotations are obtained by drawing rectangular bounding-boxes, as shown in Figure 3. With regard to the annotation practices, we implemented a few constraints and capabilities on the tooling level. First, we only allow non-overlapping, vertically oriented, rectangular boxes. For the large majority of documents, this constraint was sufficient and it speeds up the annotation considerably in comparison with arbitrary segmentation shapes. Second, annotator staff were not able to see each other's annotations. This was enforced by design to avoid any bias in the annotation, which could skew the numbers of the inter-annotator agreement (see Table 1). We wanted</text>
|
||||
<page_break>
|
||||
<page_header><loc_44><loc_38><loc_456><loc_43>KDD '22, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar</page_header>
|
||||
<text><loc_44><loc_55><loc_242><loc_115>Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on DocLayNet test set. The MRCNN (Mask R-CNN) and FRCNN (Faster R-CNN) models with ResNet-50 or ResNet-101 backbone were trained based on the network architectures from the detectron2 model zoo (Mask R-CNN R50, R101-FPN 3x, Faster R-CNN R101-FPN 3x), with default configurations. The YOLO implementation utilized was YOLOv5x6 [13]. All models were initialised using pre-trained weights from the COCO 2017 dataset.</text>
|
||||
<text><loc_44><loc_55><loc_242><loc_115>Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on DocLayNet test set. The MRCNN (Mask R-CNN) and FRCNN (Faster R-CNN) models with ResNet-50 or ResNet-101 backbone were trained based on the network architectures from the detectron2 model zoo (Mask R-CNN R50, R101-FPN 3x, Faster R-CNN R101-FPN 3x), with default configurations. The YOLO implementation utilized was YOLOv5x6 [13]. All models were initialised using pre-trained weights from the COCO 2017 dataset.</text>
|
||||
<otsl><loc_51><loc_124><loc_233><loc_222><ecel><ched>human<ched>MRCNN<lcel><ched>FRCNN<ched>YOLO<nl><ecel><ecel><ched>R50<ched>R101<ched>R101<ched>v5x6<nl><rhed>Caption<fcel>84-89<fcel>68.4<fcel>71.5<fcel>70.1<fcel>77.7<nl><rhed>Footnote<fcel>83-91<fcel>70.9<fcel>71.8<fcel>73.7<fcel>77.2<nl><rhed>Formula<fcel>83-85<fcel>60.1<fcel>63.4<fcel>63.5<fcel>66.2<nl><rhed>List-item<fcel>87-88<fcel>81.2<fcel>80.8<fcel>81.0<fcel>86.2<nl><rhed>Page-footer<fcel>93-94<fcel>61.6<fcel>59.3<fcel>58.9<fcel>61.1<nl><rhed>Page-header<fcel>85-89<fcel>71.9<fcel>70.0<fcel>72.0<fcel>67.9<nl><rhed>Picture<fcel>69-71<fcel>71.7<fcel>72.7<fcel>72.0<fcel>77.1<nl><rhed>Section-header<fcel>83-84<fcel>67.6<fcel>69.3<fcel>68.4<fcel>74.6<nl><rhed>Table<fcel>77-81<fcel>82.2<fcel>82.9<fcel>82.2<fcel>86.3<nl><rhed>Text<fcel>84-86<fcel>84.6<fcel>85.8<fcel>85.4<fcel>88.1<nl><rhed>Title<fcel>60-72<fcel>76.7<fcel>80.4<fcel>79.9<fcel>82.7<nl><rhed>All<fcel>82-83<fcel>72.4<fcel>73.5<fcel>73.4<fcel>76.8<nl></otsl>
|
||||
<text><loc_44><loc_234><loc_241><loc_364>to avoid this at any cost in order to have clear, unbiased baseline numbers for human document-layout annotation. Third, we introduced the feature of snapping boxes around text segments to obtain a pixel-accurate annotation and again reduce time and effort. The CCS annotation tool automatically shrinks every user-drawn box to the minimum bounding-box around the enclosed text-cells for all purely text-based segments, which excludes only Table and Picture . For the latter, we instructed annotation staff to minimise inclusion of surrounding whitespace while including all graphical lines. A downside of snapping boxes to enclosed text cells is that some wrongly parsed PDF pages cannot be annotated correctly and need to be skipped. Fourth, we established a way to flag pages as rejected for cases where no valid annotation according to the label guidelines could be achieved. Example cases for this would be PDF pages that render incorrectly or contain layouts that are impossible to capture with non-overlapping rectangles. Such rejected pages are not contained in the final dataset. With all these measures in place, experienced annotation staff managed to annotate a single page in a typical timeframe of 20s to 60s, depending on its complexity.</text>
|
||||
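For illustration of the box-snapping behaviour described above, the following is a minimal Python sketch, not the actual CCS implementation; boxes and text-cells are assumed to be (x0, y0, x1, y1) tuples in page coordinates.

```python
def snap_to_text_cells(drawn_box, text_cells):
    """Shrink a user-drawn box to the minimal bounding box of the text-cells it fully encloses."""
    x0, y0, x1, y1 = drawn_box
    enclosed = [c for c in text_cells
                if c[0] >= x0 and c[1] >= y0 and c[2] <= x1 and c[3] <= y1]
    if not enclosed:
        return drawn_box  # no enclosed text-cells: keep the user-drawn box unchanged
    return (min(c[0] for c in enclosed), min(c[1] for c in enclosed),
            max(c[2] for c in enclosed), max(c[3] for c in enclosed))
```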
<section_header_level_1><loc_44><loc_372><loc_120><loc_378>5 EXPERIMENTS</section_header_level_1>
|
||||
<text><loc_44><loc_387><loc_241><loc_448>The primary goal of DocLayNet is to obtain high-quality ML models capable of accurate document-layout analysis on a wide variety of challenging layouts. As discussed in Section 2, object detection models are currently the easiest to use, due to the standardisation of ground-truth data in COCO format [16] and the availability of general frameworks such as detectron2 [17]. Furthermore, baseline numbers in PubLayNet and DocBank were obtained using standard object detection models such as Mask R-CNN and Faster R-CNN. As such, we will relate to these object detection methods in this</text>
|
||||
<picture><loc_264><loc_57><loc_452><loc_164><caption><loc_260><loc_177><loc_457><loc_216>Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask R-CNNnetworkwithResNet50backbonetrainedonincreasing fractions of the DocLayNet dataset. The learning curve flattens around the 80% mark, indicating that increasing the size of the DocLayNet dataset with similar data will not yield significantly better predictions.</caption></picture>
|
||||
<text><loc_44><loc_387><loc_241><loc_448>The primary goal of DocLayNet is to obtain high-quality ML models capable of accurate document-layout analysis on a wide variety of challenging layouts. As discussed in Section 2, object detection models are currently the easiest to use, due to the standardisation of ground-truth data in COCO format [16] and the availability of general frameworks such as detectron2 [17]. Furthermore, baseline numbers in PubLayNet and DocBank were obtained using standard object detection models such as Mask R-CNN and Faster R-CNN. As such, we will relate to these object detection methods in this</text>
|
||||
<picture><loc_264><loc_57><loc_452><loc_164><caption><loc_260><loc_177><loc_457><loc_216>Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask R-CNN network with ResNet50 backbone trained on increasing fractions of the DocLayNet dataset. The learning curve flattens around the 80% mark, indicating that increasing the size of the DocLayNet dataset with similar data will not yield significantly better predictions.</caption></picture>
|
||||
<text><loc_260><loc_243><loc_456><loc_255>paper and leave the detailed evaluation of more recent methods mentioned in Section 2 for future work.</text>
|
||||
<text><loc_260><loc_257><loc_456><loc_303>In this section, we will present several aspects related to the performance of object detection models on DocLayNet. Similarly as in PubLayNet, we will evaluate the quality of their predictions using mean average precision (mAP) with 10 overlaps that range from 0.5 to 0.95 in steps of 0.05 (mAP@0.5-0.95). These scores are computed by leveraging the evaluation code provided by the COCO API [16].</text>
|
||||
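As a minimal illustration of this evaluation protocol, the sketch below computes mAP@0.5-0.95 with the COCO API (pycocotools); the file names are placeholders, and this is not the paper's evaluation script.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("test_ground_truth_coco.json")          # ground truth in COCO format (placeholder path)
coco_dt = coco_gt.loadRes("model_predictions.json")    # detections with image_id, category_id, bbox, score

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

# stats[0] is AP averaged over IoU thresholds 0.50:0.95 in steps of 0.05, i.e. mAP@0.5-0.95
print("mAP@0.5-0.95:", evaluator.stats[0])
```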
<section_header_level_1><loc_260><loc_314><loc_381><loc_320>Baselines for Object Detection</section_header_level_1>
|
||||
@ -110,15 +111,15 @@
|
||||
<otsl><loc_288><loc_95><loc_427><loc_193><fcel>Class-count<ched>11<lcel><ched>5<lcel><nl><fcel>Split<ched>Doc<ched>Page<ched>Doc<ched>Page<nl><rhed>Caption<fcel>68<fcel>83<ecel><ecel><nl><rhed>Footnote<fcel>71<fcel>84<ecel><ecel><nl><rhed>Formula<fcel>60<fcel>66<ecel><ecel><nl><rhed>List-item<fcel>81<fcel>88<fcel>82<fcel>88<nl><rhed>Page-footer<fcel>62<fcel>89<ecel><ecel><nl><rhed>Page-header<fcel>72<fcel>90<ecel><ecel><nl><rhed>Picture<fcel>72<fcel>82<fcel>72<fcel>82<nl><rhed>Section-header<fcel>68<fcel>83<fcel>69<fcel>83<nl><rhed>Table<fcel>82<fcel>89<fcel>82<fcel>90<nl><rhed>Text<fcel>85<fcel>91<fcel>84<fcel>90<nl><rhed>Title<fcel>77<fcel>81<ecel><ecel><nl><rhed>All<fcel>72<fcel>84<fcel>78<fcel>87<nl></otsl>
|
||||
<text><loc_260><loc_209><loc_457><loc_263>lists in PubLayNet (grouped list-items) versus DocLayNet (separate list-items), the label set of size 4 is the closest to PubLayNet, in the assumption that the List is down-mapped to Text in PubLayNet. The results in Table 3 show that the prediction accuracy on the remaining class labels does not change significantly when other classes are merged into them. The overall macro-average improves by around 5%, in particular when Page-footer and Page-header are excluded.</text>
|
||||
<section_header_level_1><loc_260><loc_272><loc_449><loc_277>Impact of Document Split in Train and Test Set</section_header_level_1>
|
||||
<text><loc_259><loc_281><loc_457><loc_376>Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: pagewise splitting gains ˜ 0% in mAP over the document-wise splitting. 1 Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.</text>
|
||||
<text><loc_259><loc_281><loc_457><loc_376>Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: pagewise splitting gains ˜ 10% in mAP over the document-wise splitting. Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.</text>
|
||||
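A minimal sketch of the document-wise split described above (illustrative only; the "doc_id" field name is an assumption) could look as follows. A page-wise random split would instead shuffle pages directly, allowing pages of the same document to appear in different splits.

```python
import random
from collections import defaultdict

def split_by_document(pages, test_frac=0.1, val_frac=0.1, seed=0):
    """Assign every page of a document to exactly one of train/val/test."""
    by_doc = defaultdict(list)
    for page in pages:
        by_doc[page["doc_id"]].append(page)

    doc_ids = sorted(by_doc)
    random.Random(seed).shuffle(doc_ids)

    n_test = int(len(doc_ids) * test_frac)
    n_val = int(len(doc_ids) * val_frac)
    test_docs = set(doc_ids[:n_test])
    val_docs = set(doc_ids[n_test:n_test + n_val])

    splits = {"train": [], "val": [], "test": []}
    for doc_id, doc_pages in by_doc.items():
        target = "test" if doc_id in test_docs else "val" if doc_id in val_docs else "train"
        splits[target].extend(doc_pages)
    return splits
```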
<section_header_level_1><loc_260><loc_385><loc_342><loc_390>Dataset Comparison</section_header_level_1>
|
||||
<text><loc_260><loc_394><loc_457><loc_447>Throughout this paper, we claim that DocLayNet's wider variety of document layouts leads to more robust layout detection models. In Table 5, we provide evidence for that. We trained models on each of the available datasets (PubLayNet, DocBank and DocLayNet) and evaluated them on the test sets of the other datasets. Due to the different label sets and annotation styles, a direct comparison is not possible. Hence, we focussed on the common labels among the datasets. Between PubLayNet and DocLayNet, these are Picture ,</text>
|
||||
<page_break>
|
||||
<page_header><loc_44><loc_38><loc_456><loc_43>KDD '22, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar</page_header>
|
||||
<text><loc_44><loc_55><loc_242><loc_95>Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network across the PubLayNet, DocBank & DocLayNet data-sets. By evaluating on common label classes of each dataset, we observe that the DocLayNet-trained model has much less pronounced variations in performance across all datasets.</text>
|
||||
<text><loc_44><loc_55><loc_242><loc_95>Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network across the PubLayNet, DocBank & DocLayNet data-sets. By evaluating on common label classes of each dataset, we observe that the DocLayNet-trained model has much less pronounced variations in performance across all datasets.</text>
|
||||
<otsl><loc_59><loc_109><loc_225><loc_215><ecel><ecel><ched>Testing on<lcel><lcel><nl><ched>Training on<ched>labels<ched>PLN<ched>DB<ched>DLN<nl><rhed>PubLayNet (PLN)<rhed>Figure<fcel>96<fcel>43<fcel>23<nl><ucel><rhed>Sec-header<fcel>87<fcel>-<fcel>32<nl><ecel><rhed>Table<fcel>95<fcel>24<fcel>49<nl><ecel><rhed>Text<fcel>96<fcel>-<fcel>42<nl><ecel><rhed>total<fcel>93<fcel>34<fcel>30<nl><rhed>DocBank (DB)<rhed>Figure<fcel>77<fcel>71<fcel>31<nl><ucel><rhed>Table<fcel>19<fcel>65<fcel>22<nl><ucel><rhed>total<fcel>48<fcel>68<fcel>27<nl><rhed>DocLayNet (DLN)<rhed>Figure<fcel>67<fcel>51<fcel>72<nl><ucel><rhed>Sec-header<fcel>53<fcel>-<fcel>68<nl><ecel><rhed>Table<fcel>87<fcel>43<fcel>82<nl><ecel><rhed>Text<fcel>77<fcel>-<fcel>84<nl><ecel><rhed>total<fcel>59<fcel>47<fcel>78<nl></otsl>
|
||||
<text><loc_44><loc_247><loc_240><loc_280>Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .</text>
|
||||
<text><loc_44><loc_282><loc_241><loc_370>For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. We had to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.</text>
|
||||
<text><loc_44><loc_282><loc_241><loc_370>For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. We had to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.</text>
|
||||
<section_header_level_1><loc_44><loc_383><loc_127><loc_388>Example Predictions</section_header_level_1>
|
||||
<text><loc_44><loc_392><loc_241><loc_445>To conclude this section, we illustrate the quality of layout predictions one can expect from DocLayNet-trained models by providing a selection of examples without any further post-processing applied. Figure 6 shows selected layout predictions on pages from the test-set of DocLayNet. Results look decent in general across document categories, however one can also observe mistakes such as overlapping clusters of different classes, or entirely missing boxes due to low confidence.</text>
|
||||
<section_header_level_1><loc_260><loc_55><loc_331><loc_61>6 CONCLUSION</section_header_level_1>
|
||||
@ -129,7 +130,7 @@
|
||||
<unordered_list><list_item><loc_262><loc_220><loc_456><loc_234>[1] Max Göbel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013.</list_item>
|
||||
<list_item><loc_262><loc_235><loc_457><loc_254>[2] Christian Clausner, Apostolos Antonacopoulos, and Stefan Pletschacher. Icdar2017 competition on recognition of documents with complex layouts rdcl2017. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) , volume 01, pages 1404-1410, 2017.</list_item>
|
||||
<list_item><loc_262><loc_256><loc_456><loc_269>[3] Hervé Déjean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), April 2019. http://sac.founderit.com/.</list_item>
|
||||
<list_item><loc_262><loc_271><loc_457><loc_290>[4] Antonio Jimeno Yepes, Peter Zhong, and Douglas Burdick. Competition on scientific literature parsing. In Proceedings of the International Conference on Document Analysis and Recognition , ICDAR, pages 605-617. LNCS 12824, SpringerVerlag, sep 2021.</list_item>
|
||||
<list_item><loc_262><loc_271><loc_457><loc_290>[4] Antonio Jimeno Yepes, Peter Zhong, and Douglas Burdick. Competition on scientific literature parsing. In Proceedings of the International Conference on Document Analysis and Recognition , ICDAR, pages 605-617. LNCS 12824, Springer-Verlag, sep 2021.</list_item>
|
||||
<list_item><loc_262><loc_291><loc_457><loc_310>[5] Logan Markewich, Hao Zhang, Yubin Xing, Navid Lambert-Shirzad, Jiang Zhexin, Roy Lee, Zhi Li, and Seok-Bum Ko. Segmentation for document layout analysis: not dead yet. International Journal on Document Analysis and Recognition (IJDAR) , pages 1-11, 01 2022.</list_item>
|
||||
<list_item><loc_262><loc_311><loc_456><loc_325>[6] Xu Zhong, Jianbin Tang, and Antonio Jimeno-Yepes. Publaynet: Largest dataset ever for document layout analysis. In Proceedings of the International Conference on Document Analysis and Recognition , ICDAR, pages 1015-1022, sep 2019.</list_item>
|
||||
<list_item><loc_262><loc_326><loc_457><loc_350>[7] Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. Docbank: A benchmark dataset for document layout analysis. In Proceedings of the 28th International Conference on Computational Linguistics , COLING, pages 949-960. International Committee on Computational Linguistics, dec 2020.</list_item>
|
||||
|
File diff suppressed because one or more lines are too long
@ -1,3 +1,5 @@
|
||||
2022
|
||||
|
||||
## DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
|
||||
|
||||
Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com
|
||||
@ -44,7 +46,7 @@ Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staa
|
||||
|
||||
Despite the substantial improvements achieved with machine-learning (ML) approaches and deep neural networks in recent years, document conversion remains a challenging problem, as demonstrated by the numerous public competitions held on this topic [1-4]. The challenge originates from the huge variability in PDF documents regarding layout, language and formats (scanned, programmatic or a combination of both). Engineering a single ML model that can be applied on all types of documents and provides high-quality layout segmentation remains to this day extremely challenging [5]. To highlight the variability in document layouts, we show a few example documents from the DocLayNet dataset in Figure 1.
|
||||
|
||||
Akeyproblem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or L A T E X sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank is very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.
|
||||
A key problem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or LaTeX sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank are very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.
|
||||
|
||||
In this paper, we present the DocLayNet dataset. It provides page-by-page layout annotation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique document pages, of which a fraction carry double- or triple-annotations. DocLayNet is similar in spirit to PubLayNet and DocBank and will likewise be made available to the public 1 in order to stimulate the document-layout analysis community. It distinguishes itself in the following aspects:
|
||||
|
||||
@ -63,13 +65,13 @@ In Section 5, we will present baseline accuracy numbers for a variety of object
|
||||
|
||||
## 2 RELATED WORK
|
||||
|
||||
While early approaches in document-layout analysis used rulebased algorithms and heuristics [8], the problem is lately addressed with deep learning methods. The most common approach is to leverage object detection models [9-15]. In the last decade, the accuracy and speed of these models has increased dramatically. Furthermore, most state-of-the-art object detection methods can be trained and applied with very little work, thanks to a standardisation effort of the ground-truth data format [16] and common deep-learning frameworks [17]. Reference data sets such as PubLayNet [6] and DocBank provide their data in the commonly accepted COCO format [16].
|
||||
While early approaches in document-layout analysis used rule-based algorithms and heuristics [8], the problem is lately addressed with deep learning methods. The most common approach is to leverage object detection models [9-15]. In the last decade, the accuracy and speed of these models has increased dramatically. Furthermore, most state-of-the-art object detection methods can be trained and applied with very little work, thanks to a standardisation effort of the ground-truth data format [16] and common deep-learning frameworks [17]. Reference data sets such as PubLayNet [6] and DocBank provide their data in the commonly accepted COCO format [16].
|
||||
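For illustration, a COCO-format layout dataset of this kind could be registered and trained with detectron2 roughly as sketched below; the dataset names and file paths are placeholders, and this is not a reference implementation from any of the datasets.

```python
import os
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register COCO-format annotations (placeholder paths).
register_coco_instances("layout_train", {}, "annotations/train.json", "images/train")
register_coco_instances("layout_val", {}, "annotations/val.json", "images/val")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("layout_train",)
cfg.DATASETS.TEST = ("layout_val",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 11  # e.g. the 11 layout class labels
cfg.OUTPUT_DIR = "./output"
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```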
|
||||
Lately, new types of ML models for document-layout analysis have emerged in the community [18-21]. These models do not approach the problem of layout analysis purely based on an image representation of the page, as computer vision methods do. Instead, they combine the text tokens and image representation of a page in order to obtain a segmentation. While the reported accuracies appear to be promising, a broadly accepted data format which links geometric and textual features has yet to be established.
|
||||
|
||||
## 3 THE DOCLAYNET DATASET
|
||||
|
||||
DocLayNet contains 80863 PDF pages. Among these, 7059 carry two instances of human annotations, and 1591 carry three. This amounts to 91104 total annotation instances. The annotations provide layout information in the shape of labeled, rectangular boundingboxes. We define 11 distinct labels for layout features, namely Caption , Footnote , Formula List-item , , Page-footer , Page-header , Picture , Section-header , Table , Text , and Title . Our reasoning for picking this particular label set is detailed in Section 4.
|
||||
DocLayNet contains 80863 PDF pages. Among these, 7059 carry two instances of human annotations, and 1591 carry three. This amounts to 91104 total annotation instances. The annotations provide layout information in the shape of labeled, rectangular bounding-boxes. We define 11 distinct labels for layout features, namely Caption , Footnote , Formula , List-item , Page-footer , Page-header , Picture , Section-header , Table , Text , and Title . Our reasoning for picking this particular label set is detailed in Section 4.
|
||||
|
||||
In addition to open intellectual property constraints for the source documents, we required that the documents in DocLayNet adhere to a few conditions. Firstly, we kept scanned documents
|
||||
|
||||
@ -97,21 +99,21 @@ The annotation campaign was carried out in four phases. In phase one, we identif
|
||||
|
||||
Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row 'Total') in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.
|
||||
|
||||
| | | % of Total | % of Total | % of Total | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) |
|
||||
|----------------|---------|--------------|--------------|--------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
|
||||
| class label | Count | Train | Test | Val | All | Fin | Man | Sci | Law | Pat | Ten |
|
||||
| Caption | 22524 | 2.04 | 1.77 | 2.32 | 84-89 | 40-61 | 86-92 | 94-99 | 95-99 | 69-78 | n/a |
|
||||
| Footnote | 6318 | 0.60 | 0.31 | 0.58 | 83-91 | n/a | 100 | 62-88 | 85-94 | n/a | 82-97 |
|
||||
| Formula | 25027 | 2.25 | 1.90 | 2.96 | 83-85 | n/a | n/a | 84-87 | 86-96 | n/a | n/a |
|
||||
| List-item | 185660 | 17.19 | 13.34 | 15.82 | 87-88 | 74-83 | 90-92 | 97-97 | 81-85 | 75-88 | 93-95 |
|
||||
| Page-footer | 70878 | 6.51 | 5.58 | 6.00 | 93-94 | 88-90 | 95-96 | 100 | 92-97 | 100 | 96-98 |
|
||||
| Page-header | 58022 | 5.10 | 6.70 | 5.06 | 85-89 | 66-76 | 90-94 | 98-100 | 91-92 | 97-99 | 81-86 |
|
||||
| Picture | 45976 | 4.21 | 2.78 | 5.31 | 69-71 | 56-59 | 82-86 | 69-82 | 80-95 | 66-71 | 59-76 |
|
||||
| Section-header | 142884 | 12.60 | 15.77 | 12.85 | 83-84 | 76-81 | 90-92 | 94-95 | 87-94 | 69-73 | 78-86 |
|
||||
| Table | 34733 | 3.20 | 2.27 | 3.60 | 77-81 | 75-80 | 83-86 | 98-99 | 58-80 | 79-84 | 70-85 |
|
||||
| Text | 510377 | 45.82 | 49.28 | 45.00 | 84-86 | 81-86 | 88-93 | 89-93 | 87-92 | 71-79 | 87-95 |
|
||||
| Title | 5071 | 0.47 | 0.30 | 0.50 | 60-72 | 24-63 | 50-63 | 94-100 | 82-96 | 68-79 | 24-56 |
|
||||
| Total | 1107470 | 941123 | 99816 | 66531 | 82-83 | 71-74 | 79-81 | 89-94 | 86-91 | 71-76 | 68-85 |
|
||||
| | | % of Total | % of Total | % of Total | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) |
|
||||
|----------------|---------|--------------|--------------|--------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|
|
||||
| class label | Count | Train | Test | Val | All | Fin | Man | Sci | Law | Pat | Ten |
|
||||
| Caption | 22524 | 2.04 | 1.77 | 2.32 | 84-89 | 40-61 | 86-92 | 94-99 | 95-99 | 69-78 | n/a |
|
||||
| Footnote | 6318 | 0.60 | 0.31 | 0.58 | 83-91 | n/a | 100 | 62-88 | 85-94 | n/a | 82-97 |
|
||||
| Formula | 25027 | 2.25 | 1.90 | 2.96 | 83-85 | n/a | n/a | 84-87 | 86-96 | n/a | n/a |
|
||||
| List-item | 185660 | 17.19 | 13.34 | 15.82 | 87-88 | 74-83 | 90-92 | 97-97 | 81-85 | 75-88 | 93-95 |
|
||||
| Page-footer | 70878 | 6.51 | 5.58 | 6.00 | 93-94 | 88-90 | 95-96 | 100 | 92-97 | 100 | 96-98 |
|
||||
| Page-header | 58022 | 5.10 | 6.70 | 5.06 | 85-89 | 66-76 | 90-94 | 98-100 | 91-92 | 97-99 | 81-86 |
|
||||
| Picture | 45976 | 4.21 | 2.78 | 5.31 | 69-71 | 56-59 | 82-86 | 69-82 | 80-95 | 66-71 | 59-76 |
|
||||
| Section-header | 142884 | 12.60 | 15.77 | 12.85 | 83-84 | 76-81 | 90-92 | 94-95 | 87-94 | 69-73 | 78-86 |
|
||||
| Table | 34733 | 3.20 | 2.27 | 3.60 | 77-81 | 75-80 | 83-86 | 98-99 | 58-80 | 79-84 | 70-85 |
|
||||
| Text | 510377 | 45.82 | 49.28 | 45.00 | 84-86 | 81-86 | 88-93 | 89-93 | 87-92 | 71-79 | 87-95 |
|
||||
| Title | 5071 | 0.47 | 0.30 | 0.50 | 60-72 | 24-63 | 50-63 | 94-100 | 82-96 | 68-79 | 24-56 |
|
||||
| Total | 1107470 | 941123 | 99816 | 66531 | 82-83 | 71-74 | 79-81 | 89-94 | 86-91 | 71-76 | 68-85 |
|
||||
|
||||
Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.
|
||||
|
||||
@ -119,11 +121,11 @@ Figure 3: Corpus Conversion Service annotation user interface. The PDF page is s
|
||||
|
||||
we distributed the annotation workload and performed continuous quality controls. Phase one and two required a small team of experts only. For phases three and four, a group of 40 dedicated annotators were assembled and supervised.
|
||||
|
||||
Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. A large effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv 3 , government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.
|
||||
Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. A large effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv 3 , government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.
|
||||
|
||||
Preparation work included uploading and parsing the sourced PDF documents in the Corpus Conversion Service (CCS) [22], a cloud-native platform which provides a visual annotation interface and allows for dataset inspection and analysis. The annotation interface of CCS is shown in Figure 3. The desired balance of pages between the different document categories was achieved by selective subsampling of pages with certain desired properties. For example, we made sure to include the title page of each document and bias the remaining page selection to those with figures or tables. The latter was achieved by leveraging pre-trained object detection models from PubLayNet, which helped us estimate how many figures and tables a given page contains.
|
||||
|
||||
Phase 2: Label selection and guideline. We reviewed the collected documents and identified the most common structural features they exhibit. This was achieved by identifying recurrent layout elements and lead us to the definition of 11 distinct class labels. These 11 class labels are Caption , Footnote , Formula List-item , , Pagefooter , Page-header , Picture , Section-header , Table , Text , and Title . Critical factors that were considered for the choice of these class labels were (1) the overall occurrence of the label, (2) the specificity of the label, (3) recognisability on a single page (i.e. no need for context from previous or next page) and (4) overall coverage of the page. Specificity ensures that the choice of label is not ambiguous, while coverage ensures that all meaningful items on a page can be annotated. We refrained from class labels that are very specific to a document category, such as Abstract in the Scientific Articles category. We also avoided class labels that are tightly linked to the semantics of the text. Labels such as Author and Affiliation , as seen in DocBank, are often only distinguishable by discriminating on
|
||||
Phase 2: Label selection and guideline. We reviewed the collected documents and identified the most common structural features they exhibit. This was achieved by identifying recurrent layout elements and led us to the definition of 11 distinct class labels. These 11 class labels are Caption , Footnote , Formula , List-item , Page-footer , Page-header , Picture , Section-header , Table , Text , and Title . Critical factors that were considered for the choice of these class labels were (1) the overall occurrence of the label, (2) the specificity of the label, (3) recognisability on a single page (i.e. no need for context from previous or next page) and (4) overall coverage of the page. Specificity ensures that the choice of label is not ambiguous, while coverage ensures that all meaningful items on a page can be annotated. We refrained from class labels that are very specific to a document category, such as Abstract in the Scientific Articles category. We also avoided class labels that are tightly linked to the semantics of the text. Labels such as Author and Affiliation , as seen in DocBank, are often only distinguishable by discriminating on
|
||||
|
||||
the textual content of an element, which goes beyond visual layout recognition, in particular outside the Scientific Articles category.
|
||||
|
||||
@ -148,9 +150,9 @@ Phase 3: Training. After a first trial with a small group of people, we realised
|
||||
|
||||
were carried out over a timeframe of 12 weeks, after which 8 of the 40 initially allocated annotators did not pass the bar.
|
||||
|
||||
Phase 4: Production annotation. The previously selected 80K pages were annotated with the defined 11 class labels by 32 annotators. This production phase took around three months to complete. All annotations were created online through CCS, which visualises the programmatic PDF text-cells as an overlay on the page. The page annotation are obtained by drawing rectangular bounding-boxes, as shown in Figure 3. With regard to the annotation practices, we implemented a few constraints and capabilities on the tooling level. First, we only allow non-overlapping, vertically oriented, rectangular boxes. For the large majority of documents, this constraint was sufficient and it speeds up the annotation considerably in comparison with arbitrary segmentation shapes. Second, annotator staff were not able to see each other's annotations. This was enforced by design to avoid any bias in the annotation, which could skew the numbers of the inter-annotator agreement (see Table 1). We wanted
|
||||
Phase 4: Production annotation. The previously selected 80K pages were annotated with the defined 11 class labels by 32 annotators. This production phase took around three months to complete. All annotations were created online through CCS, which visualises the programmatic PDF text-cells as an overlay on the page. The page annotations are obtained by drawing rectangular bounding-boxes, as shown in Figure 3. With regard to the annotation practices, we implemented a few constraints and capabilities on the tooling level. First, we only allow non-overlapping, vertically oriented, rectangular boxes. For the large majority of documents, this constraint was sufficient and it speeds up the annotation considerably in comparison with arbitrary segmentation shapes. Second, annotator staff were not able to see each other's annotations. This was enforced by design to avoid any bias in the annotation, which could skew the numbers of the inter-annotator agreement (see Table 1). We wanted
|
||||
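The non-overlap constraint mentioned above can be expressed as a simple axis-aligned rectangle test; the sketch below is illustrative, not the actual CCS validation logic, and assumes (x0, y0, x1, y1) boxes.

```python
def boxes_overlap(a, b):
    """True if two (x0, y0, x1, y1) rectangles share any interior area."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def violates_constraint(new_box, existing_boxes):
    """Reject a new annotation box if it overlaps any previously drawn box on the page."""
    return any(boxes_overlap(new_box, box) for box in existing_boxes)
```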
|
||||
Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on DocLayNet test set. The MRCNN (Mask R-CNN) and FRCNN (Faster R-CNN) models with ResNet-50 or ResNet-101 backbone were trained based on the network architectures from the detectron2 model zoo (Mask R-CNN R50, R101-FPN 3x, Faster R-CNN R101-FPN 3x), with default configurations. The YOLO implementation utilized was YOLOv5x6 [13]. All models were initialised using pre-trained weights from the COCO 2017 dataset.
|
||||
Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on DocLayNet test set. The MRCNN (Mask R-CNN) and FRCNN (Faster R-CNN) models with ResNet-50 or ResNet-101 backbone were trained based on the network architectures from the detectron2 model zoo (Mask R-CNN R50, R101-FPN 3x, Faster R-CNN R101-FPN 3x), with default configurations. The YOLO implementation utilized was YOLOv5x6 [13]. All models were initialised using pre-trained weights from the COCO 2017 dataset.
|
||||
|
||||
| | human | MRCNN | MRCNN | FRCNN | YOLO |
|
||||
|----------------|---------|---------|---------|---------|--------|
|
||||
@ -172,9 +174,9 @@ to avoid this at any cost in order to have clear, unbiased baseline numbers for
|
||||
|
||||
## 5 EXPERIMENTS
|
||||
|
||||
The primary goal of DocLayNet is to obtain high-quality ML models capable of accurate document-layout analysis on a wide variety of challenging layouts. As discussed in Section 2, object detection models are currently the easiest to use, due to the standardisation of ground-truth data in COCO format [16] and the availability of general frameworks such as detectron2 [17]. Furthermore, baseline numbers in PubLayNet and DocBank were obtained using standard object detection models such as Mask R-CNN and Faster R-CNN. As such, we will relate to these object detection methods in this
|
||||
The primary goal of DocLayNet is to obtain high-quality ML models capable of accurate document-layout analysis on a wide variety of challenging layouts. As discussed in Section 2, object detection models are currently the easiest to use, due to the standardisation of ground-truth data in COCO format [16] and the availability of general frameworks such as detectron2 [17]. Furthermore, baseline numbers in PubLayNet and DocBank were obtained using standard object detection models such as Mask R-CNN and Faster R-CNN. As such, we will relate to these object detection methods in this
|
||||
|
||||
Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask R-CNNnetworkwithResNet50backbonetrainedonincreasing fractions of the DocLayNet dataset. The learning curve flattens around the 80% mark, indicating that increasing the size of the DocLayNet dataset with similar data will not yield significantly better predictions.
|
||||
Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask R-CNN network with ResNet50 backbone trained on increasing fractions of the DocLayNet dataset. The learning curve flattens around the 80% mark, indicating that increasing the size of the DocLayNet dataset with similar data will not yield significantly better predictions.
|
||||
|
||||
<!-- image -->
|
||||
|
||||
@ -233,13 +235,13 @@ lists in PubLayNet (grouped list-items) versus DocLayNet (separate list-items),
|
||||
|
||||
## Impact of Document Split in Train and Test Set
|
||||
|
||||
Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: pagewise splitting gains ˜ 0% in mAP over the document-wise splitting. 1 Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.
|
||||
Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: pagewise splitting gains ˜ 10% in mAP over the document-wise splitting. Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.
|
||||
|
||||
## Dataset Comparison
|
||||
|
||||
Throughout this paper, we claim that DocLayNet's wider variety of document layouts leads to more robust layout detection models. In Table 5, we provide evidence for that. We trained models on each of the available datasets (PubLayNet, DocBank and DocLayNet) and evaluated them on the test sets of the other datasets. Due to the different label sets and annotation styles, a direct comparison is not possible. Hence, we focussed on the common labels among the datasets. Between PubLayNet and DocLayNet, these are Picture ,
|
||||
|
||||
Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network across the PubLayNet, DocBank & DocLayNet data-sets. By evaluating on common label classes of each dataset, we observe that the DocLayNet-trained model has much less pronounced variations in performance across all datasets.
|
||||
Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network across the PubLayNet, DocBank & DocLayNet data-sets. By evaluating on common label classes of each dataset, we observe that the DocLayNet-trained model has much less pronounced variations in performance across all datasets.
|
||||
|
||||
| | | Testing on | Testing on | Testing on |
|
||||
|-----------------|------------|--------------|--------------|--------------|
|
||||
@ -260,7 +262,7 @@ Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network acros
|
||||
|
||||
Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .
|
||||
|
||||
For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. We had to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.
|
||||
For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. We had to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.
|
||||
|
||||
## Example Predictions
|
||||
|
||||
@ -279,7 +281,7 @@ To date, there is still a significant gap between human and ML accuracy on the l
|
||||
- [1] Max Göbel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013.
- [2] Christian Clausner, Apostolos Antonacopoulos, and Stefan Pletschacher. ICDAR2017 competition on recognition of documents with complex layouts RDCL2017. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 01, pages 1404-1410, 2017.
- [3] Hervé Déjean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), April 2019. http://sac.founderit.com/.
- [4] Antonio Jimeno Yepes, Peter Zhong, and Douglas Burdick. Competition on scientific literature parsing. In Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pages 605-617. LNCS 12824, Springer-Verlag, sep 2021.
- [5] Logan Markewich, Hao Zhang, Yubin Xing, Navid Lambert-Shirzad, Jiang Zhexin, Roy Lee, Zhi Li, and Seok-Bum Ko. Segmentation for document layout analysis: not dead yet. International Journal on Document Analysis and Recognition (IJDAR), pages 1-11, 01 2022.
- [6] Xu Zhong, Jianbin Tang, and Antonio Jimeno-Yepes. PubLayNet: Largest dataset ever for document layout analysis. In Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pages 1015-1022, sep 2019.
- [7] Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. DocBank: A benchmark dataset for document layout analysis. In Proceedings of the 28th International Conference on Computational Linguistics, COLING, pages 949-960. International Committee on Computational Linguistics, dec 2020.
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
@@ -1,4 +1,5 @@
<doctag><page_header><loc_15><loc_104><loc_30><loc_350>arXiv:2305.03393v1 [cs.CV] 5 May 2023</page_header>
<doctag><page_header><loc_15><loc_136><loc_30><loc_350>arXiv:2305.03393v1 [cs.CV] 5 May</page_header>
<text><loc_15><loc_104><loc_30><loc_129>2023</text>
<section_header_level_1><loc_110><loc_73><loc_393><loc_92>Optimized Table Tokenization for Table Structure Recognition</section_header_level_1>
<text><loc_114><loc_107><loc_389><loc_126>Maksym Lysak [0000 - 0002 - 3723 - 6960] , Ahmed Nassar [0000 - 0002 - 9468 - 0822] , Nikolaos Livathinos [0000 - 0001 - 8513 - 3491] , Christoph Auer [0000 - 0001 - 5761 - 0422] , [0000 - 0002 - 8088 - 0823]</text>
<text><loc_188><loc_123><loc_244><loc_129>and Peter Staar</text>
@@ -27,7 +28,7 @@
<page_header><loc_110><loc_58><loc_114><loc_65>4</page_header>
<page_header><loc_137><loc_58><loc_189><loc_65>M. Lysak, et al.</page_header>
<text><loc_110><loc_75><loc_393><loc_164>Other work [20] aims at predicting a grid for each table and deciding which cells must be merged using an attention network. Im2Seq methods cast the problem as a sequence generation task [4,5,9,22], and therefore need an internal tablestructure representation language, which is often implemented with standard markup languages (e.g. HTML, LaTeX, Markdown). In theory, Im2Seq methods have a natural advantage over the OD and GNN methods by virtue of directly predicting the table-structure. As such, no post-processing or rules are needed in order to obtain the table-structure, which is necessary with OD and GNN approaches. In practice, this is not entirely true, because a predicted sequence of table-structure markup does not necessarily have to be syntactically correct. Hence, depending on the quality of the predicted sequence, some post-processing needs to be performed to ensure a syntactically valid (let alone correct) sequence.</text>
<text><loc_110><loc_166><loc_393><loc_307>Within the Im2Seq method, we find several popular models, namely the encoder-dual-decoder model (EDD) [22], TableFormer [9], Tabsplitter[2] and Ye et. al. [19]. EDD uses two consecutive long short-term memory (LSTM) decoders to predict a table in HTML representation. The tag decoder predicts a sequence of HTML tags. For each decoded table cell ( <td> ), the attention is passed to the cell decoder to predict the content with an embedded OCR approach. The latter makes it susceptible to transcription errors in the cell content of the table. TableFormer address this reliance on OCR and uses two transformer decoders for HTML structure and cell bounding box prediction in an end-to-end architecture. The predicted cell bounding box is then used to extract text tokens from an originating (digital) PDF page, circumventing any need for OCR. TabSplitter [2] proposes a compact double-matrix representation of table rows and columns to do error detection and error correction of HTML structure sequences based on predictions from [19]. This compact double-matrix representation can not be used directly by the Img2seq model training, so the model uses HTML as an intermediate form. Chi et. al. [4] introduce a data set and a baseline method using bidirectional LSTMs to predict LaTeX code. Kayal [5] introduces Gated ResNet transformers to predict LaTeX code, and a separate OCR module to extract content.</text>
<text><loc_110><loc_309><loc_393><loc_368>Im2Seq approaches have shown to be well-suited for the TSR task and allow a full end-to-end network design that can output the final table structure without pre- or post-processing logic. Furthermore, Im2Seq models have demonstrated to deliver state-of-the-art prediction accuracy [9]. This motivated the authors to investigate if the performance (both in accuracy and inference time) can be further improved by optimising the table structure representation language. We believe this is a necessary step before further improving neural network architectures for this task.</text>
<section_header_level_1><loc_110><loc_382><loc_220><loc_389>3 Problem Statement</section_header_level_1>
<text><loc_110><loc_399><loc_393><loc_420>All known Im2Seq based models for TSR fundamentally work in similar ways. Given an image of a table, the Im2Seq model predicts the structure of the table by generating a sequence of tokens. These tokens originate from a finite vocab-</text>
File diff suppressed because one or more lines are too long
@@ -1,3 +1,5 @@
2023
## Optimized Table Tokenization for Table Structure Recognition
Maksym Lysak [0000 - 0002 - 3723 - 6960] , Ahmed Nassar [0000 - 0002 - 9468 - 0822] , Nikolaos Livathinos [0000 - 0001 - 8513 - 3491] , Christoph Auer [0000 - 0001 - 5761 - 0422] , [0000 - 0002 - 8088 - 0823]
@@ -38,7 +40,7 @@ Approaches to formalize the logical structure and layout of tables in electronic
Other work [20] aims at predicting a grid for each table and deciding which cells must be merged using an attention network. Im2Seq methods cast the problem as a sequence generation task [4,5,9,22], and therefore need an internal tablestructure representation language, which is often implemented with standard markup languages (e.g. HTML, LaTeX, Markdown). In theory, Im2Seq methods have a natural advantage over the OD and GNN methods by virtue of directly predicting the table-structure. As such, no post-processing or rules are needed in order to obtain the table-structure, which is necessary with OD and GNN approaches. In practice, this is not entirely true, because a predicted sequence of table-structure markup does not necessarily have to be syntactically correct. Hence, depending on the quality of the predicted sequence, some post-processing needs to be performed to ensure a syntactically valid (let alone correct) sequence.
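To illustrate why such post-processing can be necessary, the sketch below checks whether a predicted sequence of HTML structure tags is well formed (tags balanced, cells only opened inside rows). It is a simplified validator for illustration, not the recovery logic of any of the cited models.

```python
# Minimal well-formedness check for a predicted table-structure tag sequence.
OPEN_FOR = {
    "</table>": "<table>",
    "</thead>": "<thead>",
    "</tbody>": "<tbody>",
    "</tr>": "<tr>",
    "</td>": "<td>",
}

def is_well_formed(tokens):
    stack = []
    for tok in tokens:
        if tok in OPEN_FOR.values():                       # opening tag
            if tok == "<td>" and (not stack or stack[-1] != "<tr>"):
                return False                               # cell outside a row
            stack.append(tok)
        elif tok in OPEN_FOR:                              # closing tag
            if not stack or stack[-1] != OPEN_FOR[tok]:
                return False                               # mismatched nesting
            stack.pop()
        else:
            return False                                   # unknown token
    return not stack                                       # everything must be closed

print(is_well_formed(["<table>", "<tr>", "<td>", "</td>", "</tr>", "</table>"]))  # True
print(is_well_formed(["<table>", "<td>", "</td>", "</table>"]))                   # False
```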
Within the Im2Seq method, we find several popular models, namely the encoder-dual-decoder model (EDD) [22], TableFormer [9], Tabsplitter[2] and Ye et. al. [19]. EDD uses two consecutive long short-term memory (LSTM) decoders to predict a table in HTML representation. The tag decoder predicts a sequence of HTML tags. For each decoded table cell ( <td> ), the attention is passed to the cell decoder to predict the content with an embedded OCR approach. The latter makes it susceptible to transcription errors in the cell content of the table. TableFormer address this reliance on OCR and uses two transformer decoders for HTML structure and cell bounding box prediction in an end-to-end architecture. The predicted cell bounding box is then used to extract text tokens from an originating (digital) PDF page, circumventing any need for OCR. TabSplitter [2] proposes a compact double-matrix representation of table rows and columns to do error detection and error correction of HTML structure sequences based on predictions from [19]. This compact double-matrix representation can not be used directly by the Img2seq model training, so the model uses HTML as an intermediate form. Chi et. al. [4] introduce a data set and a baseline method using bidirectional LSTMs to predict LaTeX code. Kayal [5] introduces Gated ResNet transformers to predict LaTeX code, and a separate OCR module to extract content.
Im2Seq approaches have shown to be well-suited for the TSR task and allow a full end-to-end network design that can output the final table structure without pre- or post-processing logic. Furthermore, Im2Seq models have demonstrated to deliver state-of-the-art prediction accuracy [9]. This motivated the authors to investigate if the performance (both in accuracy and inference time) can be further improved by optimising the table structure representation language. We believe this is a necessary step before further improving neural network architectures for this task.
File diff suppressed because one or more lines are too long
@@ -1,17 +1,17 @@
<doctag><text><loc_61><loc_30><loc_262><loc_59>pulleys, provided the inner race of the bearing is clamped to the supporting structure by the nut and bolt. Plates must be attached to the structure in a positive manner to eliminate rotation or misalignment when tightening the bolts or screws.</text>
<text><loc_61><loc_70><loc_262><loc_116>The two general types of self-locking nuts currently in use are the all-metal type and the fiber lock type. For the sake of simplicity, only three typical kinds of self-locking nuts are considered in this handbook: the Boots self-locking and the stainless steel self-locking nuts, representing the all-metal types; and the elastic stop nut, representing the fiber insert type.</text>
<section_header_level_1><loc_61><loc_127><loc_141><loc_132>Boots Self-Locking Nut</section_header_level_1>
<text><loc_61><loc_136><loc_262><loc_182>nut is of one piece, all-metal The Boots self-locking construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.</text>
<text><loc_61><loc_193><loc_262><loc_238>The spring keeps the locking and load-carrying sections such a distance apart that the two sets of threads are out of phase or spaced so that a bolt, which has been screwed through the load-carrying section, must push the locking section outward against the force of the spring to engage the threads of the locking section properly.</text>
<text><loc_61><loc_249><loc_262><loc_311>The spring, through the medium of the locking section, exerts a constant locking force on the bolt in the same direction as a force that would tighten the nut. In this nut, the load-carrying section has the thread strength of a standard nut of comparable size, while the locking section presses against the threads of the bolt and locks the nut firmly in position. Only a wrench applied to the nut loosens it. The nut can be removed and reused without impairing its efficiency.</text>
<text><loc_61><loc_322><loc_262><loc_335>Boots self-locking nuts are made with three different spring styles and in various shapes and sizes. The wing type that is</text>
<picture><loc_59><loc_343><loc_261><loc_449><caption><loc_61><loc_455><loc_155><loc_460>Figure 7-26. Self-locking nuts.</caption></picture>
<text><loc_270><loc_30><loc_472><loc_76>the most common ranges in size for No. 6 up to 1 4 inch, the / Rol-top ranges from 1 4 inch to / 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.</text>
<text><loc_270><loc_78><loc_274><loc_84>.</text>
<section_header_level_1><loc_270><loc_86><loc_380><loc_92>Stainless Steel Self-Locking Nut</section_header_level_1>
<text><loc_270><loc_96><loc_472><loc_230>The stainless steel self-locking nut may be spun on and off by hand as its locking action takes places only when the nut is seated against a solid surface and tightened. The nut consists of two parts: a case with a beveled locking shoulder and key and a thread insert with a locking shoulder and slotted keyway. Until the nut is tightened, it spins on the bolt easily, because the threaded insert is the proper size for the bolt. However, when the nut is seated against a solid surface and tightened, the locking shoulder of the insert is pulled downward and wedged against the locking shoulder of the case. This action compresses the threaded insert and causes it to clench the bolt tightly. The cross-sectional view in Figure 7-27 shows how the key of the case fits into the slotted keyway of the insert so that when the case is turned, the threaded insert is turned with it. Note that the slot is wider than the key. This permits the slot to be narrowed and the insert to be compressed when the nut is tightened.</text>
<section_header_level_1><loc_270><loc_241><loc_327><loc_246>Elastic Stop Nut</section_header_level_1>
<text><loc_270><loc_250><loc_470><loc_264>The elastic stop nut is a standard nut with the height increased to accommodate a fiber locking collar. This</text>
<picture><loc_270><loc_272><loc_470><loc_447><caption><loc_270><loc_454><loc_405><loc_459>Figure 7-27. Stainless steel self-locking nut.</caption></picture>
<page_footer><loc_453><loc_472><loc_472><loc_478>7-45</page_footer>
<doctag><text><loc_61><loc_30><loc_260><loc_59>pulleys, provided the inner race of the bearing is clamped to the supporting structure by the nut and bolt. Plates must be attached to the structure in a positive manner to eliminate rotation or misalignment when tightening the bolts or screws.</text>
<text><loc_61><loc_70><loc_260><loc_116>The two general types of self-locking nuts currently in use are the all-metal type and the fiber lock type. For the sake of simplicity, only three typical kinds of self-locking nuts are considered in this handbook: the Boots self-locking and the stainless steel self-locking nuts, representing the all-metal types; andthe elastic stop nut, representing the fiber insert type.</text>
<section_header_level_1><loc_61><loc_127><loc_139><loc_132>Boots Self-Locking Nut</section_header_level_1>
<text><loc_61><loc_136><loc_260><loc_182>The Boots self-locking nut is of one piece, all-metal construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.</text>
<text><loc_61><loc_193><loc_260><loc_238>The spring keeps the locking and load-carrying sections such a distance apart that the two sets of threads are out of phase or spaced so that a bolt, which has been screwed through the load-carrying section, must push the locking section outward against the force of the spring to engage the threads of the locking section properly.</text>
<text><loc_61><loc_249><loc_260><loc_311>The spring, through the mediumofthe locking section, exerts a constant locking force on the bolt in the same direction as a force that would tighten the nut. In this nut, the load-carrying section has the thread strength of a standard nut of comparable size, while the locking section presses against the threads of the bolt and locks the nut firmly in position. Only a wrench applied to the nut loosens it. The nut can be removed and reused without impairing its efficiency.</text>
<text><loc_61><loc_322><loc_260><loc_335>Boots self-locking nuts are made with three different spring styles and in various shapes and sizes. The wing type that is</text>
<picture><loc_59><loc_343><loc_261><loc_449><caption><loc_61><loc_455><loc_153><loc_460>Figure 7-26. Self-locking nuts.</caption></picture>
<text><loc_270><loc_30><loc_470><loc_76>the most common ranges in size for No. 6 up to 1 / 4 inch, the Rol-top ranges from 1 / 4 inch to 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.</text>
<text><loc_270><loc_78><loc_272><loc_84>.</text>
<section_header_level_1><loc_270><loc_86><loc_378><loc_92>Stainless Steel Self-Locking Nut</section_header_level_1>
<text><loc_270><loc_96><loc_470><loc_230>The stainless steel self-locking nut may be spun on and off by hand as its locking action takes places only when the nut is seated against a solid surface and tightened. The nut consists of two parts: a case with a beveled locking shoulder and key and a thread insert with a locking shoulder and slotted keyway. Until the nut is tightened, it spins on the bolt easily, because the threaded insert is the proper size for the bolt. However, when the nut is seated against a solid surface and tightened, the locking shoulder of the insert is pulled downward and wedged against the locking shoulder of the case. This action compresses the threaded insert and causes it to clench the bolt tightly. The cross-sectional view in Figure 7-27 shows how the key of the case fits into the slotted keyway of the insert so that when the case is turned, the threaded insert is turned with it. Note that the slot is wider than the key. This permits the slot to be narrowed and the insert to be compressed when the nut is tightened.</text>
<section_header_level_1><loc_270><loc_241><loc_325><loc_246>Elastic Stop Nut</section_header_level_1>
<text><loc_270><loc_250><loc_470><loc_264>The elastic stop nut is a standard nut with the height increased to accommodate a fiber locking collar. This</text>
<picture><loc_270><loc_272><loc_470><loc_447><caption><loc_270><loc_454><loc_404><loc_459>Figure 7-27. Stainless steel self-locking nut.</caption></picture>
<page_footer><loc_453><loc_472><loc_470><loc_478>7-45</page_footer>
</doctag>
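For readers who want to inspect these DocTags ground-truth files, a small regex-based reader can recover (label, bounding box, text) triples. This is an illustrative sketch rather than docling's own parser, and reading the four loc tokens as left/top/right/bottom coordinates is an assumption based on the examples above.

```python
import re

# Matches elements of the form <label><loc_l><loc_t><loc_r><loc_b>...</label>.
ELEMENT = re.compile(
    r"<(?P<label>[a-z0-9_]+)>"
    r"<loc_(?P<l>\d+)><loc_(?P<t>\d+)><loc_(?P<r>\d+)><loc_(?P<b>\d+)>"
    r"(?P<body>.*?)</(?P=label)>",
    re.DOTALL,
)

def parse_doctags(doc: str):
    items = []
    for m in ELEMENT.finditer(doc):
        items.append({
            "label": m.group("label"),
            "bbox": tuple(int(m.group(k)) for k in ("l", "t", "r", "b")),
            # strip nested tags (e.g. <caption>, <loc_*>) but keep their text
            "text": re.sub(r"<[^>]+>", " ", m.group("body")).strip(),
        })
    return items

sample = ("<doctag><section_header_level_1><loc_61><loc_127><loc_141><loc_132>"
          "Boots Self-Locking Nut</section_header_level_1></doctag>")
print(parse_doctags(sample))
# [{'label': 'section_header_level_1', 'bbox': (61, 127, 141, 132), 'text': 'Boots Self-Locking Nut'}]
```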
File diff suppressed because one or more lines are too long