Update test-cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Christoph Auer 2025-03-25 13:49:08 +01:00
commit f1f7df49e3
131 changed files with 3536 additions and 1629 deletions

11
.actor/.dockerignore Normal file
View File

@ -0,0 +1,11 @@
**/__pycache__
**/*.pyc
**/*.pyo
**/*.pyd
.git
.gitignore
.env
.venv
*.log
.pytest_cache
.coverage

69
.actor/CHANGELOG.md Normal file
View File

@ -0,0 +1,69 @@
# Changelog
All notable changes to the Docling Actor will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [1.1.0] - 2025-03-09
### Changed
- Switched from full Docling CLI to docling-serve API
- Using the official quay.io/ds4sd/docling-serve-cpu Docker image
- Reduced Docker image size (from ~6GB to ~4GB)
- Implemented multi-stage Docker build to handle dependencies
- Improved Docker build process to ensure compatibility with docling-serve-cpu image
- Added new Python processor script for reliable API communication and content extraction
- Enhanced response handling with better content extraction logic
- Fixed ES modules compatibility issue with Apify CLI
- Added explicit tmpfs volume for temporary files
- Fixed environment variables format in actor.json
- Created optimized dependency installation approach
- Improved API compatibility with docling-serve
- Updated endpoint from custom `/convert` to standard `/v1alpha/convert/source`
- Revised JSON payload structure to match docling-serve API format
- Added proper output field parsing based on format
- Enhanced startup process with health checks
- Added configurable API host and port through environment variables
- Better content type handling for different output formats
- Updated error handling to align with API responses
### Fixed
- Fixed actor input file conflict in get_actor_input(): now checks for and removes an existing /tmp/actor-input/INPUT directory if found, ensuring valid JSON input parsing.
### Technical Details
- Actor Specification v1
- Using quay.io/ds4sd/docling-serve-cpu:latest base image
- Node.js 20.x for Apify CLI
- Eliminated Python dependencies
- Simplified Docker build process
## [1.0.0] - 2025-02-07
### Added
- Initial release of Docling Actor
- Support for multiple document formats (PDF, DOCX, images)
- OCR capabilities for scanned documents
- Multiple output formats (md, json, html, text, doctags)
- Comprehensive error handling and logging
- Dataset records with processing status
- Memory monitoring and resource optimization
- Security features including non-root user execution
### Technical Details
- Actor Specification v1
- Docling v2.17.0
- Python 3.11
- Node.js 20.x
- Comprehensive error codes:
- 10: Invalid input
- 11: URL inaccessible
- 12: Docling processing failed
- 13: Output file missing
- 14: Storage operation failed
- 15: OCR processing failed

87
.actor/Dockerfile Normal file
View File

@ -0,0 +1,87 @@
# Build stage for installing dependencies
FROM node:20-slim AS builder
# Install necessary tools and prepare dependencies environment in one layer
RUN apt-get update && apt-get install -y --no-install-recommends \
ca-certificates \
&& rm -rf /var/lib/apt/lists/* \
&& mkdir -p /build/bin /build/lib/node_modules \
&& cp /usr/local/bin/node /build/bin/
# Set working directory
WORKDIR /build
# Create package.json and install Apify CLI in one layer
RUN echo '{"name":"docling-actor-dependencies","version":"1.0.0","description":"Dependencies for Docling Actor","private":true,"type":"module","engines":{"node":">=18"}}' > package.json \
&& npm install apify-cli@latest \
&& cp -r node_modules/* lib/node_modules/ \
&& echo '#!/bin/sh\n/tmp/docling-tools/bin/node /tmp/docling-tools/lib/node_modules/apify-cli/bin/run "$@"' > bin/actor \
&& chmod +x bin/actor \
# Clean up npm cache to reduce image size
&& npm cache clean --force
# Final stage with docling-serve-cpu
FROM quay.io/ds4sd/docling-serve-cpu:latest
LABEL maintainer="Vaclav Vancura <@vancura>" \
description="Apify Actor for document processing using Docling" \
version="1.1.0"
# Set only essential environment variables
ENV PYTHONUNBUFFERED=1 \
PYTHONDONTWRITEBYTECODE=1 \
DOCLING_SERVE_HOST=0.0.0.0 \
DOCLING_SERVE_PORT=5001
# Switch to root temporarily to set up directories and permissions
USER root
WORKDIR /app
# Install required tools and create directories in a single layer
RUN dnf install -y \
jq \
&& dnf clean all \
&& mkdir -p /build-files \
/tmp \
/tmp/actor-input \
/tmp/actor-output \
/tmp/actor-storage \
/tmp/apify_input \
/apify_input \
/opt/app-root/src/.EasyOCR/user_network \
/tmp/easyocr-models \
&& chown 1000:1000 /build-files \
&& chown -R 1000:1000 /opt/app-root/src/.EasyOCR \
&& chmod 1777 /tmp \
&& chmod 1777 /tmp/easyocr-models \
&& chmod 777 /tmp/actor-input /tmp/actor-output /tmp/actor-storage /tmp/apify_input /apify_input \
# Fix for uv_os_get_passwd error in Node.js
&& echo "docling:x:1000:1000:Docling User:/app:/bin/sh" >> /etc/passwd
# Set environment variable to tell EasyOCR to use a writable location for models
ENV EASYOCR_MODULE_PATH=/tmp/easyocr-models
# Copy only required files
COPY --chown=1000:1000 .actor/actor.sh .actor/actor.sh
COPY --chown=1000:1000 .actor/actor.json .actor/actor.json
COPY --chown=1000:1000 .actor/input_schema.json .actor/input_schema.json
COPY --chown=1000:1000 .actor/docling_processor.py .actor/docling_processor.py
RUN chmod +x .actor/actor.sh
# Copy the build files from builder
COPY --from=builder --chown=1000:1000 /build /build-files
# Switch to non-root user
USER 1000
# Set up TMPFS for temporary files
VOLUME ["/tmp"]
# Create additional volumes for OCR models persistence
VOLUME ["/tmp/easyocr-models"]
# Expose the docling-serve API port
EXPOSE 5001
# Run the actor script
ENTRYPOINT [".actor/actor.sh"]

314
.actor/README.md Normal file
View File

@ -0,0 +1,314 @@
# Docling Actor on Apify
[![Docling Actor](https://apify.com/actor-badge?actor=vancura/docling?fpr=docling)](https://apify.com/vancura/docling)
This Actor (specification v1) wraps the [Docling project](https://ds4sd.github.io/docling/) to provide serverless document processing in the cloud. It can process complex documents (PDF, DOCX, images) and convert them into structured formats (Markdown, JSON, HTML, Text, or DocTags) with optional OCR support.
## What are Actors?
[Actors](https://docs.apify.com/platform/actors?fpr=docling) are serverless microservices running on the [Apify Platform](https://apify.com/?fpr=docling). They are based on the [Actor SDK](https://docs.apify.com/sdk/js?fpr=docling) and can be found in the [Apify Store](https://apify.com/store?fpr=docling). Learn more about Actors in the [Apify Whitepaper](https://whitepaper.actor?fpr=docling).
## Table of Contents
1. [Features](#features)
2. [Usage](#usage)
3. [Input Parameters](#input-parameters)
4. [Output](#output)
5. [Performance & Resources](#performance--resources)
6. [Troubleshooting](#troubleshooting)
7. [Local Development](#local-development)
8. [Architecture](#architecture)
9. [License](#license)
10. [Acknowledgments](#acknowledgments)
11. [Security Considerations](#security-considerations)
## Features
- Leverages the official docling-serve-cpu Docker image for efficient document processing
- Processes multiple document formats:
- PDF documents (scanned or digital)
- Microsoft Office files (DOCX, XLSX, PPTX)
- Images (PNG, JPG, TIFF)
- Other text-based formats
- Provides OCR capabilities for scanned documents
- Exports to multiple formats:
- Markdown
- JSON
- HTML
- Plain Text
- DocTags (structured format)
- No local setup needed—just provide input via a simple JSON config
## Usage
### Using Apify Console
1. Go to the Apify Actor page.
2. Click "Run".
3. In the input form, fill in:
- The URL of the document.
- Output format (`md`, `json`, `html`, `text`, or `doctags`).
- OCR boolean toggle.
4. The Actor will run and produce its outputs in the default key-value store under the key `OUTPUT`.
### Using Apify API
```bash
curl --request POST \
--url "https://api.apify.com/v2/acts/vancura~docling/run" \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer YOUR_API_TOKEN' \
--data '{
"options": {
"to_formats": ["md", "json", "html", "text", "doctags"]
},
"http_sources": [
{"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
{"url": "https://arxiv.org/pdf/2408.09869"}
]
}'
```
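The same run can be started from Python. Below is a minimal sketch using the `requests` library; the token is a placeholder, and the run ID is assumed to be returned under `data.id` in the response:

```python
import requests

API_TOKEN = "YOUR_API_TOKEN"  # placeholder, use your Apify API token

payload = {
    "options": {"to_formats": ["md", "json", "html", "text", "doctags"]},
    "http_sources": [{"url": "https://arxiv.org/pdf/2408.09869"}],
}

resp = requests.post(
    "https://api.apify.com/v2/acts/vancura~docling/run",
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["data"]["id"])  # ID of the started Actor run (assumed response field)
```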
### Using Apify CLI
```bash
apify call vancura/docling --input='{
"options": {
"to_formats": ["md", "json", "html", "text", "doctags"]
},
"http_sources": [
{"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
{"url": "https://arxiv.org/pdf/2408.09869"}
]
}'
```
## Input Parameters
The Actor accepts JSON input conforming to the schema in `.actor/input_schema.json`. Below is a summary of the fields:
| Field | Type | Required | Default | Description |
|----------------|---------|----------|----------|-------------------------------------------------------------------------------|
| `http_sources` | array   | Yes      | None     | List of documents to process; see the [docling-serve URL endpoint documentation](https://github.com/DS4SD/docling-serve?tab=readme-ov-file#url-endpoint) |
| `options`      | object  | No       | None     | Conversion options passed to docling-serve; see the [common parameters documentation](https://github.com/DS4SD/docling-serve?tab=readme-ov-file#common-parameters) |
### Example Input
```json
{
"options": {
"to_formats": ["md", "json", "html", "text", "doctags"]
},
"http_sources": [
{"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
{"url": "https://arxiv.org/pdf/2408.09869"}
]
}
```
## Output
The Actor provides three types of outputs:
1. **Processed Documents in a ZIP** - The Actor will provide the direct URL to your result in the run log, looking like:
```text
You can find your results at: 'https://api.apify.com/v2/key-value-stores/[YOUR_STORE_ID]/records/OUTPUT'
```
2. **Processing Log** - Available in the key-value store as `DOCLING_LOG`
3. **Dataset Record** - Contains processing metadata with:
- Direct link to the processed output zip file
- Processing status
You can access the results in several ways:
1. **Direct URL** (shown in Actor run logs; a Python sketch for downloading this record over HTTP follows this list):
```text
https://api.apify.com/v2/key-value-stores/[STORE_ID]/records/OUTPUT
```
2. **Programmatically** via Apify CLI:
```bash
apify key-value-stores get-value OUTPUT
```
3. **Dataset** - Check the "Dataset" tab in the Actor run details to see processing metadata
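If you prefer plain HTTP over the CLI, the `OUTPUT` record can also be downloaded directly from the key-value store. A minimal sketch with the `requests` library, where the store ID is a placeholder taken from the Actor run:

```python
import requests

store_id = "[YOUR_STORE_ID]"  # shown in the Actor run log
record_url = f"https://api.apify.com/v2/key-value-stores/{store_id}/records/OUTPUT"

resp = requests.get(record_url, timeout=120)
resp.raise_for_status()
with open("output.zip", "wb") as f:
    f.write(resp.content)  # processed documents as a ZIP archive
```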
### Example Outputs
#### Markdown (md)
```markdown
# Document Title
## Section 1
Content of section 1...
## Section 2
Content of section 2...
```
#### JSON
```json
{
"title": "Document Title",
"sections": [
{
"level": 1,
"title": "Section 1",
"content": "Content of section 1..."
}
]
}
```
#### HTML
```html
<h1>Document Title</h1>
<h2>Section 1</h2>
<p>Content of section 1...</p>
```
### Processing Logs (`DOCLING_LOG`)
The Actor maintains detailed processing logs including:
- API request and response details
- Processing steps and timing
- Error messages and stack traces
- Input validation results
Access logs via:
```bash
apify key-value-stores get-record DOCLING_LOG
```
## Performance & Resources
- **Docker Image Size**: ~4GB
- **Memory Requirements**:
- Minimum: 2 GB RAM
- Recommended: 4 GB RAM for large or complex documents
- **Processing Time**:
- Simple documents: 15-30 seconds
- Complex PDFs with OCR: 1-3 minutes
- Large documents (100+ pages): 3-10 minutes
## Troubleshooting
Common issues and solutions:
1. **Document URL Not Accessible**
- Ensure the URL is publicly accessible
- Check if the document requires authentication
- Verify the URL leads directly to the document
2. **OCR Processing Fails**
- Verify the document is not password-protected
- Check if the image quality is sufficient
- Try processing with OCR disabled
3. **API Response Issues**
- Check the logs for detailed error messages
- Ensure the document format is supported
- Verify the URL is correctly formatted
4. **Output Format Issues**
- Verify the output format is supported
- Check if the document structure is compatible
- Review the `DOCLING_LOG` for specific errors
### Error Handling
The Actor implements comprehensive error handling:
- Detailed error messages in `DOCLING_LOG`
- Proper exit codes for different failure scenarios
- Automatic cleanup on failure
- Dataset records with processing status
## Local Development
If you wish to develop or modify this Actor locally:
1. Clone the repository.
2. Ensure Docker is installed.
3. The Actor files are located in the `.actor` directory:
- `Dockerfile` - Defines the container environment
- `actor.json` - Actor configuration and metadata
- `actor.sh` - Main execution script that starts the docling-serve API and orchestrates document processing
- `input_schema.json` - Input parameter definitions
- `dataset_schema.json` - Dataset output format definition
- `CHANGELOG.md` - Change log documenting all notable changes
- `README.md` - This documentation
4. Run the Actor locally using:
```bash
apify run
```
### Actor Structure
```text
.actor/
├── Dockerfile # Container definition
├── actor.json # Actor metadata
├── actor.sh # Execution script (also starts docling-serve API)
├── input_schema.json # Input parameters
├── dataset_schema.json # Dataset output format definition
├── docling_processor.py # Python script for API communication
├── CHANGELOG.md # Version history and changes
└── README.md # This documentation
```
## Architecture
This Actor uses a lightweight architecture based on the official `quay.io/ds4sd/docling-serve-cpu` Docker image:
- **Base Image**: `quay.io/ds4sd/docling-serve-cpu:latest` (~4GB)
- **Multi-Stage Build**: Uses a multi-stage Docker build to include only necessary tools
- **API Communication**: Uses the RESTful API provided by docling-serve
- **Request Flow** (a Python sketch of this flow follows the list):
1. The actor script starts the docling-serve API on port 5001
2. Performs health checks to ensure the API is running
3. Processes the input parameters
4. Creates a JSON payload for the docling-serve API with proper format:
```json
{
"options": {
"to_formats": ["md"],
"do_ocr": true
},
"http_sources": [{"url": "https://example.com/document.pdf"}]
}
```
5. Makes a POST request to the `/v1alpha/convert/source` endpoint
6. Processes the response and stores it in the key-value store
- **Dependencies**:
- Node.js for Apify CLI
- Essential tools (curl, jq, etc.) copied from build stage
- **Security**: Runs as a non-root user for enhanced security
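The sketch below mirrors the request flow in Python: it waits for port 5001 to accept connections (as `actor.sh` does), then posts the payload with `return_as_file` enabled and saves the returned ZIP. It is a simplified illustration of the `curl`-based flow, not the script itself; the document URL is a placeholder.

```python
import socket
import time

import requests

# 1-2. Wait until docling-serve accepts connections on port 5001.
for _ in range(30):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1)
        if s.connect_ex(("localhost", 5001)) == 0:
            break
    time.sleep(1)

# 3-4. Build the payload (same structure as the example above).
payload = {
    "options": {"to_formats": ["md"], "do_ocr": True, "return_as_file": True},
    "http_sources": [{"url": "https://example.com/document.pdf"}],
}

# 5-6. POST to the conversion endpoint and store the resulting ZIP.
resp = requests.post(
    "http://localhost:5001/v1alpha/convert/source", json=payload, timeout=600
)
resp.raise_for_status()
with open("output.zip", "wb") as f:
    f.write(resp.content)
```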
## License
This wrapper project is under the MIT License, matching the original Docling license. See [LICENSE](../LICENSE) for details.
## Acknowledgments
- [Docling](https://ds4sd.github.io/docling/) and [docling-serve-cpu](https://quay.io/repository/ds4sd/docling-serve-cpu) by IBM
- [Apify](https://apify.com/?fpr=docling) for the serverless actor environment
## Security Considerations
- Actor runs under a non-root user for enhanced security
- Input URLs are validated before processing
- Temporary files are securely managed and cleaned up
- Process isolation through Docker containerization
- Secure handling of processing artifacts

11
.actor/actor.json Normal file
View File

@ -0,0 +1,11 @@
{
"actorSpecification": 1,
"name": "docling",
"version": "0.0",
"environmentVariables": {},
"dockerFile": "./Dockerfile",
"input": "./input_schema.json",
"scripts": {
"run": "./actor.sh"
}
}

419
.actor/actor.sh Executable file
View File

@ -0,0 +1,419 @@
#!/bin/bash
export PATH=$PATH:/build-files/node_modules/.bin
# Function to upload content to the key-value store
upload_to_kvs() {
local content_file="$1"
local key_name="$2"
local content_type="$3"
local description="$4"
# Find the Apify CLI command
find_apify_cmd
local apify_cmd="$FOUND_APIFY_CMD"
if [ -n "$apify_cmd" ]; then
echo "Uploading $description to key-value store (key: $key_name)..."
# Create a temporary home directory with write permissions
setup_temp_environment
# Use the --no-update-notifier flag if available
if $apify_cmd --help | grep -q "\--no-update-notifier"; then
if $apify_cmd --no-update-notifier actor:set-value "$key_name" --contentType "$content_type" < "$content_file"; then
echo "Successfully uploaded $description to key-value store"
local url="https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/$key_name"
echo "$description available at: $url"
cleanup_temp_environment
return 0
fi
else
# Fall back to regular command if flag isn't available
if $apify_cmd actor:set-value "$key_name" --contentType "$content_type" < "$content_file"; then
echo "Successfully uploaded $description to key-value store"
local url="https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/$key_name"
echo "$description available at: $url"
cleanup_temp_environment
return 0
fi
fi
echo "ERROR: Failed to upload $description to key-value store"
cleanup_temp_environment
return 1
else
echo "ERROR: Apify CLI not found for $description upload"
return 1
fi
}
# Function to find Apify CLI command
find_apify_cmd() {
FOUND_APIFY_CMD=""
for cmd in "apify" "actor" "/usr/local/bin/apify" "/usr/bin/apify" "/opt/apify/cli/bin/apify"; do
if command -v "$cmd" &> /dev/null; then
FOUND_APIFY_CMD="$cmd"
break
fi
done
}
# Function to set up temporary environment for Apify CLI
setup_temp_environment() {
export TMPDIR="/tmp/apify-home-${RANDOM}"
mkdir -p "$TMPDIR"
export APIFY_DISABLE_VERSION_CHECK=1
export NODE_OPTIONS="--no-warnings"
export HOME="$TMPDIR" # Override home directory to writable location
}
# Function to clean up temporary environment
cleanup_temp_environment() {
rm -rf "$TMPDIR" 2>/dev/null || true
}
# Function to push data to Apify dataset
push_to_dataset() {
# Example usage: push_to_dataset "$RESULT_URL" "$OUTPUT_SIZE" "zip"
local result_url="$1"
local size="$2"
local format="$3"
# Find Apify CLI command
find_apify_cmd
local apify_cmd="$FOUND_APIFY_CMD"
if [ -n "$apify_cmd" ]; then
echo "Adding record to dataset..."
setup_temp_environment
# Use the --no-update-notifier flag if available
if $apify_cmd --help | grep -q "\--no-update-notifier"; then
if $apify_cmd --no-update-notifier actor:push-data "{\"output_file\": \"${result_url}\", \"format\": \"${format}\", \"size\": \"${size}\", \"status\": \"success\"}"; then
echo "Successfully added record to dataset"
else
echo "Warning: Failed to add record to dataset"
fi
else
# Fall back to regular command
if $apify_cmd actor:push-data "{\"output_file\": \"${result_url}\", \"format\": \"${format}\", \"size\": \"${size}\", \"status\": \"success\"}"; then
echo "Successfully added record to dataset"
else
echo "Warning: Failed to add record to dataset"
fi
fi
cleanup_temp_environment
fi
}
# --- Setup logging and error handling ---
LOG_FILE="/tmp/docling.log"
touch "$LOG_FILE" || {
echo "Fatal: Cannot create log file at $LOG_FILE"
exit 1
}
# Log to both console and file
exec 1> >(tee -a "$LOG_FILE")
exec 2> >(tee -a "$LOG_FILE" >&2)
# Exit codes
readonly ERR_API_UNAVAILABLE=15
readonly ERR_INVALID_INPUT=16
# --- Debug environment ---
echo "Date: $(date)"
echo "Python version: $(python --version 2>&1)"
echo "Docling-serve path: $(which docling-serve 2>/dev/null || echo 'Not found')"
echo "Working directory: $(pwd)"
# --- Get input ---
echo "Getting Apify Actor Input"
INPUT=$(apify actor get-input 2>/dev/null)
# --- Setup tools ---
echo "Setting up tools..."
TOOLS_DIR="/tmp/docling-tools"
mkdir -p "$TOOLS_DIR"
# Copy tools if available
if [ -d "/build-files" ]; then
echo "Copying tools from /build-files..."
cp -r /build-files/* "$TOOLS_DIR/"
export PATH="$TOOLS_DIR/bin:$PATH"
else
echo "Warning: No build files directory found. Some tools may be unavailable."
fi
# Copy Python processor script to tools directory
PYTHON_SCRIPT_PATH="$(dirname "$0")/docling_processor.py"
if [ -f "$PYTHON_SCRIPT_PATH" ]; then
echo "Copying Python processor script to tools directory..."
cp "$PYTHON_SCRIPT_PATH" "$TOOLS_DIR/"
chmod +x "$TOOLS_DIR/docling_processor.py"
else
echo "ERROR: Python processor script not found at $PYTHON_SCRIPT_PATH"
exit 1
fi
# Check OCR directories and ensure they're writable
echo "Checking OCR directory permissions..."
OCR_DIR="/opt/app-root/src/.EasyOCR"
if [ -d "$OCR_DIR" ]; then
# Test if we can write to the directory
if touch "$OCR_DIR/test_write" 2>/dev/null; then
echo "[✓] OCR directory is writable"
rm "$OCR_DIR/test_write"
else
echo "[✗] OCR directory is not writable, setting up alternative in /tmp"
# Create alternative in /tmp (which is writable)
mkdir -p "/tmp/.EasyOCR/user_network"
export EASYOCR_MODULE_PATH="/tmp/.EasyOCR"
fi
else
echo "OCR directory not found, creating in /tmp"
mkdir -p "/tmp/.EasyOCR/user_network"
export EASYOCR_MODULE_PATH="/tmp/.EasyOCR"
fi
# --- Starting the API ---
echo "Starting docling-serve API..."
# Create a dedicated working directory in /tmp (writable)
API_DIR="/tmp/docling-api"
mkdir -p "$API_DIR"
cd "$API_DIR"
echo "API working directory: $(pwd)"
# Find docling-serve executable
DOCLING_SERVE_PATH=$(which docling-serve)
echo "Docling-serve executable: $DOCLING_SERVE_PATH"
# Start the API with minimal parameters to avoid any issues
echo "Starting docling-serve API..."
"$DOCLING_SERVE_PATH" run --host 0.0.0.0 --port 5001 > "$API_DIR/docling-serve.log" 2>&1 &
API_PID=$!
echo "Started docling-serve API with PID: $API_PID"
# A more reliable wait for API startup
echo "Waiting for API to initialize..."
MAX_TRIES=30
tries=0
started=false
while [ $tries -lt $MAX_TRIES ]; do
tries=$((tries + 1))
# Check if process is still running
if ! ps -p $API_PID > /dev/null; then
echo "ERROR: docling-serve API process terminated unexpectedly after $tries seconds"
break
fi
# Check log for startup completion or errors
if grep -q "Application startup complete" "$API_DIR/docling-serve.log" 2>/dev/null; then
echo "[✓] API startup completed successfully after $tries seconds"
started=true
break
fi
if grep -q "Permission denied\|PermissionError" "$API_DIR/docling-serve.log" 2>/dev/null; then
echo "ERROR: Permission errors detected in API startup"
break
fi
# Sleep and check again
sleep 1
# Output a progress indicator every 5 seconds
if [ $((tries % 5)) -eq 0 ]; then
echo "Still waiting for API startup... ($tries/$MAX_TRIES seconds)"
fi
done
# Show log content regardless of outcome
echo "docling-serve log output so far:"
tail -n 20 "$API_DIR/docling-serve.log"
# Verify the API is running
if ! ps -p $API_PID > /dev/null; then
echo "ERROR: docling-serve API failed to start"
if [ -f "$API_DIR/docling-serve.log" ]; then
echo "Full log output:"
cat "$API_DIR/docling-serve.log"
fi
exit $ERR_API_UNAVAILABLE
fi
if [ "$started" != "true" ]; then
echo "WARNING: API process is running but startup completion was not detected"
echo "Will attempt to continue anyway..."
fi
# Try to verify API is responding at this point
echo "Verifying API responsiveness..."
(python -c "
import sys, time, socket
for i in range(5):
try:
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(1)
result = s.connect_ex(('localhost', 5001))
if result == 0:
s.close()
print('Port 5001 is open and accepting connections')
sys.exit(0)
s.close()
except Exception as e:
pass
time.sleep(1)
print('Could not connect to API port after 5 attempts')
sys.exit(1)
" && echo "API verification succeeded") || echo "API verification failed, but continuing anyway"
# Define API endpoint
DOCLING_API_ENDPOINT="http://localhost:5001/v1alpha/convert/source"
# --- Processing document ---
echo "Starting document processing..."
echo "Reading input from Apify..."
echo "Input content:" >&2
echo "$INPUT" >&2 # Send the raw input to stderr for debugging
echo "$INPUT" # Send the clean JSON to stdout for processing
# Create the request JSON
REQUEST_JSON=$(echo "$INPUT" | jq '.options += {"return_as_file": true}')
echo "Creating request JSON:" >&2
echo "$REQUEST_JSON" >&2
echo "$REQUEST_JSON" > "$API_DIR/request.json"
# Send the conversion request using our Python script
#echo "Sending conversion request to docling-serve API..."
#python "$TOOLS_DIR/docling_processor.py" \
# --api-endpoint "$DOCLING_API_ENDPOINT" \
# --request-json "$API_DIR/request.json" \
# --output-dir "$API_DIR" \
# --output-format "$OUTPUT_FORMAT"
echo "Curl the Docling API"
curl -s -H "content-type: application/json" -X POST --data-binary @"$API_DIR/request.json" -o "$API_DIR/output.zip" "$DOCLING_API_ENDPOINT"
CURL_EXIT_CODE=$?
# --- Check for various potential output files ---
echo "Checking for output files..."
if [ -f "$API_DIR/output.zip" ]; then
echo "Conversion completed successfully! Output file found."
# Get content from the converted file
OUTPUT_SIZE=$(wc -c < "$API_DIR/output.zip")
echo "Output file found with size: $OUTPUT_SIZE bytes"
# Calculate the access URL for result display
RESULT_URL="https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/OUTPUT"
echo "=============================="
echo "PROCESSING COMPLETE!"
echo "Output size: ${OUTPUT_SIZE} bytes"
echo "=============================="
# Set the output content type based on format
CONTENT_TYPE="application/zip"
# Upload the document content using our function
upload_to_kvs "$API_DIR/output.zip" "OUTPUT" "$CONTENT_TYPE" "Document content"
# Only proceed with dataset record if document upload succeeded
if [ $? -eq 0 ]; then
echo "Your document is available at: ${RESULT_URL}"
echo "=============================="
# Push data to dataset
push_to_dataset "$RESULT_URL" "$OUTPUT_SIZE" "zip"
fi
else
echo "ERROR: No converted output file found at $API_DIR/output.zip"
# Create error metadata
ERROR_METADATA="{\"status\":\"error\",\"error\":\"No converted output file found\",\"documentUrl\":\"$DOCUMENT_URL\"}"
echo "$ERROR_METADATA" > "/tmp/actor-output/OUTPUT"
chmod 644 "/tmp/actor-output/OUTPUT"
echo "Error information has been saved to /tmp/actor-output/OUTPUT"
fi
# --- Verify output files for debugging ---
echo "=== Final Output Verification ==="
echo "Files in /tmp/actor-output:"
ls -la /tmp/actor-output/ 2>/dev/null || echo "Cannot list /tmp/actor-output/"
echo "All operations completed. The output should be available in the default key-value store."
echo "Content URL: ${RESULT_URL:-No URL available}"
# --- Cleanup function ---
cleanup() {
echo "Running cleanup..."
# Stop the API process
if [ -n "$API_PID" ]; then
echo "Stopping docling-serve API (PID: $API_PID)..."
kill $API_PID 2>/dev/null || true
fi
# Export log file to KVS if it exists
# DO THIS BEFORE REMOVING TOOLS DIRECTORY
if [ -f "$LOG_FILE" ]; then
if [ -s "$LOG_FILE" ]; then
echo "Log file is not empty, pushing to key-value store (key: LOG)..."
# Upload log using our function
upload_to_kvs "$LOG_FILE" "LOG" "text/plain" "Log file"
else
echo "Warning: log file exists but is empty"
fi
else
echo "Warning: No log file found"
fi
# Clean up temporary files AFTER log is uploaded
echo "Cleaning up temporary files..."
if [ -d "$API_DIR" ]; then
echo "Removing API working directory: $API_DIR"
rm -rf "$API_DIR" 2>/dev/null || echo "Warning: Failed to remove $API_DIR"
fi
if [ -d "$TOOLS_DIR" ]; then
echo "Removing tools directory: $TOOLS_DIR"
rm -rf "$TOOLS_DIR" 2>/dev/null || echo "Warning: Failed to remove $TOOLS_DIR"
fi
# Keep log file until the very end
echo "Script execution completed at $(date)"
echo "Actor execution completed"
}
# Register cleanup
trap cleanup EXIT

View File

@ -0,0 +1,31 @@
{
"title": "Docling Actor Dataset",
"description": "Records of document processing results from the Docling Actor",
"type": "object",
"schemaVersion": 1,
"properties": {
"url": {
"title": "Document URL",
"type": "string",
"description": "URL of the processed document"
},
"output_file": {
"title": "Result URL",
"type": "string",
"description": "Direct URL to the processed result in key-value store"
},
"status": {
"title": "Processing Status",
"type": "string",
"description": "Status of the document processing",
"enum": ["success", "error"]
},
"error": {
"title": "Error Details",
"type": "string",
"description": "Error message if processing failed",
"optional": true
}
},
"required": ["url", "output_file", "status"]
}

27
.actor/input_schema.json Normal file
View File

@ -0,0 +1,27 @@
{
"title": "Docling Actor Input",
"description": "Options for processing documents with Docling via the docling-serve API.",
"type": "object",
"schemaVersion": 1,
"properties": {
"http_sources": {
"title": "Document URLs",
"type": "array",
"description": "URLs of documents to process. Supported formats: PDF, DOCX, PPTX, XLSX, HTML, MD, XML, images, and more.",
"editor": "json",
"prefill": [
{ "url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf" }
]
},
"options": {
"title": "Processing Options",
"type": "object",
"description": "Document processing configuration options",
"editor": "json",
"prefill": {
"to_formats": ["md"]
}
}
},
"required": ["options", "http_sources"]
}

View File

@ -1,3 +1,40 @@
## [v2.28.0](https://github.com/docling-project/docling/releases/tag/v2.28.0) - 2025-03-19
### Feature
* **SmolDocling:** Support MLX acceleration in VLM pipeline ([#1199](https://github.com/docling-project/docling/issues/1199)) ([`1c26769`](https://github.com/docling-project/docling/commit/1c26769785bcd17c0b8b621c5182ad81134d3915))
* Add PPTX notes slides ([#474](https://github.com/docling-project/docling/issues/474)) ([`b454aa1`](https://github.com/docling-project/docling/commit/b454aa1551b891644ce4028ed2d7ec8f82c167ab))
* Updated vlm pipeline (with latest changes from docling-core) ([#1158](https://github.com/docling-project/docling/issues/1158)) ([`2f72167`](https://github.com/docling-project/docling/commit/2f72167ff6421424dea4d93018b0d43af16ec153))
### Fix
* Determine correct page size in DoclingParseV4Backend ([#1196](https://github.com/docling-project/docling/issues/1196)) ([`f5adfb9`](https://github.com/docling-project/docling/commit/f5adfb9724aae1207f23e21d74033f331e6e1ffb))
* **msword:** Fixing function return in equations handling ([#1194](https://github.com/docling-project/docling/issues/1194)) ([`0b707d0`](https://github.com/docling-project/docling/commit/0b707d0882f5be42505871799387d0b1882bffbf))
### Documentation
* Linux Foundation AI & Data ([#1183](https://github.com/docling-project/docling/issues/1183)) ([`1d680b0`](https://github.com/docling-project/docling/commit/1d680b0a321d95fc6bd65b7bb4d5e15005a0250a))
* Move apify to docs ([#1182](https://github.com/docling-project/docling/issues/1182)) ([`54a78c3`](https://github.com/docling-project/docling/commit/54a78c307de833b93f9b84cf1f8ed6dace8573cb))
## [v2.27.0](https://github.com/docling-project/docling/releases/tag/v2.27.0) - 2025-03-18
### Feature
* Add factory for ocr engines via plugins ([#1010](https://github.com/docling-project/docling/issues/1010)) ([`6eaae3c`](https://github.com/docling-project/docling/commit/6eaae3cba034599020dc06ebdad3bc3ff0b5a8eb))
* Add DoclingParseV4 backend, using high-level docling-parse API ([#905](https://github.com/docling-project/docling/issues/905)) ([`3960b19`](https://github.com/docling-project/docling/commit/3960b199d63d0e9d660aeb0cbced02b38bb0b593))
* **actor:** Docling Actor on Apify infrastructure ([#875](https://github.com/docling-project/docling/issues/875)) ([`772487f`](https://github.com/docling-project/docling/commit/772487f9c91ad2ee53c591c314c72443f9cbfd23))
* Equations to latex in MSWord backend (with inline groups) ([#1114](https://github.com/docling-project/docling/issues/1114)) ([`6eb718f`](https://github.com/docling-project/docling/commit/6eb718f8493038d1b4b6ae836df5a24aa13cd17e))
### Fix
* **html:** Handle nested empty lists ([#1154](https://github.com/docling-project/docling/issues/1154)) ([`f94da44`](https://github.com/docling-project/docling/commit/f94da44ec5c7a8c92b9dd60e4df5dc945ed6d1ea))
* Use first table row as col headers ([#1156](https://github.com/docling-project/docling/issues/1156)) ([`0945973`](https://github.com/docling-project/docling/commit/0945973b79d67b74281aba5102ee985ac1de74ea))
* Pass tests, update docling-core to 2.22.0 ([#1150](https://github.com/docling-project/docling/issues/1150)) ([`aa92a57`](https://github.com/docling-project/docling/commit/aa92a57fa9e7228e894efb9050a0cdb9f287ebfd))
### Documentation
* Fix spelling of picture in usage ([#1165](https://github.com/docling-project/docling/issues/1165)) ([`7e01798`](https://github.com/docling-project/docling/commit/7e01798417c424c05685e0ff5f6f89f70dc3bfcd))
## [v2.26.0](https://github.com/docling-project/docling/releases/tag/v2.26.0) - 2025-03-11
### Feature

View File

@ -1,129 +1,3 @@
# Contributor Covenant Code of Conduct
## Our Pledge
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.
## Our Standards
Examples of behavior that contributes to a positive environment for our
community include:
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
overall community
Examples of unacceptable behavior include:
* The use of sexualized language or imagery, and sexual attention or
advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Enforcement Responsibilities
Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.
Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.
## Scope
This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement using
[deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).
All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the
reporter of any incident.
## Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:
### 1. Correction
**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.
**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.
### 2. Warning
**Community Impact**: A violation through a single incident or series
of actions.
**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.
### 3. Temporary Ban
**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.
**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.
### 4. Permanent Ban
**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.
**Consequence**: A permanent ban from any sort of public interaction within
the community.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.0, available at
[https://www.contributor-covenant.org/version/2/0/code_of_conduct.html](https://www.contributor-covenant.org/version/2/0/code_of_conduct.html).
Community Impact Guidelines were inspired by [Mozilla's code of conduct
enforcement ladder](https://github.com/mozilla/diversity).
Homepage: [https://www.contributor-covenant.org](https://www.contributor-covenant.org)
For answers to common questions about this code of conduct, see the FAQ at
[https://www.contributor-covenant.org/faq](https://www.contributor-covenant.org/faq). Translations are available at
[https://www.contributor-covenant.org/translations](https://www.contributor-covenant.org/translations).
This project adheres to the [Docling - Code of Conduct and Covenant](https://github.com/docling-project/community/blob/main/CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code.

View File

@ -2,85 +2,7 @@
Our project welcomes external contributions. If you have an itch, please feel
free to scratch it.
To contribute code or documentation, please submit a [pull request](https://github.com/docling-project/docling/pulls).
A good way to familiarize yourself with the codebase and contribution process is
to look for and tackle low-hanging fruit in the [issue tracker](https://github.com/docling-project/docling/issues).
Before embarking on a more ambitious contribution, please quickly [get in touch](#communication) with us.
For general questions or support requests, please refer to the [discussion section](https://github.com/docling-project/docling/discussions).
**Note: We appreciate your effort and want to avoid situations where a contribution
requires extensive rework (by you or by us), sits in the backlog for a long time, or
cannot be accepted at all!**
### Proposing New Features
If you would like to implement a new feature, please [raise an issue](https://github.com/docling-project/docling/issues)
before sending a pull request so the feature can be discussed. This is to avoid
you spending valuable time working on a feature that the project developers
are not interested in accepting into the codebase.
### Fixing Bugs
If you would like to fix a bug, please [raise an issue](https://github.com/docling-project/docling/issues) before sending a
pull request so it can be tracked.
### Merge Approval
The project maintainers use LGTM (Looks Good To Me) in comments on the code
review to indicate acceptance. A change requires LGTMs from two of the
maintainers of each component affected.
For a list of the maintainers, see the [MAINTAINERS.md](MAINTAINERS.md) page.
## Legal
Each source file must include a license header for the MIT
Software. Using the SPDX format is the simplest approach,
e.g.
```
/*
Copyright IBM Inc. All rights reserved.
SPDX-License-Identifier: MIT
*/
```
We have tried to make it as easy as possible to make contributions. This
applies to how we handle the legal aspects of contribution. We use the
same approach - the [Developer's Certificate of Origin 1.1 (DCO)](https://github.com/hyperledger/fabric/blob/master/docs/source/DCO1.1.txt) - that the Linux® Kernel [community](https://elinux.org/Developer_Certificate_Of_Origin)
uses to manage code contributions.
We simply ask that when submitting a patch for review, the developer
must include a sign-off statement in the commit message.
Here is an example Signed-off-by line, which indicates that the
submitter accepts the DCO:
```
Signed-off-by: John Doe <john.doe@example.com>
```
You can include this automatically when you commit a change to your
local git repository using the following command:
```
git commit -s
```
### New dependencies
This project strictly adheres to using dependencies that are compatible with the MIT license to ensure maximum flexibility and permissiveness in its usage and distribution. As a result, dependencies licensed under restrictive terms such as GPL, LGPL, AGPL, or similar are explicitly excluded. These licenses impose additional requirements and limitations that are incompatible with the MIT license's minimal restrictions, potentially affecting derivative works and redistribution. By maintaining this policy, the project ensures simplicity and freedom for both developers and users, avoiding conflicts with stricter copyleft provisions.
## Communication
Please feel free to connect with us using the [discussion section](https://github.com/docling-project/docling/discussions).
For more details on the contributing guidelines head to the Docling Project [community repository](https://github.com/docling-project/community).
## Developing

View File

@ -4,7 +4,7 @@ ENV GIT_SSH_COMMAND="ssh -o StrictHostKeyChecking=no"
RUN apt-get update \
&& apt-get install -y libgl1 libglib2.0-0 curl wget git procps \
&& apt-get clean
&& rm -rf /var/lib/apt/lists/*
# This will install torch with *only* cpu support
# Remove the --extra-index-url part if you want to install all the gpu requirements

View File

@ -2,9 +2,6 @@
- Christoph Auer - [@cau-git](https://github.com/cau-git)
- Michele Dolfi - [@dolfim-ibm](https://github.com/dolfim-ibm)
- Maxim Lysak - [@maxmnemonic](https://github.com/maxmnemonic)
- Nikos Livathinos - [@nikos-livathinos](https://github.com/nikos-livathinos)
- Ahmed Nassar - [@nassarofficial](https://github.com/nassarofficial)
- Panos Vagenas - [@vagenas](https://github.com/vagenas)
- Peter Staar - [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)

View File

@ -21,6 +21,8 @@
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![License MIT](https://img.shields.io/github/license/docling-project/docling)](https://opensource.org/licenses/MIT)
[![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
[![Docling Actor](https://apify.com/actor-badge?actor=vancura/docling?fpr=docling)](https://apify.com/vancura/docling)
[![LF AI & Data](https://img.shields.io/badge/LF%20AI%20%26%20Data-003778?logo=linuxfoundation&logoColor=fff&color=0094ff&labelColor=003778)](https://lfaidata.foundation/projects/)
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
@ -33,12 +35,12 @@ Docling simplifies document processing, parsing diverse formats — including ad
* 🔒 Local execution capabilities for sensitive data and air-gapped environments
* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 Extensive OCR support for scanned PDFs and images
* 🥚 Support of Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview)) 🆕
* 💻 Simple and convenient CLI
### Coming soon
* 📝 Metadata extraction, including title, authors, references & language
* 📝 Inclusion of Visual Language Models ([SmolDocling](https://huggingface.co/blog/smolervlm#smoldocling))
* 📝 Chart understanding (Barchart, Piechart, LinePlot, etc)
* 📝 Complex chemistry understanding (Molecular structures)
@ -55,7 +57,7 @@ More [detailed installation instructions](https://docling-project.github.io/docl
## Getting started
To convert individual documents, use `convert()`, for example:
To convert individual documents with python, use `convert()`, for example:
```python
from docling.document_converter import DocumentConverter
@ -69,6 +71,22 @@ print(result.document.export_to_markdown()) # output: "## Docling Technical Rep
More [advanced usage options](https://docling-project.github.io/docling/usage/) are available in
the docs.
## CLI
Docling has a built-in CLI to run conversions.
```bash
docling https://arxiv.org/pdf/2206.01062
```
You can also use 🥚[SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview) and other VLMs via Docling CLI:
```bash
docling --pipeline vlm --vlm-model smoldocling https://arxiv.org/pdf/2206.01062
```
This will use MLX acceleration on supported Apple Silicon hardware.
Read more [here](https://docling-project.github.io/docling/usage/)
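The same VLM pipeline can also be assembled from Python. The sketch below follows the options wiring used by the CLI in this commit; the import path for `InputFormat` is an assumption (`docling.datamodel.base_models`) and may differ:

```python
from docling.datamodel.base_models import InputFormat  # assumed import path
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    smoldocling_vlm_conversion_options,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Configure the VLM pipeline with the SmolDocling conversion options.
pipeline_options = VlmPipelineOptions()
pipeline_options.vlm_options = smoldocling_vlm_conversion_options

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline, pipeline_options=pipeline_options
        )
    }
)

result = converter.convert("https://arxiv.org/pdf/2206.01062")
print(result.document.export_to_markdown())
```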
## Documentation
Check out Docling's [documentation](https://docling-project.github.io/docling/), for details on
@ -119,9 +137,13 @@ If you use Docling in your projects, please consider citing the following:
The Docling codebase is under MIT license.
For individual model usage, please refer to the model licenses found in the original packages.
## IBM ❤️ Open Source AI
## LF AI & Data
Docling has been brought to you by IBM.
Docling is hosted as a project in the [LF AI & Data Foundation](https://lfaidata.foundation/projects/).
### IBM ❤️ Open Source AI
The project was started by the AI for knowledge team at IBM Research Zurich.
[supported_formats]: https://docling-project.github.io/docling/usage/supported_formats/
[docling_document]: https://docling-project.github.io/docling/concepts/docling_document/

View File

@ -112,23 +112,30 @@ class DoclingParseV4PageBackend(PdfPageBackend):
padbox.r = page_size.width - padbox.r
padbox.t = page_size.height - padbox.t
image = (
self._ppage.render(
scale=scale * 1.5,
rotation=0, # no additional rotation
crop=padbox.as_tuple(),
)
.to_pil()
.resize(size=(round(cropbox.width * scale), round(cropbox.height * scale)))
) # We resize the image from 1.5x the given scale to make it sharper.
with pypdfium2_lock:
image = (
self._ppage.render(
scale=scale * 1.5,
rotation=0, # no additional rotation
crop=padbox.as_tuple(),
)
.to_pil()
.resize(
size=(round(cropbox.width * scale), round(cropbox.height * scale))
)
) # We resize the image from 1.5x the given scale to make it sharper.
return image
def get_size(self) -> Size:
return Size(
width=self._dpage.dimension.width,
height=self._dpage.dimension.height,
)
with pypdfium2_lock:
return Size(width=self._ppage.get_width(), height=self._ppage.get_height())
# TODO: Take width and height from docling-parse.
# return Size(
# width=self._dpage.dimension.width,
# height=self._dpage.dimension.height,
# )
def unload(self):
self._ppage = None

View File

@ -16,6 +16,7 @@ from docling_core.types.doc import (
TableCell,
TableData,
)
from docling_core.types.doc.document import ContentLayer
from PIL import Image, UnidentifiedImageError
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE, PP_PLACEHOLDER
@ -421,4 +422,21 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
for shape in slide.shapes:
handle_shapes(shape, parent_slide, slide_ind, doc, slide_size)
# Handle notes slide
if slide.has_notes_slide:
notes_slide = slide.notes_slide
notes_text = notes_slide.notes_text_frame.text.strip()
if notes_text:
bbox = BoundingBox(l=0, t=0, r=0, b=0)
prov = ProvenanceItem(
page_no=slide_ind + 1, charspan=[0, len(notes_text)], bbox=bbox
)
doc.add_text(
label=DocItemLabel.TEXT,
parent=parent_slide,
text=notes_text,
prov=prov,
content_layer=ContentLayer.FURNITURE,
)
return doc

View File

@ -53,6 +53,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
self.max_levels: int = 10
self.level_at_new_list: Optional[int] = None
self.parents: dict[int, Optional[NodeItem]] = {}
self.numbered_headers: dict[int, int] = {}
for i in range(-1, self.max_levels):
self.parents[i] = None
@ -275,8 +276,10 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
only_equations.append(latex_equation)
texts_and_equations.append(latex_equation)
if "".join(only_texts) != text:
return text
if "".join(only_texts).strip() != text.strip():
# If we are not able to reconstruct the initial raw text
# do not try to parse equations and return the original
return text, []
return "".join(texts_and_equations), only_equations
@ -344,7 +347,14 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
parent=None, label=DocItemLabel.TITLE, text=text
)
elif "Heading" in p_style_id:
self.add_header(doc, p_level, text)
style_element = getattr(paragraph.style, "element", None)
if style_element:
is_numbered_style = (
"<w:numPr>" in style_element.xml or "<w:numPr>" in element.xml
)
else:
is_numbered_style = False
self.add_header(doc, p_level, text, is_numbered_style)
elif len(equations) > 0:
if (raw_text is None or len(raw_text) == 0) and len(text) > 0:
@ -365,6 +375,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
for eq in equations:
if len(text_tmp) == 0:
break
pre_eq_text = text_tmp.split(eq, maxsplit=1)[0]
text_tmp = text_tmp.split(eq, maxsplit=1)[1]
if len(pre_eq_text) > 0:
@ -412,7 +423,11 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
return
def add_header(
self, doc: DoclingDocument, curr_level: Optional[int], text: str
self,
doc: DoclingDocument,
curr_level: Optional[int],
text: str,
is_numbered_style: bool = False,
) -> None:
level = self.get_level()
if isinstance(curr_level, int):
@ -430,17 +445,44 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
if key >= curr_level:
self.parents[key] = None
self.parents[curr_level] = doc.add_heading(
parent=self.parents[curr_level - 1],
text=text,
level=curr_level,
)
current_level = curr_level
parent_level = curr_level - 1
add_level = curr_level
else:
self.parents[self.level] = doc.add_heading(
parent=self.parents[self.level - 1],
text=text,
level=1,
)
current_level = self.level
parent_level = self.level - 1
add_level = 1
if is_numbered_style:
if add_level in self.numbered_headers:
self.numbered_headers[add_level] += 1
else:
self.numbered_headers[add_level] = 1
text = f"{self.numbered_headers[add_level]} {text}"
# Reset deeper levels
next_level = add_level + 1
while next_level in self.numbered_headers:
self.numbered_headers[next_level] = 0
next_level += 1
# Scan upper levels
previous_level = add_level - 1
while previous_level in self.numbered_headers:
# MSWord convention: no empty sublevels
# I.e., sub-sub section (2.0.1) without a sub-section (2.1)
# is processed as 2.1.1
if self.numbered_headers[previous_level] == 0:
self.numbered_headers[previous_level] += 1
text = f"{self.numbered_headers[previous_level]}.{text}"
previous_level -= 1
self.parents[current_level] = doc.add_heading(
parent=self.parents[parent_level],
text=text,
level=add_level,
)
return
def add_listitem(

View File

@ -9,6 +9,7 @@ import warnings
from pathlib import Path
from typing import Annotated, Dict, Iterable, List, Optional, Type
import rich.table
import typer
from docling_core.types.doc import ImageRefMode
from docling_core.utils.file import resolve_source_to_path
@ -30,18 +31,22 @@ from docling.datamodel.pipeline_options import (
AcceleratorDevice,
AcceleratorOptions,
EasyOcrOptions,
OcrEngine,
OcrMacOptions,
OcrOptions,
PaginatedPipelineOptions,
PdfBackend,
PdfPipeline,
PdfPipelineOptions,
RapidOcrOptions,
TableFormerMode,
TesseractCliOcrOptions,
TesseractOcrOptions,
VlmModelType,
VlmPipelineOptions,
granite_vision_vlm_conversion_options,
smoldocling_vlm_conversion_options,
smoldocling_vlm_mlx_conversion_options,
)
from docling.datamodel.settings import settings
from docling.document_converter import DocumentConverter, FormatOption, PdfFormatOption
from docling.models.factories import get_ocr_factory
from docling.pipeline.vlm_pipeline import VlmPipeline
warnings.filterwarnings(action="ignore", category=UserWarning, module="pydantic|torch")
warnings.filterwarnings(action="ignore", category=FutureWarning, module="easyocr")
@ -49,8 +54,11 @@ warnings.filterwarnings(action="ignore", category=FutureWarning, module="easyocr
_log = logging.getLogger(__name__)
from rich.console import Console
console = Console()
err_console = Console(stderr=True)
ocr_factory_internal = get_ocr_factory(allow_external_plugins=False)
ocr_engines_enum_internal = ocr_factory_internal.get_enum()
app = typer.Typer(
name="Docling",
@ -78,6 +86,24 @@ def version_callback(value: bool):
raise typer.Exit()
def show_external_plugins_callback(value: bool):
if value:
ocr_factory_all = get_ocr_factory(allow_external_plugins=True)
table = rich.table.Table(title="Available OCR engines")
table.add_column("Name", justify="right")
table.add_column("Plugin")
table.add_column("Package")
for meta in ocr_factory_all.registered_meta.values():
if not meta.module.startswith("docling."):
table.add_row(
f"[bold]{meta.kind}[/bold]",
meta.plugin_name,
meta.module.split(".")[0],
)
rich.print(table)
raise typer.Exit()
def export_documents(
conv_results: Iterable[ConversionResult],
output_dir: Path,
@ -182,6 +208,14 @@ def convert(
help="Image export mode for the document (only in case of JSON, Markdown or HTML). With `placeholder`, only the position of the image is marked in the output. In `embedded` mode, the image is embedded as base64 encoded string. In `referenced` mode, the image is exported in PNG format and referenced from the main exported document.",
),
] = ImageRefMode.EMBEDDED,
pipeline: Annotated[
PdfPipeline,
typer.Option(..., help="Choose the pipeline to process PDF or image files."),
] = PdfPipeline.STANDARD,
vlm_model: Annotated[
VlmModelType,
typer.Option(..., help="Choose the VLM model to use with PDF or image files."),
] = VlmModelType.SMOLDOCLING,
ocr: Annotated[
bool,
typer.Option(
@ -196,8 +230,16 @@ def convert(
),
] = False,
ocr_engine: Annotated[
OcrEngine, typer.Option(..., help="The OCR engine to use.")
] = OcrEngine.EASYOCR,
str,
typer.Option(
...,
help=(
f"The OCR engine to use. When --allow-external-plugins is *not* set, the available values are: "
f"{', '.join((o.value for o in ocr_engines_enum_internal))}. "
f"Use the option --show-external-plugins to see the options allowed with external plugins."
),
),
] = EasyOcrOptions.kind,
ocr_lang: Annotated[
Optional[str],
typer.Option(
@ -241,6 +283,21 @@ def convert(
..., help="Must be enabled when using models connecting to remote services."
),
] = False,
allow_external_plugins: Annotated[
bool,
typer.Option(
..., help="Must be enabled for loading modules from third-party plugins."
),
] = False,
show_external_plugins: Annotated[
bool,
typer.Option(
...,
help="List the third-party plugins which are available when the option --allow-external-plugins is set.",
callback=show_external_plugins_callback,
is_eager=True,
),
] = False,
abort_on_error: Annotated[
bool,
typer.Option(
@ -368,67 +425,88 @@ def convert(
export_txt = OutputFormat.TEXT in to_formats
export_doctags = OutputFormat.DOCTAGS in to_formats
if ocr_engine == OcrEngine.EASYOCR:
ocr_options: OcrOptions = EasyOcrOptions(force_full_page_ocr=force_ocr)
elif ocr_engine == OcrEngine.TESSERACT_CLI:
ocr_options = TesseractCliOcrOptions(force_full_page_ocr=force_ocr)
elif ocr_engine == OcrEngine.TESSERACT:
ocr_options = TesseractOcrOptions(force_full_page_ocr=force_ocr)
elif ocr_engine == OcrEngine.OCRMAC:
ocr_options = OcrMacOptions(force_full_page_ocr=force_ocr)
elif ocr_engine == OcrEngine.RAPIDOCR:
ocr_options = RapidOcrOptions(force_full_page_ocr=force_ocr)
else:
raise RuntimeError(f"Unexpected OCR engine type {ocr_engine}")
ocr_factory = get_ocr_factory(allow_external_plugins=allow_external_plugins)
ocr_options: OcrOptions = ocr_factory.create_options( # type: ignore
kind=ocr_engine,
force_full_page_ocr=force_ocr,
)
ocr_lang_list = _split_list(ocr_lang)
if ocr_lang_list is not None:
ocr_options.lang = ocr_lang_list
accelerator_options = AcceleratorOptions(num_threads=num_threads, device=device)
pipeline_options = PdfPipelineOptions(
enable_remote_services=enable_remote_services,
accelerator_options=accelerator_options,
do_ocr=ocr,
ocr_options=ocr_options,
do_table_structure=True,
do_code_enrichment=enrich_code,
do_formula_enrichment=enrich_formula,
do_picture_description=enrich_picture_description,
do_picture_classification=enrich_picture_classes,
document_timeout=document_timeout,
)
pipeline_options.table_structure_options.do_cell_matching = (
True # do_cell_matching
)
pipeline_options.table_structure_options.mode = table_mode
pipeline_options: PaginatedPipelineOptions
if image_export_mode != ImageRefMode.PLACEHOLDER:
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = (
True # FIXME: to be deprecated in version 3
if pipeline == PdfPipeline.STANDARD:
pipeline_options = PdfPipelineOptions(
allow_external_plugins=allow_external_plugins,
enable_remote_services=enable_remote_services,
accelerator_options=accelerator_options,
do_ocr=ocr,
ocr_options=ocr_options,
do_table_structure=True,
do_code_enrichment=enrich_code,
do_formula_enrichment=enrich_formula,
do_picture_description=enrich_picture_description,
do_picture_classification=enrich_picture_classes,
document_timeout=document_timeout,
)
pipeline_options.table_structure_options.do_cell_matching = (
True # do_cell_matching
)
pipeline_options.table_structure_options.mode = table_mode
if image_export_mode != ImageRefMode.PLACEHOLDER:
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = (
True # FIXME: to be deprecated in version 3
)
pipeline_options.images_scale = 2
backend: Type[PdfDocumentBackend]
if pdf_backend == PdfBackend.DLPARSE_V1:
backend = DoclingParseDocumentBackend
elif pdf_backend == PdfBackend.DLPARSE_V2:
backend = DoclingParseV2DocumentBackend
elif pdf_backend == PdfBackend.DLPARSE_V4:
backend = DoclingParseV4DocumentBackend # type: ignore
elif pdf_backend == PdfBackend.PYPDFIUM2:
backend = PyPdfiumDocumentBackend # type: ignore
else:
raise RuntimeError(f"Unexpected PDF backend type {pdf_backend}")
pdf_format_option = PdfFormatOption(
pipeline_options=pipeline_options,
backend=backend, # pdf_backend
)
elif pipeline == PdfPipeline.VLM:
pipeline_options = VlmPipelineOptions()
if vlm_model == VlmModelType.GRANITE_VISION:
pipeline_options.vlm_options = granite_vision_vlm_conversion_options
elif vlm_model == VlmModelType.SMOLDOCLING:
pipeline_options.vlm_options = smoldocling_vlm_conversion_options
if sys.platform == "darwin":
try:
import mlx_vlm
pipeline_options.vlm_options = (
smoldocling_vlm_mlx_conversion_options
)
except ImportError:
_log.warning(
"To run SmolDocling faster, please install mlx-vlm:\n"
"pip install mlx-vlm"
)
pdf_format_option = PdfFormatOption(
pipeline_cls=VlmPipeline, pipeline_options=pipeline_options
)
pipeline_options.images_scale = 2
if artifacts_path is not None:
pipeline_options.artifacts_path = artifacts_path
backend: Type[PdfDocumentBackend]
if pdf_backend == PdfBackend.DLPARSE_V1:
backend = DoclingParseDocumentBackend
elif pdf_backend == PdfBackend.DLPARSE_V2:
backend = DoclingParseV2DocumentBackend
elif pdf_backend == PdfBackend.DLPARSE_V4:
backend = DoclingParseV4DocumentBackend # type: ignore
elif pdf_backend == PdfBackend.PYPDFIUM2:
backend = PyPdfiumDocumentBackend # type: ignore
else:
raise RuntimeError(f"Unexpected PDF backend type {pdf_backend}")
pdf_format_option = PdfFormatOption(
pipeline_options=pipeline_options,
backend=backend, # pdf_backend
)
format_options: Dict[InputFormat, FormatOption] = {
InputFormat.PDF: pdf_format_option,
InputFormat.IMAGE: pdf_format_option,

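The hunk above swaps the fixed `OcrEngine` enum for a plain string resolved through the new OCR factory, so `--ocr-engine` can name any registered kind, including ones added by external plugins. A minimal sketch of the same resolution done from Python, using only names visible in this commit:

```python
# Resolve OCR options from a kind string via the factory, as the CLI now does.
from docling.models.factories import get_ocr_factory

ocr_factory = get_ocr_factory(allow_external_plugins=False)

# "easyocr" is EasyOcrOptions.kind; force_full_page_ocr mirrors the CLI's force-OCR switch.
ocr_options = ocr_factory.create_options(kind="easyocr", force_full_page_ocr=False)
print(type(ocr_options).__name__)  # EasyOcrOptions
```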

@ -1,10 +1,9 @@
import logging
import os
import re
import warnings
from enum import Enum
from pathlib import Path
from typing import Annotated, Any, Dict, List, Literal, Optional, Union
from typing import Any, ClassVar, Dict, List, Literal, Optional, Union
from pydantic import (
AnyUrl,
@ -13,13 +12,8 @@ from pydantic import (
Field,
field_validator,
model_validator,
validator,
)
from pydantic_settings import (
BaseSettings,
PydanticBaseSettingsSource,
SettingsConfigDict,
)
from pydantic_settings import BaseSettings, SettingsConfigDict
from typing_extensions import deprecated
_log = logging.getLogger(__name__)
@ -83,6 +77,12 @@ class AcceleratorOptions(BaseSettings):
return data
class BaseOptions(BaseModel):
"""Base class for options."""
kind: ClassVar[str]
class TableFormerMode(str, Enum):
"""Modes for the TableFormer model."""
@ -102,10 +102,9 @@ class TableStructureOptions(BaseModel):
mode: TableFormerMode = TableFormerMode.ACCURATE
class OcrOptions(BaseModel):
class OcrOptions(BaseOptions):
"""OCR options."""
kind: str
lang: List[str]
force_full_page_ocr: bool = False # If enabled a full page OCR is always applied
bitmap_area_threshold: float = (
@ -116,7 +115,7 @@ class OcrOptions(BaseModel):
class RapidOcrOptions(OcrOptions):
"""Options for the RapidOCR engine."""
kind: Literal["rapidocr"] = "rapidocr"
kind: ClassVar[Literal["rapidocr"]] = "rapidocr"
# English and Chinese are the most commonly used models and have been tested with RapidOCR.
lang: List[str] = [
@ -155,7 +154,7 @@ class RapidOcrOptions(OcrOptions):
class EasyOcrOptions(OcrOptions):
"""Options for the EasyOCR engine."""
kind: Literal["easyocr"] = "easyocr"
kind: ClassVar[Literal["easyocr"]] = "easyocr"
lang: List[str] = ["fr", "de", "es", "en"]
use_gpu: Optional[bool] = None
@ -175,7 +174,7 @@ class EasyOcrOptions(OcrOptions):
class TesseractCliOcrOptions(OcrOptions):
"""Options for the TesseractCli engine."""
kind: Literal["tesseract"] = "tesseract"
kind: ClassVar[Literal["tesseract"]] = "tesseract"
lang: List[str] = ["fra", "deu", "spa", "eng"]
tesseract_cmd: str = "tesseract"
path: Optional[str] = None
@ -188,7 +187,7 @@ class TesseractCliOcrOptions(OcrOptions):
class TesseractOcrOptions(OcrOptions):
"""Options for the Tesseract engine."""
kind: Literal["tesserocr"] = "tesserocr"
kind: ClassVar[Literal["tesserocr"]] = "tesserocr"
lang: List[str] = ["fra", "deu", "spa", "eng"]
path: Optional[str] = None
@ -200,7 +199,7 @@ class TesseractOcrOptions(OcrOptions):
class OcrMacOptions(OcrOptions):
"""Options for the Mac OCR engine."""
kind: Literal["ocrmac"] = "ocrmac"
kind: ClassVar[Literal["ocrmac"]] = "ocrmac"
lang: List[str] = ["fr-FR", "de-DE", "es-ES", "en-US"]
recognition: str = "accurate"
framework: str = "vision"
@ -210,8 +209,7 @@ class OcrMacOptions(OcrOptions):
)
class PictureDescriptionBaseOptions(BaseModel):
kind: str
class PictureDescriptionBaseOptions(BaseOptions):
batch_size: int = 8
scale: float = 2
@ -221,7 +219,7 @@ class PictureDescriptionBaseOptions(BaseModel):
class PictureDescriptionApiOptions(PictureDescriptionBaseOptions):
kind: Literal["api"] = "api"
kind: ClassVar[Literal["api"]] = "api"
url: AnyUrl = AnyUrl("http://localhost:8000/v1/chat/completions")
headers: Dict[str, str] = {}
@ -233,7 +231,7 @@ class PictureDescriptionApiOptions(PictureDescriptionBaseOptions):
class PictureDescriptionVlmOptions(PictureDescriptionBaseOptions):
kind: Literal["vlm"] = "vlm"
kind: ClassVar[Literal["vlm"]] = "vlm"
repo_id: str
prompt: str = "Describe this image in a few sentences."
@ -265,6 +263,11 @@ class ResponseFormat(str, Enum):
MARKDOWN = "markdown"
class InferenceFramework(str, Enum):
MLX = "mlx"
TRANSFORMERS = "transformers"
class HuggingFaceVlmOptions(BaseVlmOptions):
kind: Literal["hf_model_options"] = "hf_model_options"
@ -273,6 +276,7 @@ class HuggingFaceVlmOptions(BaseVlmOptions):
llm_int8_threshold: float = 6.0
quantized: bool = False
inference_framework: InferenceFramework
response_format: ResponseFormat
@property
@ -280,10 +284,19 @@ class HuggingFaceVlmOptions(BaseVlmOptions):
return self.repo_id.replace("/", "--")
smoldocling_vlm_mlx_conversion_options = HuggingFaceVlmOptions(
repo_id="ds4sd/SmolDocling-256M-preview-mlx-bf16",
prompt="Convert this page to docling.",
response_format=ResponseFormat.DOCTAGS,
inference_framework=InferenceFramework.MLX,
)
smoldocling_vlm_conversion_options = HuggingFaceVlmOptions(
repo_id="ds4sd/SmolDocling-256M-preview",
prompt="Convert this page to docling.",
response_format=ResponseFormat.DOCTAGS,
inference_framework=InferenceFramework.TRANSFORMERS,
)
granite_vision_vlm_conversion_options = HuggingFaceVlmOptions(
@ -291,9 +304,15 @@ granite_vision_vlm_conversion_options = HuggingFaceVlmOptions(
# prompt="OCR the full page to markdown.",
prompt="OCR this image.",
response_format=ResponseFormat.MARKDOWN,
inference_framework=InferenceFramework.TRANSFORMERS,
)
class VlmModelType(str, Enum):
SMOLDOCLING = "smoldocling"
GRANITE_VISION = "granite_vision"
# Define an enum for the backend options
class PdfBackend(str, Enum):
"""Enum of valid PDF backends."""
@ -305,6 +324,7 @@ class PdfBackend(str, Enum):
# Define an enum for the ocr engines
@deprecated("Use ocr_factory.registered_enum")
class OcrEngine(str, Enum):
"""Enum of valid OCR engines."""
@ -324,16 +344,18 @@ class PipelineOptions(BaseModel):
document_timeout: Optional[float] = None
accelerator_options: AcceleratorOptions = AcceleratorOptions()
enable_remote_services: bool = False
allow_external_plugins: bool = False
class PaginatedPipelineOptions(PipelineOptions):
artifacts_path: Optional[Union[Path, str]] = None
images_scale: float = 1.0
generate_page_images: bool = False
generate_picture_images: bool = False
class VlmPipelineOptions(PaginatedPipelineOptions):
artifacts_path: Optional[Union[Path, str]] = None
generate_page_images: bool = True
force_backend_text: bool = (
@ -346,7 +368,6 @@ class VlmPipelineOptions(PaginatedPipelineOptions):
class PdfPipelineOptions(PaginatedPipelineOptions):
"""Options for the PDF pipeline."""
artifacts_path: Optional[Union[Path, str]] = None
do_table_structure: bool = True # True: perform table structure extraction
do_ocr: bool = True # True: perform OCR, replace programmatic PDF text
do_code_enrichment: bool = False # True: perform code OCR
@ -359,17 +380,10 @@ class PdfPipelineOptions(PaginatedPipelineOptions):
# If True, text from backend will be used instead of generated text
table_structure_options: TableStructureOptions = TableStructureOptions()
ocr_options: Union[
EasyOcrOptions,
TesseractCliOcrOptions,
TesseractOcrOptions,
OcrMacOptions,
RapidOcrOptions,
] = Field(EasyOcrOptions(), discriminator="kind")
picture_description_options: Annotated[
Union[PictureDescriptionApiOptions, PictureDescriptionVlmOptions],
Field(discriminator="kind"),
] = smolvlm_picture_description
ocr_options: OcrOptions = EasyOcrOptions()
picture_description_options: PictureDescriptionBaseOptions = (
smolvlm_picture_description
)
images_scale: float = 1.0
generate_page_images: bool = False
@ -384,3 +398,8 @@ class PdfPipelineOptions(PaginatedPipelineOptions):
)
generate_parsed_pages: bool = False
class PdfPipeline(str, Enum):
STANDARD = "standard"
VLM = "vlm"

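Besides the plugin-friendly `ClassVar` kinds, this file introduces the `PdfPipeline` and `VlmModelType` enums plus ready-made SmolDocling and Granite Vision presets. A short sketch of the equivalent programmatic pipeline selection, assuming the converter wiring shown later in this commit:

```python
# Run the VLM pipeline with the SmolDocling preset (an MLX preset also exists for Apple Silicon).
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    smoldocling_vlm_conversion_options,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

pipeline_options = VlmPipelineOptions()
pipeline_options.vlm_options = smoldocling_vlm_conversion_options

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline, pipeline_options=pipeline_options
        )
    }
)
# result = converter.convert("tests/data/pdf/2305.03393v1-pg9.pdf")
```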

@ -1,3 +1,4 @@
import hashlib
import logging
import math
import sys
@ -181,7 +182,14 @@ class DocumentConverter:
)
for format in self.allowed_formats
}
self.initialized_pipelines: Dict[Type[BasePipeline], BasePipeline] = {}
self.initialized_pipelines: Dict[
Tuple[Type[BasePipeline], str], BasePipeline
] = {}
def _get_pipeline_options_hash(self, pipeline_options: PipelineOptions) -> str:
"""Generate a hash of pipeline options to use as part of the cache key."""
options_str = str(pipeline_options.model_dump())
return hashlib.md5(options_str.encode("utf-8")).hexdigest()
def initialize_pipeline(self, format: InputFormat):
"""Initialize the conversion pipeline for the selected format."""
@ -279,31 +287,36 @@ class DocumentConverter:
yield item
def _get_pipeline(self, doc_format: InputFormat) -> Optional[BasePipeline]:
"""Retrieve or initialize a pipeline, reusing instances based on class and options."""
fopt = self.format_to_options.get(doc_format)
if fopt is None:
if fopt is None or fopt.pipeline_options is None:
return None
else:
pipeline_class = fopt.pipeline_cls
pipeline_options = fopt.pipeline_options
if pipeline_options is None:
return None
# TODO this will ignore if different options have been defined for the same pipeline class.
if (
pipeline_class not in self.initialized_pipelines
or self.initialized_pipelines[pipeline_class].pipeline_options
!= pipeline_options
):
self.initialized_pipelines[pipeline_class] = pipeline_class(
pipeline_class = fopt.pipeline_cls
pipeline_options = fopt.pipeline_options
options_hash = self._get_pipeline_options_hash(pipeline_options)
# Use a composite key to cache pipelines
cache_key = (pipeline_class, options_hash)
if cache_key not in self.initialized_pipelines:
_log.info(
f"Initializing pipeline for {pipeline_class.__name__} with options hash {options_hash}"
)
self.initialized_pipelines[cache_key] = pipeline_class(
pipeline_options=pipeline_options
)
return self.initialized_pipelines[pipeline_class]
else:
_log.debug(
f"Reusing cached pipeline for {pipeline_class.__name__} with options hash {options_hash}"
)
return self.initialized_pipelines[cache_key]
def _process_document(
self, in_doc: InputDocument, raises_on_error: bool
) -> ConversionResult:
valid = (
self.allowed_formats is not None and in_doc.format in self.allowed_formats
)
@ -345,7 +358,6 @@ class DocumentConverter:
else:
if raises_on_error:
raise ConversionError(f"Input document {in_doc.file} is not valid.")
else:
# invalid doc or not of desired format
conv_res = ConversionResult(

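Pipelines are now cached per `(pipeline class, options hash)` instead of per class alone, so two formats configured with different options no longer clobber each other's pipeline instance. An illustrative recomputation of that composite key, assuming the import path for `StandardPdfPipeline` matches the file shown further below:

```python
# Recreate the cache key used by DocumentConverter._get_pipeline (illustration only).
import hashlib

from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline


def options_hash(opts: PdfPipelineOptions) -> str:
    return hashlib.md5(str(opts.model_dump()).encode("utf-8")).hexdigest()


key_a = (StandardPdfPipeline, options_hash(PdfPipelineOptions(do_ocr=True)))
key_b = (StandardPdfPipeline, options_hash(PdfPipelineOptions(do_ocr=False)))
assert key_a != key_b  # different options now map to different cached pipelines
```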

@ -1,14 +1,22 @@
from abc import ABC, abstractmethod
from typing import Any, Generic, Iterable, Optional
from typing import Any, Generic, Iterable, Optional, Protocol, Type
from docling_core.types.doc import BoundingBox, DocItem, DoclingDocument, NodeItem
from typing_extensions import TypeVar
from docling.datamodel.base_models import ItemAndImageEnrichmentElement, Page
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import BaseOptions
from docling.datamodel.settings import settings
class BaseModelWithOptions(Protocol):
@classmethod
def get_options_type(cls) -> Type[BaseOptions]: ...
def __init__(self, *, options: BaseOptions, **kwargs): ...
class BasePageModel(ABC):
@abstractmethod
def __call__(


@ -2,7 +2,7 @@ import copy
import logging
from abc import abstractmethod
from pathlib import Path
from typing import Iterable, List
from typing import Iterable, List, Optional, Type
import numpy as np
from docling_core.types.doc import BoundingBox, CoordOrigin
@ -13,15 +13,22 @@ from scipy.ndimage import binary_dilation, find_objects, label
from docling.datamodel.base_models import Page
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import OcrOptions
from docling.datamodel.pipeline_options import AcceleratorOptions, OcrOptions
from docling.datamodel.settings import settings
from docling.models.base_model import BasePageModel
from docling.models.base_model import BaseModelWithOptions, BasePageModel
_log = logging.getLogger(__name__)
class BaseOcrModel(BasePageModel):
def __init__(self, enabled: bool, options: OcrOptions):
class BaseOcrModel(BasePageModel, BaseModelWithOptions):
def __init__(
self,
*,
enabled: bool,
artifacts_path: Optional[Path],
options: OcrOptions,
accelerator_options: AcceleratorOptions,
):
self.enabled = enabled
self.options = options
@ -186,3 +193,8 @@ class BaseOcrModel(BasePageModel):
self, conv_res: ConversionResult, page_batch: Iterable[Page]
) -> Iterable[Page]:
pass
@classmethod
@abstractmethod
def get_options_type(cls) -> Type[OcrOptions]:
pass


@ -2,7 +2,7 @@ import logging
import warnings
import zipfile
from pathlib import Path
from typing import Iterable, List, Optional
from typing import Iterable, List, Optional, Type
import numpy
from docling_core.types.doc import BoundingBox, CoordOrigin
@ -14,6 +14,7 @@ from docling.datamodel.pipeline_options import (
AcceleratorDevice,
AcceleratorOptions,
EasyOcrOptions,
OcrOptions,
)
from docling.datamodel.settings import settings
from docling.models.base_ocr_model import BaseOcrModel
@ -34,7 +35,12 @@ class EasyOcrModel(BaseOcrModel):
options: EasyOcrOptions,
accelerator_options: AcceleratorOptions,
):
super().__init__(enabled=enabled, options=options)
super().__init__(
enabled=enabled,
artifacts_path=artifacts_path,
options=options,
accelerator_options=accelerator_options,
)
self.options: EasyOcrOptions
self.scale = 3 # multiplier for 72 dpi == 216 dpi.
@ -180,3 +186,7 @@ class EasyOcrModel(BaseOcrModel):
self.draw_ocr_rects_and_cells(conv_res, page, ocr_rects)
yield page
@classmethod
def get_options_type(cls) -> Type[OcrOptions]:
return EasyOcrOptions


@ -0,0 +1,27 @@
import logging
from functools import lru_cache
from docling.models.factories.ocr_factory import OcrFactory
from docling.models.factories.picture_description_factory import (
PictureDescriptionFactory,
)
logger = logging.getLogger(__name__)
@lru_cache()
def get_ocr_factory(allow_external_plugins: bool = False) -> OcrFactory:
factory = OcrFactory()
factory.load_from_plugins(allow_external_plugins=allow_external_plugins)
logger.info("Registered ocr engines: %r", factory.registered_kind)
return factory
@lru_cache()
def get_picture_description_factory(
allow_external_plugins: bool = False,
) -> PictureDescriptionFactory:
factory = PictureDescriptionFactory()
factory.load_from_plugins(allow_external_plugins=allow_external_plugins)
logger.info("Registered picture descriptions: %r", factory.registered_kind)
return factory

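These cached helpers are the single discovery point for engines registered through plugins. A quick sketch of inspecting what is available (the exact lists depend on the installed plugins and platform):

```python
from docling.models.factories import get_ocr_factory, get_picture_description_factory

ocr_factory = get_ocr_factory(allow_external_plugins=False)
print(ocr_factory.registered_kind)
# e.g. ['easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract'] with the built-in plugin

pd_factory = get_picture_description_factory()
print(pd_factory.registered_kind)  # e.g. ['vlm', 'api']
```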

@ -0,0 +1,122 @@
import enum
import logging
from abc import ABCMeta
from typing import Generic, Optional, Type, TypeVar
from pluggy import PluginManager
from pydantic import BaseModel
from docling.datamodel.pipeline_options import BaseOptions
from docling.models.base_model import BaseModelWithOptions
A = TypeVar("A", bound=BaseModelWithOptions)
logger = logging.getLogger(__name__)
class FactoryMeta(BaseModel):
kind: str
plugin_name: str
module: str
class BaseFactory(Generic[A], metaclass=ABCMeta):
default_plugin_name = "docling"
def __init__(self, plugin_attr_name: str, plugin_name=default_plugin_name):
self.plugin_name = plugin_name
self.plugin_attr_name = plugin_attr_name
self._classes: dict[Type[BaseOptions], Type[A]] = {}
self._meta: dict[Type[BaseOptions], FactoryMeta] = {}
@property
def registered_kind(self) -> list[str]:
return list(opt.kind for opt in self._classes.keys())
def get_enum(self) -> enum.Enum:
return enum.Enum(
self.plugin_attr_name + "_enum",
names={kind: kind for kind in self.registered_kind},
type=str,
module=__name__,
)
@property
def classes(self):
return self._classes
@property
def registered_meta(self):
return self._meta
def create_instance(self, options: BaseOptions, **kwargs) -> A:
try:
_cls = self._classes[type(options)]
return _cls(options=options, **kwargs)
except KeyError:
raise RuntimeError(self._err_msg_on_class_not_found(options.kind))
def create_options(self, kind: str, *args, **kwargs) -> BaseOptions:
for opt_cls, _ in self._classes.items():
if opt_cls.kind == kind:
return opt_cls(*args, **kwargs)
raise RuntimeError(self._err_msg_on_class_not_found(kind))
def _err_msg_on_class_not_found(self, kind: str):
msg = []
for opt, cls in self._classes.items():
msg.append(f"\t{opt.kind!r} => {cls!r}")
msg_str = "\n".join(msg)
return f"No class found with the name {kind!r}, known classes are:\n{msg_str}"
def register(self, cls: Type[A], plugin_name: str, plugin_module_name: str):
opt_type = cls.get_options_type()
if opt_type in self._classes:
raise ValueError(
f"{opt_type.kind!r} already registered to class {self._classes[opt_type]!r}"
)
self._classes[opt_type] = cls
self._meta[opt_type] = FactoryMeta(
kind=opt_type.kind, plugin_name=plugin_name, module=plugin_module_name
)
def load_from_plugins(
self, plugin_name: Optional[str] = None, allow_external_plugins: bool = False
):
plugin_name = plugin_name or self.plugin_name
plugin_manager = PluginManager(plugin_name)
plugin_manager.load_setuptools_entrypoints(plugin_name)
for plugin_name, plugin_module in plugin_manager.list_name_plugin():
plugin_module_name = str(plugin_module.__name__) # type: ignore
if not allow_external_plugins and not plugin_module_name.startswith(
"docling."
):
logger.warning(
f"The plugin {plugin_name} will not be loaded because Docling is being executed with allow_external_plugins=false."
)
continue
attr = getattr(plugin_module, self.plugin_attr_name, None)
if callable(attr):
logger.info("Loading plugin %r", plugin_name)
config = attr()
self.process_plugin(config, plugin_name, plugin_module_name)
def process_plugin(self, config, plugin_name: str, plugin_module_name: str):
for item in config[self.plugin_attr_name]:
try:
self.register(item, plugin_name, plugin_module_name)
except ValueError:
logger.warning("%r already registered", item)


@ -0,0 +1,11 @@
import logging
from docling.models.base_ocr_model import BaseOcrModel
from docling.models.factories.base_factory import BaseFactory
logger = logging.getLogger(__name__)
class OcrFactory(BaseFactory[BaseOcrModel]):
def __init__(self, *args, **kwargs):
super().__init__("ocr_engines", *args, **kwargs)


@ -0,0 +1,11 @@
import logging
from docling.models.factories.base_factory import BaseFactory
from docling.models.picture_description_base_model import PictureDescriptionBaseModel
logger = logging.getLogger(__name__)
class PictureDescriptionFactory(BaseFactory[PictureDescriptionBaseModel]):
def __init__(self, *args, **kwargs):
super().__init__("picture_description", *args, **kwargs)


@ -0,0 +1,137 @@
import logging
import time
from pathlib import Path
from typing import Iterable, List, Optional
from docling.datamodel.base_models import Page, VlmPrediction
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import (
AcceleratorDevice,
AcceleratorOptions,
HuggingFaceVlmOptions,
)
from docling.datamodel.settings import settings
from docling.models.base_model import BasePageModel
from docling.utils.accelerator_utils import decide_device
from docling.utils.profiling import TimeRecorder
_log = logging.getLogger(__name__)
class HuggingFaceMlxModel(BasePageModel):
def __init__(
self,
enabled: bool,
artifacts_path: Optional[Path],
accelerator_options: AcceleratorOptions,
vlm_options: HuggingFaceVlmOptions,
):
self.enabled = enabled
self.vlm_options = vlm_options
if self.enabled:
try:
from mlx_vlm import generate, load # type: ignore
from mlx_vlm.prompt_utils import apply_chat_template # type: ignore
from mlx_vlm.utils import load_config, stream_generate # type: ignore
except ImportError:
raise ImportError(
"mlx-vlm is not installed. Please install it via `pip install mlx-vlm` to use MLX VLM models."
)
repo_cache_folder = vlm_options.repo_id.replace("/", "--")
self.apply_chat_template = apply_chat_template
self.stream_generate = stream_generate
# PARAMETERS:
if artifacts_path is None:
artifacts_path = self.download_models(self.vlm_options.repo_id)
elif (artifacts_path / repo_cache_folder).exists():
artifacts_path = artifacts_path / repo_cache_folder
self.param_question = vlm_options.prompt # "Perform Layout Analysis."
## Load the model
self.vlm_model, self.processor = load(artifacts_path)
self.config = load_config(artifacts_path)
@staticmethod
def download_models(
repo_id: str,
local_dir: Optional[Path] = None,
force: bool = False,
progress: bool = False,
) -> Path:
from huggingface_hub import snapshot_download
from huggingface_hub.utils import disable_progress_bars
if not progress:
disable_progress_bars()
download_path = snapshot_download(
repo_id=repo_id,
force_download=force,
local_dir=local_dir,
# revision="v0.0.1",
)
return Path(download_path)
def __call__(
self, conv_res: ConversionResult, page_batch: Iterable[Page]
) -> Iterable[Page]:
for page in page_batch:
assert page._backend is not None
if not page._backend.is_valid():
yield page
else:
with TimeRecorder(conv_res, "vlm"):
assert page.size is not None
hi_res_image = page.get_image(scale=2.0) # 144dpi
# hi_res_image = page.get_image(scale=1.0) # 72dpi
if hi_res_image is not None:
im_width, im_height = hi_res_image.size
# populate page_tags with predicted doc tags
page_tags = ""
if hi_res_image:
if hi_res_image.mode != "RGB":
hi_res_image = hi_res_image.convert("RGB")
prompt = self.apply_chat_template(
self.processor, self.config, self.param_question, num_images=1
)
start_time = time.time()
# Call model to generate:
output = ""
for token in self.stream_generate(
self.vlm_model,
self.processor,
prompt,
[hi_res_image],
max_tokens=4096,
verbose=False,
):
output += token.text
if "</doctag>" in token.text:
break
generation_time = time.time() - start_time
page_tags = output
# inference_time = time.time() - start_time
# tokens_per_second = num_tokens / generation_time
# print("")
# print(f"Page Inference Time: {inference_time:.2f} seconds")
# print(f"Total tokens on page: {num_tokens:.2f}")
# print(f"Tokens/sec: {tokens_per_second:.2f}")
# print("")
page.predictions.vlm_response = VlmPrediction(text=page_tags)
yield page


@ -1,13 +1,19 @@
import logging
import sys
import tempfile
from typing import Iterable, Optional, Tuple
from pathlib import Path
from typing import Iterable, Optional, Tuple, Type
from docling_core.types.doc import BoundingBox, CoordOrigin
from docling_core.types.doc.page import BoundingRectangle, TextCell
from docling.datamodel.base_models import Page
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import OcrMacOptions
from docling.datamodel.pipeline_options import (
AcceleratorOptions,
OcrMacOptions,
OcrOptions,
)
from docling.datamodel.settings import settings
from docling.models.base_ocr_model import BaseOcrModel
from docling.utils.profiling import TimeRecorder
@ -16,13 +22,26 @@ _log = logging.getLogger(__name__)
class OcrMacModel(BaseOcrModel):
def __init__(self, enabled: bool, options: OcrMacOptions):
super().__init__(enabled=enabled, options=options)
def __init__(
self,
enabled: bool,
artifacts_path: Optional[Path],
options: OcrMacOptions,
accelerator_options: AcceleratorOptions,
):
super().__init__(
enabled=enabled,
artifacts_path=artifacts_path,
options=options,
accelerator_options=accelerator_options,
)
self.options: OcrMacOptions
self.scale = 3 # multiplier for 72 dpi == 216 dpi.
if self.enabled:
if "darwin" != sys.platform:
raise RuntimeError(f"OcrMac is only supported on Mac.")
install_errmsg = (
"ocrmac is not correctly installed. "
"Please install it via `pip install ocrmac` to use this OCR engine. "
@ -121,3 +140,7 @@ class OcrMacModel(BaseOcrModel):
self.draw_ocr_rects_and_cells(conv_res, page, ocr_rects)
yield page
@classmethod
def get_options_type(cls) -> Type[OcrOptions]:
return OcrMacOptions


@ -63,7 +63,13 @@ class PagePreprocessingModel(BasePageModel):
def draw_text_boxes(image, cells, show: bool = False):
draw = ImageDraw.Draw(image)
for c in cells:
x0, y0, x1, y1 = c.bbox.as_tuple()
x0, y0, x1, y1 = (
c.to_bounding_box().l,
c.to_bounding_box().t,
c.to_bounding_box().r,
c.to_bounding_box().b,
)
draw.rectangle([(x0, y0), (x1, y1)], outline="red")
if show:
image.show()


@ -1,13 +1,18 @@
import base64
import io
import logging
from typing import Iterable, List, Optional
from pathlib import Path
from typing import Iterable, List, Optional, Type, Union
import requests
from PIL import Image
from pydantic import BaseModel, ConfigDict
from docling.datamodel.pipeline_options import PictureDescriptionApiOptions
from docling.datamodel.pipeline_options import (
AcceleratorOptions,
PictureDescriptionApiOptions,
PictureDescriptionBaseOptions,
)
from docling.exceptions import OperationNotAllowed
from docling.models.picture_description_base_model import PictureDescriptionBaseModel
@ -46,13 +51,25 @@ class ApiResponse(BaseModel):
class PictureDescriptionApiModel(PictureDescriptionBaseModel):
# elements_batch_size = 4
@classmethod
def get_options_type(cls) -> Type[PictureDescriptionBaseOptions]:
return PictureDescriptionApiOptions
def __init__(
self,
enabled: bool,
enable_remote_services: bool,
artifacts_path: Optional[Union[Path, str]],
options: PictureDescriptionApiOptions,
accelerator_options: AcceleratorOptions,
):
super().__init__(enabled=enabled, options=options)
super().__init__(
enabled=enabled,
enable_remote_services=enable_remote_services,
artifacts_path=artifacts_path,
options=options,
accelerator_options=accelerator_options,
)
self.options: PictureDescriptionApiOptions
if self.enabled:


@ -1,6 +1,7 @@
import logging
from abc import abstractmethod
from pathlib import Path
from typing import Any, Iterable, List, Optional, Union
from typing import Any, Iterable, List, Optional, Type, Union
from docling_core.types.doc import (
DoclingDocument,
@ -13,20 +14,30 @@ from docling_core.types.doc.document import ( # TODO: move import to docling_co
)
from PIL import Image
from docling.datamodel.pipeline_options import PictureDescriptionBaseOptions
from docling.datamodel.pipeline_options import (
AcceleratorOptions,
PictureDescriptionBaseOptions,
)
from docling.models.base_model import (
BaseItemAndImageEnrichmentModel,
BaseModelWithOptions,
ItemAndImageEnrichmentElement,
)
class PictureDescriptionBaseModel(BaseItemAndImageEnrichmentModel):
class PictureDescriptionBaseModel(
BaseItemAndImageEnrichmentModel, BaseModelWithOptions
):
images_scale: float = 2.0
def __init__(
self,
*,
enabled: bool,
enable_remote_services: bool,
artifacts_path: Optional[Union[Path, str]],
options: PictureDescriptionBaseOptions,
accelerator_options: AcceleratorOptions,
):
self.enabled = enabled
self.options = options
@ -62,3 +73,8 @@ class PictureDescriptionBaseModel(BaseItemAndImageEnrichmentModel):
PictureDescriptionData(text=output, provenance=self.provenance)
)
yield item
@classmethod
@abstractmethod
def get_options_type(cls) -> Type[PictureDescriptionBaseOptions]:
pass

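Picture-description models now declare their options type and receive `enable_remote_services` uniformly, so the API-backed model only runs with an explicit opt-in. A hedged configuration sketch (the endpoint URL is the placeholder default from this file):

```python
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    PictureDescriptionApiOptions,
)

opts = PdfPipelineOptions(
    do_picture_description=True,
    enable_remote_services=True,  # required before the API model may call out
)
opts.picture_description_options = PictureDescriptionApiOptions(
    url="http://localhost:8000/v1/chat/completions",
)
```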

@ -1,10 +1,11 @@
from pathlib import Path
from typing import Iterable, Optional, Union
from typing import Iterable, Optional, Type, Union
from PIL import Image
from docling.datamodel.pipeline_options import (
AcceleratorOptions,
PictureDescriptionBaseOptions,
PictureDescriptionVlmOptions,
)
from docling.models.picture_description_base_model import PictureDescriptionBaseModel
@ -13,14 +14,25 @@ from docling.utils.accelerator_utils import decide_device
class PictureDescriptionVlmModel(PictureDescriptionBaseModel):
@classmethod
def get_options_type(cls) -> Type[PictureDescriptionBaseOptions]:
return PictureDescriptionVlmOptions
def __init__(
self,
enabled: bool,
enable_remote_services: bool,
artifacts_path: Optional[Union[Path, str]],
options: PictureDescriptionVlmOptions,
accelerator_options: AcceleratorOptions,
):
super().__init__(enabled=enabled, options=options)
super().__init__(
enabled=enabled,
enable_remote_services=enable_remote_services,
artifacts_path=artifacts_path,
options=options,
accelerator_options=accelerator_options,
)
self.options: PictureDescriptionVlmOptions
if self.enabled:


@ -0,0 +1,28 @@
from docling.models.easyocr_model import EasyOcrModel
from docling.models.ocr_mac_model import OcrMacModel
from docling.models.picture_description_api_model import PictureDescriptionApiModel
from docling.models.picture_description_vlm_model import PictureDescriptionVlmModel
from docling.models.rapid_ocr_model import RapidOcrModel
from docling.models.tesseract_ocr_cli_model import TesseractOcrCliModel
from docling.models.tesseract_ocr_model import TesseractOcrModel
def ocr_engines():
return {
"ocr_engines": [
EasyOcrModel,
OcrMacModel,
RapidOcrModel,
TesseractOcrModel,
TesseractOcrCliModel,
]
}
def picture_description():
return {
"picture_description": [
PictureDescriptionVlmModel,
PictureDescriptionApiModel,
]
}

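This defaults module is itself just a plugin: it exposes the `ocr_engines()` and `picture_description()` hooks that the factories discover via pluggy. A hedged sketch of what a third-party OCR plugin could look like; every name here (`my_ocr_plugin`, `MyOcrOptions`, `MyOcrModel`, the `my-ocr` kind) is hypothetical, only the hook shape and the base classes come from this commit:

```python
# my_ocr_plugin.py -- hypothetical external plugin module; it is only loaded when
# allow_external_plugins is enabled (see BaseFactory.load_from_plugins above).
from typing import ClassVar, Iterable, List, Literal, Type

from docling.datamodel.base_models import Page
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import OcrOptions
from docling.models.base_ocr_model import BaseOcrModel


class MyOcrOptions(OcrOptions):
    kind: ClassVar[Literal["my-ocr"]] = "my-ocr"
    lang: List[str] = ["en"]


class MyOcrModel(BaseOcrModel):
    def __call__(
        self, conv_res: ConversionResult, page_batch: Iterable[Page]
    ) -> Iterable[Page]:
        # No-op OCR, purely for illustration.
        yield from page_batch

    @classmethod
    def get_options_type(cls) -> Type[OcrOptions]:
        return MyOcrOptions


def ocr_engines():
    # The hook name must match OcrFactory's plugin_attr_name ("ocr_engines").
    return {"ocr_engines": [MyOcrModel]}
```

The plugin would then advertise this module under the `docling` entry-point group (e.g. `[project.entry-points."docling"]` in its `pyproject.toml`), and users opt in with `allow_external_plugins=True` or the new `--allow-external-plugins` CLI flag.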

@ -1,5 +1,6 @@
import logging
from typing import Iterable
from pathlib import Path
from typing import Iterable, Optional, Type
import numpy
from docling_core.types.doc import BoundingBox, CoordOrigin
@ -10,6 +11,7 @@ from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import (
AcceleratorDevice,
AcceleratorOptions,
OcrOptions,
RapidOcrOptions,
)
from docling.datamodel.settings import settings
@ -24,10 +26,16 @@ class RapidOcrModel(BaseOcrModel):
def __init__(
self,
enabled: bool,
artifacts_path: Optional[Path],
options: RapidOcrOptions,
accelerator_options: AcceleratorOptions,
):
super().__init__(enabled=enabled, options=options)
super().__init__(
enabled=enabled,
artifacts_path=artifacts_path,
options=options,
accelerator_options=accelerator_options,
)
self.options: RapidOcrOptions
self.scale = 3 # multiplier for 72 dpi == 216 dpi.
@ -135,3 +143,7 @@ class RapidOcrModel(BaseOcrModel):
self.draw_ocr_rects_and_cells(conv_res, page, ocr_rects)
yield page
@classmethod
def get_options_type(cls) -> Type[OcrOptions]:
return RapidOcrOptions


@ -3,8 +3,9 @@ import io
import logging
import os
import tempfile
from pathlib import Path
from subprocess import DEVNULL, PIPE, Popen
from typing import Iterable, List, Optional, Tuple
from typing import Iterable, List, Optional, Tuple, Type
import pandas as pd
from docling_core.types.doc import BoundingBox, CoordOrigin
@ -12,7 +13,11 @@ from docling_core.types.doc.page import BoundingRectangle, TextCell
from docling.datamodel.base_models import Page
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import TesseractCliOcrOptions
from docling.datamodel.pipeline_options import (
AcceleratorOptions,
OcrOptions,
TesseractCliOcrOptions,
)
from docling.datamodel.settings import settings
from docling.models.base_ocr_model import BaseOcrModel
from docling.utils.ocr_utils import map_tesseract_script
@ -22,8 +27,19 @@ _log = logging.getLogger(__name__)
class TesseractOcrCliModel(BaseOcrModel):
def __init__(self, enabled: bool, options: TesseractCliOcrOptions):
super().__init__(enabled=enabled, options=options)
def __init__(
self,
enabled: bool,
artifacts_path: Optional[Path],
options: TesseractCliOcrOptions,
accelerator_options: AcceleratorOptions,
):
super().__init__(
enabled=enabled,
artifacts_path=artifacts_path,
options=options,
accelerator_options=accelerator_options,
)
self.options: TesseractCliOcrOptions
self.scale = 3 # multiplier for 72 dpi == 216 dpi.
@ -257,3 +273,7 @@ class TesseractOcrCliModel(BaseOcrModel):
self.draw_ocr_rects_and_cells(conv_res, page, ocr_rects)
yield page
@classmethod
def get_options_type(cls) -> Type[OcrOptions]:
return TesseractCliOcrOptions


@ -1,12 +1,17 @@
import logging
from typing import Iterable
from pathlib import Path
from typing import Iterable, Optional, Type
from docling_core.types.doc import BoundingBox, CoordOrigin
from docling_core.types.doc.page import BoundingRectangle, TextCell
from docling.datamodel.base_models import Page
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import TesseractOcrOptions
from docling.datamodel.pipeline_options import (
AcceleratorOptions,
OcrOptions,
TesseractOcrOptions,
)
from docling.datamodel.settings import settings
from docling.models.base_ocr_model import BaseOcrModel
from docling.utils.ocr_utils import map_tesseract_script
@ -16,8 +21,19 @@ _log = logging.getLogger(__name__)
class TesseractOcrModel(BaseOcrModel):
def __init__(self, enabled: bool, options: TesseractOcrOptions):
super().__init__(enabled=enabled, options=options)
def __init__(
self,
enabled: bool,
artifacts_path: Optional[Path],
options: TesseractOcrOptions,
accelerator_options: AcceleratorOptions,
):
super().__init__(
enabled=enabled,
artifacts_path=artifacts_path,
options=options,
accelerator_options=accelerator_options,
)
self.options: TesseractOcrOptions
self.scale = 3 # multiplier for 72 dpi == 216 dpi.
@ -200,3 +216,7 @@ class TesseractOcrModel(BaseOcrModel):
self.draw_ocr_rects_and_cells(conv_res, page, ocr_rects)
yield page
@classmethod
def get_options_type(cls) -> Type[OcrOptions]:
return TesseractOcrOptions


@ -10,16 +10,7 @@ from docling.backend.abstract_backend import AbstractDocumentBackend
from docling.backend.pdf_backend import PdfDocumentBackend
from docling.datamodel.base_models import AssembledUnit, Page
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import (
EasyOcrOptions,
OcrMacOptions,
PdfPipelineOptions,
PictureDescriptionApiOptions,
PictureDescriptionVlmOptions,
RapidOcrOptions,
TesseractCliOcrOptions,
TesseractOcrOptions,
)
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.settings import settings
from docling.models.base_ocr_model import BaseOcrModel
from docling.models.code_formula_model import CodeFormulaModel, CodeFormulaModelOptions
@ -27,22 +18,16 @@ from docling.models.document_picture_classifier import (
DocumentPictureClassifier,
DocumentPictureClassifierOptions,
)
from docling.models.easyocr_model import EasyOcrModel
from docling.models.factories import get_ocr_factory, get_picture_description_factory
from docling.models.layout_model import LayoutModel
from docling.models.ocr_mac_model import OcrMacModel
from docling.models.page_assemble_model import PageAssembleModel, PageAssembleOptions
from docling.models.page_preprocessing_model import (
PagePreprocessingModel,
PagePreprocessingOptions,
)
from docling.models.picture_description_api_model import PictureDescriptionApiModel
from docling.models.picture_description_base_model import PictureDescriptionBaseModel
from docling.models.picture_description_vlm_model import PictureDescriptionVlmModel
from docling.models.rapid_ocr_model import RapidOcrModel
from docling.models.readingorder_model import ReadingOrderModel, ReadingOrderOptions
from docling.models.table_structure_model import TableStructureModel
from docling.models.tesseract_ocr_cli_model import TesseractOcrCliModel
from docling.models.tesseract_ocr_model import TesseractOcrModel
from docling.pipeline.base_pipeline import PaginatedPipeline
from docling.utils.model_downloader import download_models
from docling.utils.profiling import ProfilingScope, TimeRecorder
@ -78,10 +63,7 @@ class StandardPdfPipeline(PaginatedPipeline):
self.glm_model = ReadingOrderModel(options=ReadingOrderOptions())
if (ocr_model := self.get_ocr_model(artifacts_path=artifacts_path)) is None:
raise RuntimeError(
f"The specified OCR kind is not supported: {pipeline_options.ocr_options.kind}."
)
ocr_model = self.get_ocr_model(artifacts_path=artifacts_path)
self.build_pipe = [
# Pre-processing
@ -164,66 +146,30 @@ class StandardPdfPipeline(PaginatedPipeline):
output_dir = download_models(output_dir=local_dir, force=force, progress=False)
return output_dir
def get_ocr_model(
self, artifacts_path: Optional[Path] = None
) -> Optional[BaseOcrModel]:
if isinstance(self.pipeline_options.ocr_options, EasyOcrOptions):
return EasyOcrModel(
enabled=self.pipeline_options.do_ocr,
artifacts_path=artifacts_path,
options=self.pipeline_options.ocr_options,
accelerator_options=self.pipeline_options.accelerator_options,
)
elif isinstance(self.pipeline_options.ocr_options, TesseractCliOcrOptions):
return TesseractOcrCliModel(
enabled=self.pipeline_options.do_ocr,
options=self.pipeline_options.ocr_options,
)
elif isinstance(self.pipeline_options.ocr_options, TesseractOcrOptions):
return TesseractOcrModel(
enabled=self.pipeline_options.do_ocr,
options=self.pipeline_options.ocr_options,
)
elif isinstance(self.pipeline_options.ocr_options, RapidOcrOptions):
return RapidOcrModel(
enabled=self.pipeline_options.do_ocr,
options=self.pipeline_options.ocr_options,
accelerator_options=self.pipeline_options.accelerator_options,
)
elif isinstance(self.pipeline_options.ocr_options, OcrMacOptions):
if "darwin" != sys.platform:
raise RuntimeError(
f"The specified OCR type is only supported on Mac: {self.pipeline_options.ocr_options.kind}."
)
return OcrMacModel(
enabled=self.pipeline_options.do_ocr,
options=self.pipeline_options.ocr_options,
)
return None
def get_ocr_model(self, artifacts_path: Optional[Path] = None) -> BaseOcrModel:
factory = get_ocr_factory(
allow_external_plugins=self.pipeline_options.allow_external_plugins
)
return factory.create_instance(
options=self.pipeline_options.ocr_options,
enabled=self.pipeline_options.do_ocr,
artifacts_path=artifacts_path,
accelerator_options=self.pipeline_options.accelerator_options,
)
def get_picture_description_model(
self, artifacts_path: Optional[Path] = None
) -> Optional[PictureDescriptionBaseModel]:
if isinstance(
self.pipeline_options.picture_description_options,
PictureDescriptionApiOptions,
):
return PictureDescriptionApiModel(
enabled=self.pipeline_options.do_picture_description,
enable_remote_services=self.pipeline_options.enable_remote_services,
options=self.pipeline_options.picture_description_options,
)
elif isinstance(
self.pipeline_options.picture_description_options,
PictureDescriptionVlmOptions,
):
return PictureDescriptionVlmModel(
enabled=self.pipeline_options.do_picture_description,
artifacts_path=artifacts_path,
options=self.pipeline_options.picture_description_options,
accelerator_options=self.pipeline_options.accelerator_options,
)
return None
factory = get_picture_description_factory(
allow_external_plugins=self.pipeline_options.allow_external_plugins
)
return factory.create_instance(
options=self.pipeline_options.picture_description_options,
enabled=self.pipeline_options.do_picture_description,
enable_remote_services=self.pipeline_options.enable_remote_services,
artifacts_path=artifacts_path,
accelerator_options=self.pipeline_options.accelerator_options,
)
def initialize_page(self, conv_res: ConversionResult, page: Page) -> Page:
with TimeRecorder(conv_res, "page_init"):

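With the factory-backed `get_ocr_model` and `get_picture_description_model`, the standard pipeline no longer needs an isinstance ladder; picking an engine is just a matter of setting the options object. A small sketch, assuming the usual converter wiring:

```python
# Select the Tesseract CLI engine for the standard PDF pipeline via options alone.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

opts = PdfPipelineOptions(do_ocr=True)
opts.ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
opts.allow_external_plugins = False  # set True to let third-party engines register too

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
```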

@ -1,30 +1,13 @@
import itertools
import logging
import re
import warnings
from io import BytesIO
# from io import BytesIO
from pathlib import Path
from typing import Optional
from typing import List, Optional, Union, cast
from docling_core.types import DoclingDocument
from docling_core.types.doc import (
BoundingBox,
DocItem,
DocItemLabel,
DoclingDocument,
GroupLabel,
ImageRef,
ImageRefMode,
PictureItem,
ProvenanceItem,
Size,
TableCell,
TableData,
TableItem,
)
from docling_core.types.doc.tokens import DocumentToken, TableToken
# from docling_core.types import DoclingDocument
from docling_core.types.doc import BoundingBox, DocItem, ImageRef, PictureItem, TextItem
from docling_core.types.doc.document import DocTagsDocument
from PIL import Image as PILImage
from docling.backend.abstract_backend import AbstractDocumentBackend
from docling.backend.md_backend import MarkdownDocumentBackend
@ -32,11 +15,12 @@ from docling.backend.pdf_backend import PdfDocumentBackend
from docling.datamodel.base_models import InputFormat, Page
from docling.datamodel.document import ConversionResult, InputDocument
from docling.datamodel.pipeline_options import (
PdfPipelineOptions,
InferenceFramework,
ResponseFormat,
VlmPipelineOptions,
)
from docling.datamodel.settings import settings
from docling.models.hf_mlx_model import HuggingFaceMlxModel
from docling.models.hf_vlm_model import HuggingFaceVlmModel
from docling.pipeline.base_pipeline import PaginatedPipeline
from docling.utils.profiling import ProfilingScope, TimeRecorder
@ -50,12 +34,6 @@ class VlmPipeline(PaginatedPipeline):
super().__init__(pipeline_options)
self.keep_backend = True
warnings.warn(
"The VlmPipeline is currently experimental and may change in upcoming versions without notice.",
category=UserWarning,
stacklevel=2,
)
self.pipeline_options: VlmPipelineOptions
artifacts_path: Optional[Path] = None
@ -79,14 +57,27 @@ class VlmPipeline(PaginatedPipeline):
self.keep_images = self.pipeline_options.generate_page_images
self.build_pipe = [
HuggingFaceVlmModel(
enabled=True, # must be always enabled for this pipeline to make sense.
artifacts_path=artifacts_path,
accelerator_options=pipeline_options.accelerator_options,
vlm_options=self.pipeline_options.vlm_options,
),
]
if (
self.pipeline_options.vlm_options.inference_framework
== InferenceFramework.MLX
):
self.build_pipe = [
HuggingFaceMlxModel(
enabled=True, # must be always enabled for this pipeline to make sense.
artifacts_path=artifacts_path,
accelerator_options=pipeline_options.accelerator_options,
vlm_options=self.pipeline_options.vlm_options,
),
]
else:
self.build_pipe = [
HuggingFaceVlmModel(
enabled=True, # must be always enabled for this pipeline to make sense.
artifacts_path=artifacts_path,
accelerator_options=pipeline_options.accelerator_options,
vlm_options=self.pipeline_options.vlm_options,
),
]
self.enrichment_pipe = [
# Other models working on `NodeItem` elements in the DoclingDocument
@ -100,6 +91,17 @@ class VlmPipeline(PaginatedPipeline):
return page
def extract_text_from_backend(
self, page: Page, bbox: Union[BoundingBox, None]
) -> str:
# Extract the text under the given bounding box (page coordinates) from the page backend, if available
text = ""
if bbox:
if page.size:
if page._backend:
text = page._backend.get_text_in_rect(bbox)
return text
def _assemble_document(self, conv_res: ConversionResult) -> ConversionResult:
with TimeRecorder(conv_res, "doc_assemble", scope=ProfilingScope.DOCUMENT):
@ -107,7 +109,45 @@ class VlmPipeline(PaginatedPipeline):
self.pipeline_options.vlm_options.response_format
== ResponseFormat.DOCTAGS
):
conv_res.document = self._turn_tags_into_doc(conv_res.pages)
doctags_list = []
image_list = []
for page in conv_res.pages:
predicted_doctags = ""
img = PILImage.new("RGB", (1, 1), "rgb(255,255,255)")
if page.predictions.vlm_response:
predicted_doctags = page.predictions.vlm_response.text
if page.image:
img = page.image
image_list.append(img)
doctags_list.append(predicted_doctags)
doctags_list_c = cast(List[Union[Path, str]], doctags_list)
image_list_c = cast(List[Union[Path, PILImage.Image]], image_list)
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs(
doctags_list_c, image_list_c
)
conv_res.document.load_from_doctags(doctags_doc)
# If forced backend text, replace model predicted text with backend one
if page.size:
if self.force_backend_text:
scale = self.pipeline_options.images_scale
for element, _level in conv_res.document.iterate_items():
if (
not isinstance(element, TextItem)
or len(element.prov) == 0
):
continue
crop_bbox = (
element.prov[0]
.bbox.scaled(scale=scale)
.to_top_left_origin(
page_height=page.size.height * scale
)
)
txt = self.extract_text_from_backend(page, crop_bbox)
element.text = txt
element.orig = txt
elif (
self.pipeline_options.vlm_options.response_format
== ResponseFormat.MARKDOWN
@ -165,366 +205,6 @@ class VlmPipeline(PaginatedPipeline):
)
return backend.convert()
def _turn_tags_into_doc(self, pages: list[Page]) -> DoclingDocument:
###############################################
# Tag definitions and color mappings
###############################################
# Maps the recognized tag to a Docling label.
# Code items will be given DocItemLabel.CODE
tag_to_doclabel = {
"title": DocItemLabel.TITLE,
"document_index": DocItemLabel.DOCUMENT_INDEX,
"otsl": DocItemLabel.TABLE,
"section_header_level_1": DocItemLabel.SECTION_HEADER,
"checkbox_selected": DocItemLabel.CHECKBOX_SELECTED,
"checkbox_unselected": DocItemLabel.CHECKBOX_UNSELECTED,
"text": DocItemLabel.TEXT,
"page_header": DocItemLabel.PAGE_HEADER,
"page_footer": DocItemLabel.PAGE_FOOTER,
"formula": DocItemLabel.FORMULA,
"caption": DocItemLabel.CAPTION,
"picture": DocItemLabel.PICTURE,
"list_item": DocItemLabel.LIST_ITEM,
"footnote": DocItemLabel.FOOTNOTE,
"code": DocItemLabel.CODE,
}
# Maps each tag to an associated bounding box color.
tag_to_color = {
"title": "blue",
"document_index": "darkblue",
"otsl": "green",
"section_header_level_1": "purple",
"checkbox_selected": "black",
"checkbox_unselected": "gray",
"text": "red",
"page_header": "orange",
"page_footer": "cyan",
"formula": "pink",
"caption": "magenta",
"picture": "yellow",
"list_item": "brown",
"footnote": "darkred",
"code": "lightblue",
}
def extract_bounding_box(text_chunk: str) -> Optional[BoundingBox]:
"""Extracts <loc_...> bounding box coords from the chunk, normalized by / 500."""
coords = re.findall(r"<loc_(\d+)>", text_chunk)
if len(coords) == 4:
l, t, r, b = map(float, coords)
return BoundingBox(l=l / 500, t=t / 500, r=r / 500, b=b / 500)
return None
def extract_inner_text(text_chunk: str) -> str:
"""Strips all <...> tags inside the chunk to get the raw text content."""
return re.sub(r"<.*?>", "", text_chunk, flags=re.DOTALL).strip()
def extract_text_from_backend(page: Page, bbox: BoundingBox | None) -> str:
# Convert bounding box normalized to 0-100 into page coordinates for cropping
text = ""
if bbox:
if page.size:
bbox.l = bbox.l * page.size.width
bbox.t = bbox.t * page.size.height
bbox.r = bbox.r * page.size.width
bbox.b = bbox.b * page.size.height
if page._backend:
text = page._backend.get_text_in_rect(bbox)
return text
def otsl_parse_texts(texts, tokens):
split_word = TableToken.OTSL_NL.value
split_row_tokens = [
list(y)
for x, y in itertools.groupby(tokens, lambda z: z == split_word)
if not x
]
table_cells = []
r_idx = 0
c_idx = 0
def count_right(tokens, c_idx, r_idx, which_tokens):
span = 0
c_idx_iter = c_idx
while tokens[r_idx][c_idx_iter] in which_tokens:
c_idx_iter += 1
span += 1
if c_idx_iter >= len(tokens[r_idx]):
return span
return span
def count_down(tokens, c_idx, r_idx, which_tokens):
span = 0
r_idx_iter = r_idx
while tokens[r_idx_iter][c_idx] in which_tokens:
r_idx_iter += 1
span += 1
if r_idx_iter >= len(tokens):
return span
return span
for i, text in enumerate(texts):
cell_text = ""
if text in [
TableToken.OTSL_FCEL.value,
TableToken.OTSL_ECEL.value,
TableToken.OTSL_CHED.value,
TableToken.OTSL_RHED.value,
TableToken.OTSL_SROW.value,
]:
row_span = 1
col_span = 1
right_offset = 1
if text != TableToken.OTSL_ECEL.value:
cell_text = texts[i + 1]
right_offset = 2
# Check the next element(s) for lcel / ucel / xcel and set row_span / col_span accordingly
next_right_cell = ""
if i + right_offset < len(texts):
next_right_cell = texts[i + right_offset]
next_bottom_cell = ""
if r_idx + 1 < len(split_row_tokens):
if c_idx < len(split_row_tokens[r_idx + 1]):
next_bottom_cell = split_row_tokens[r_idx + 1][c_idx]
if next_right_cell in [
TableToken.OTSL_LCEL.value,
TableToken.OTSL_XCEL.value,
]:
# we have a horizontal spanning cell or a 2d spanning cell
col_span += count_right(
split_row_tokens,
c_idx + 1,
r_idx,
[TableToken.OTSL_LCEL.value, TableToken.OTSL_XCEL.value],
)
if next_bottom_cell in [
TableToken.OTSL_UCEL.value,
TableToken.OTSL_XCEL.value,
]:
# we have a vertical spanning cell or 2d spanning cell
row_span += count_down(
split_row_tokens,
c_idx,
r_idx + 1,
[TableToken.OTSL_UCEL.value, TableToken.OTSL_XCEL.value],
)
table_cells.append(
TableCell(
text=cell_text.strip(),
row_span=row_span,
col_span=col_span,
start_row_offset_idx=r_idx,
end_row_offset_idx=r_idx + row_span,
start_col_offset_idx=c_idx,
end_col_offset_idx=c_idx + col_span,
)
)
if text in [
TableToken.OTSL_FCEL.value,
TableToken.OTSL_ECEL.value,
TableToken.OTSL_CHED.value,
TableToken.OTSL_RHED.value,
TableToken.OTSL_SROW.value,
TableToken.OTSL_LCEL.value,
TableToken.OTSL_UCEL.value,
TableToken.OTSL_XCEL.value,
]:
c_idx += 1
if text == TableToken.OTSL_NL.value:
r_idx += 1
c_idx = 0
return table_cells, split_row_tokens
def otsl_extract_tokens_and_text(s: str):
# Pattern to match anything enclosed by < > (including the angle brackets themselves)
pattern = r"(<[^>]+>)"
# Find all tokens (e.g. "<otsl>", "<loc_140>", etc.)
tokens = re.findall(pattern, s)
# Remove any tokens that start with "<loc_"
tokens = [
token
for token in tokens
if not (
token.startswith(rf"<{DocumentToken.LOC.value}")
or token
in [
rf"<{DocumentToken.OTSL.value}>",
rf"</{DocumentToken.OTSL.value}>",
]
)
]
# Split the string by those tokens to get the in-between text
text_parts = re.split(pattern, s)
text_parts = [
token
for token in text_parts
if not (
token.startswith(rf"<{DocumentToken.LOC.value}")
or token
in [
rf"<{DocumentToken.OTSL.value}>",
rf"</{DocumentToken.OTSL.value}>",
]
)
]
# Remove any empty or purely whitespace strings from text_parts
text_parts = [part for part in text_parts if part.strip()]
return tokens, text_parts
def parse_table_content(otsl_content: str) -> TableData:
tokens, mixed_texts = otsl_extract_tokens_and_text(otsl_content)
table_cells, split_row_tokens = otsl_parse_texts(mixed_texts, tokens)
return TableData(
num_rows=len(split_row_tokens),
num_cols=(
max(len(row) for row in split_row_tokens) if split_row_tokens else 0
),
table_cells=table_cells,
)
doc = DoclingDocument(name="Document")
for pg_idx, page in enumerate(pages):
xml_content = ""
predicted_text = ""
if page.predictions.vlm_response:
predicted_text = page.predictions.vlm_response.text
image = page.image
page_no = pg_idx + 1
bounding_boxes = []
if page.size:
pg_width = page.size.width
pg_height = page.size.height
size = Size(width=pg_width, height=pg_height)
parent_page = doc.add_page(page_no=page_no, size=size)
"""
1. Finds all <tag>...</tag> blocks in the entire string (multi-line friendly) in the order they appear.
2. For each chunk, extracts bounding box (if any) and inner text.
3. Adds the item to a DoclingDocument structure with the right label.
4. Tracks bounding boxes + color in a separate list for later visualization.
"""
# Regex for all recognized tags
tag_pattern = (
rf"<(?P<tag>{DocItemLabel.TITLE}|{DocItemLabel.DOCUMENT_INDEX}|"
rf"{DocItemLabel.CHECKBOX_UNSELECTED}|{DocItemLabel.CHECKBOX_SELECTED}|"
rf"{DocItemLabel.TEXT}|{DocItemLabel.PAGE_HEADER}|"
rf"{DocItemLabel.PAGE_FOOTER}|{DocItemLabel.FORMULA}|"
rf"{DocItemLabel.CAPTION}|{DocItemLabel.PICTURE}|"
rf"{DocItemLabel.LIST_ITEM}|{DocItemLabel.FOOTNOTE}|{DocItemLabel.CODE}|"
rf"{DocItemLabel.SECTION_HEADER}_level_1|{DocumentToken.OTSL.value})>.*?</(?P=tag)>"
)
# DocumentToken.OTSL
pattern = re.compile(tag_pattern, re.DOTALL)
# Go through each match in order
for match in pattern.finditer(predicted_text):
full_chunk = match.group(0)
tag_name = match.group("tag")
bbox = extract_bounding_box(full_chunk)
doc_label = tag_to_doclabel.get(tag_name, DocItemLabel.PARAGRAPH)
color = tag_to_color.get(tag_name, "white")
# Store bounding box + color
if bbox:
bounding_boxes.append((bbox, color))
if tag_name == DocumentToken.OTSL.value:
table_data = parse_table_content(full_chunk)
bbox = extract_bounding_box(full_chunk)
if bbox:
prov = ProvenanceItem(
bbox=bbox.resize_by_scale(pg_width, pg_height),
charspan=(0, 0),
page_no=page_no,
)
doc.add_table(data=table_data, prov=prov)
else:
doc.add_table(data=table_data)
elif tag_name == DocItemLabel.PICTURE:
text_caption_content = extract_inner_text(full_chunk)
if image:
if bbox:
im_width, im_height = image.size
crop_box = (
int(bbox.l * im_width),
int(bbox.t * im_height),
int(bbox.r * im_width),
int(bbox.b * im_height),
)
cropped_image = image.crop(crop_box)
pic = doc.add_picture(
parent=None,
image=ImageRef.from_pil(image=cropped_image, dpi=72),
prov=(
ProvenanceItem(
bbox=bbox.resize_by_scale(pg_width, pg_height),
charspan=(0, 0),
page_no=page_no,
)
),
)
# If there is a caption to an image, add it as well
if len(text_caption_content) > 0:
caption_item = doc.add_text(
label=DocItemLabel.CAPTION,
text=text_caption_content,
parent=None,
)
pic.captions.append(caption_item.get_ref())
else:
if bbox:
# In case we don't have access to a binary of the image
pic = doc.add_picture(
parent=None,
prov=ProvenanceItem(
bbox=bbox, charspan=(0, 0), page_no=page_no
),
)
# If there is a caption to an image, add it as well
if len(text_caption_content) > 0:
caption_item = doc.add_text(
label=DocItemLabel.CAPTION,
text=text_caption_content,
parent=None,
)
pic.captions.append(caption_item.get_ref())
else:
# For everything else, treat as text
if self.force_backend_text:
text_content = extract_text_from_backend(page, bbox)
else:
text_content = extract_inner_text(full_chunk)
doc.add_text(
label=doc_label,
text=text_content,
prov=(
ProvenanceItem(
bbox=bbox.resize_by_scale(pg_width, pg_height),
charspan=(0, len(text_content)),
page_no=page_no,
)
if bbox
else None
),
)
return doc
@classmethod
def get_default_options(cls) -> VlmPipelineOptions:
return VlmPipelineOptions()

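The pipeline now delegates DocTags parsing to `DocTagsDocument` from docling-core instead of the hand-rolled `_turn_tags_into_doc` removed above. An illustrative standalone sketch of that assembly step (the DocTags snippet is made up and only meant to show the shape of the call):

```python
# Assemble a DoclingDocument from raw DocTags plus a page image, as the VLM pipeline now does.
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from PIL import Image as PILImage

doctags = "<doctag><text><loc_10><loc_10><loc_400><loc_30>Hello world</text></doctag>"
page_image = PILImage.new("RGB", (640, 480), "white")

doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [page_image])
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
print(doc.export_to_markdown())
```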

@ -154,7 +154,7 @@ def main():
conv_results = doc_converter.convert_all(
input_doc_paths,
raises_on_error=True, # to let conversion run through all and examine results at the end
raises_on_error=False, # to let conversion run through all and examine results at the end
)
success_count, partial_success_count, failure_count = export_documents(
conv_results, output_dir=Path("scratch")


@ -10,13 +10,15 @@ from docling.datamodel.pipeline_options import (
VlmPipelineOptions,
granite_vision_vlm_conversion_options,
smoldocling_vlm_conversion_options,
smoldocling_vlm_mlx_conversion_options,
)
from docling.datamodel.settings import settings
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline
sources = [
"tests/data/2305.03393v1-pg9-img.png",
# "tests/data/2305.03393v1-pg9-img.png",
"tests/data/pdf/2305.03393v1-pg9.pdf",
]
## Use experimental VlmPipeline
@ -29,7 +31,10 @@ pipeline_options.force_backend_text = False
# pipeline_options.accelerator_options.cuda_use_flash_attention2 = True
## Pick a VLM model. We choose SmolDocling-256M by default
pipeline_options.vlm_options = smoldocling_vlm_conversion_options
# pipeline_options.vlm_options = smoldocling_vlm_conversion_options
## Pick a VLM model. Fast, Apple Silicon-friendly implementation of SmolDocling-256M via MLX
pipeline_options.vlm_options = smoldocling_vlm_mlx_conversion_options
## Alternative VLM models:
# pipeline_options.vlm_options = granite_vision_vlm_conversion_options
@ -63,9 +68,6 @@ for source in sources:
res = converter.convert(source)
print("------------------------------------------------")
print("MD:")
print("------------------------------------------------")
print("")
print(res.document.export_to_markdown())
@ -83,8 +85,17 @@ for source in sources:
with (out_path / f"{res.input.file.stem}.json").open("w") as fp:
fp.write(json.dumps(res.document.export_to_dict()))
pg_num = res.document.num_pages()
res.document.save_as_json(
out_path / f"{res.input.file.stem}.json",
image_mode=ImageRefMode.PLACEHOLDER,
)
res.document.save_as_markdown(
out_path / f"{res.input.file.stem}.md",
image_mode=ImageRefMode.PLACEHOLDER,
)
pg_num = res.document.num_pages()
print("")
inference_time = time.time() - start_time
print(

View File

@ -13,6 +13,7 @@
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![License MIT](https://img.shields.io/github/license/docling-project/docling)](https://opensource.org/licenses/MIT)
[![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
[![LF AI & Data](https://img.shields.io/badge/LF%20AI%20%26%20Data-003778?logo=linuxfoundation&logoColor=fff&color=0094ff&labelColor=003778)](https://lfaidata.foundation/projects/)
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
@ -25,12 +26,12 @@ Docling simplifies document processing, parsing diverse formats — including ad
* 🔒 Local execution capabilities for sensitive data and air-gapped environments
* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 Extensive OCR support for scanned PDFs and images
* 🥚 Support of Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview)) 🆕🔥
* 💻 Simple and convenient CLI
### Coming soon
* 📝 Metadata extraction, including title, authors, references & language
* 📝 Inclusion of Visual Language Models ([SmolDocling](https://huggingface.co/blog/smolervlm#smoldocling))
* 📝 Chart understanding (Barchart, Piechart, LinePlot, etc)
* 📝 Complex chemistry understanding (Molecular structures)
@ -43,9 +44,13 @@ Docling simplifies document processing, parsing diverse formats — including ad
<a href="reference/document_converter/" class="card"><b>Reference</b><br />See more API details</a>
</div>
## IBM ❤️ Open Source AI
## LF AI & Data
Docling has been brought to you by IBM.
Docling is hosted as a project in the [LF AI & Data Foundation](https://lfaidata.foundation/projects/).
### IBM ❤️ Open Source AI
The project was started by the AI for knowledge team at IBM Research Zurich.
[supported_formats]: ./usage/supported_formats.md
[docling_document]: ./concepts/docling_document.md

View File

@ -0,0 +1,35 @@
You can run Docling in the cloud without installation using the [Docling Actor][apify] on the Apify platform. Simply provide a document URL and get the processed result:
<a href="https://apify.com/vancura/docling?fpr=docling"><img src="https://apify.com/ext/run-on-apify.png" alt="Run Docling Actor on Apify" width="176" height="39" /></a>
```bash
apify call vancura/docling -i '{
"options": {
"to_formats": ["md", "json", "html", "text", "doctags"]
},
"http_sources": [
{"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
{"url": "https://arxiv.org/pdf/2408.09869"}
]
}'
```
The Actor stores results in:
* Processed document in key-value store (`OUTPUT_RESULT`)
* Processing logs (`DOCLING_LOG`)
* Dataset record with result URL and status
Read more about the [Docling Actor](.actor/README.md), including how to use it via the Apify API and CLI.
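As an informal illustration (not taken from the Actor docs above), the stored `OUTPUT_RESULT` record could afterwards be fetched from the run's default key-value store through the public Apify API; the store ID and token environment variables below are placeholders you would take from your own run and account:

```python
import os

import requests  # third-party HTTP client, used here only for illustration

store_id = os.environ["APIFY_STORE_ID"]  # default key-value store ID of the run
token = os.environ["APIFY_TOKEN"]        # personal Apify API token

url = f"https://api.apify.com/v2/key-value-stores/{store_id}/records/OUTPUT_RESULT"
resp = requests.get(url, params={"token": token}, timeout=60)
resp.raise_for_status()

print(resp.text[:500])  # preview of the processed document
```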
- 💻 [GitHub][github]
- 📖 [Docs][docs]
- 📦 [Docling Actor][apify]
[github]: https://github.com/docling-project/docling/tree/main/.actor/
[docs]: https://github.com/docling-project/docling/tree/main/.actor/README.md
[apify]: https://apify.com/vancura/docling?fpr=docling

View File

@ -17,10 +17,15 @@ print(result.document.export_to_markdown()) # output: "### Docling Technical Re
You can also use Docling directly from your command line to convert individual files (local paths or URLs) or whole directories.
A simple example would look like this:
```console
docling https://arxiv.org/pdf/2206.01062
```
You can also use 🥚[SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview) and other VLMs via the Docling CLI:
```bash
docling --pipeline vlm --vlm-model smoldocling https://arxiv.org/pdf/2206.01062
```
This will use MLX acceleration on supported Apple Silicon hardware.
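For comparison, a rough programmatic equivalent of the CLI call above, sketched from the VLM pipeline example updated elsewhere in this commit; the MLX options require Apple Silicon and the corresponding optional dependencies:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    smoldocling_vlm_mlx_conversion_options,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

pipeline_options = VlmPipelineOptions()
pipeline_options.vlm_options = smoldocling_vlm_mlx_conversion_options

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert("https://arxiv.org/pdf/2206.01062")
print(result.document.export_to_markdown())
```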
To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](../reference/cli.md).

View File

@ -111,6 +111,7 @@ nav:
- "LlamaIndex": integrations/llamaindex.md
- "txtai": integrations/txtai.md
- ⭐️ Featured:
- "Apify": integrations/apify.md
- "Data Prep Kit": integrations/data_prep_kit.md
- "InstructLab": integrations/instructlab.md
- "NVIDIA": integrations/nvidia.md

748
poetry.lock generated

File diff suppressed because it is too large

View File

@ -1,6 +1,6 @@
[tool.poetry]
name = "docling"
version = "2.26.0" # DO NOT EDIT, updated automatically
version = "2.28.0" # DO NOT EDIT, updated automatically
description = "SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications."
authors = [
"Christoph Auer <cau@zurich.ibm.com>",
@ -46,9 +46,9 @@ packages = [{ include = "docling" }]
######################
python = "^3.9"
pydantic = "^2.0.0"
docling-core = {extras = ["chunking"], version = "^2.23.0"}
docling-core = {extras = ["chunking"], version = "^2.23.1"}
docling-ibm-models = "^3.4.0"
docling-parse = "^4.0.0"
docling-parse = {git = "https://github.com/DS4SD/docling-parse", rev = "cau/line-sanitation-update"}
filetype = "^1.2.0"
pypdfium2 = "^4.30.0"
pydantic-settings = "^2.3.0"
@ -88,6 +88,7 @@ accelerate = [
]
pillow = ">=10.0.0,<12.0.0"
tqdm = "^4.65.0"
pluggy = "^1.0.0"
pylatexenc = "^2.10"
[tool.poetry.group.dev.dependencies]
@ -156,6 +157,9 @@ rapidocr = ["rapidocr-onnxruntime", "onnxruntime"]
docling = "docling.cli.main:app"
docling-tools = "docling.cli.tools:app"
[tool.poetry.plugins."docling"]
"docling_defaults" = "docling.models.plugins.defaults"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
@ -188,6 +192,7 @@ module = [
"docling_ibm_models.*",
"easyocr.*",
"ocrmac.*",
"mlx_vlm.*",
"lxml.*",
"huggingface_hub.*",
"transformers.*",

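The new `[tool.poetry.plugins."docling"]` table above registers a package entry point under the `docling` group, presumably consumed through the newly added `pluggy` dependency. Purely as an illustration of how such an entry point can be discovered at runtime (this is a generic sketch, not docling's actual plugin loader):

```python
# Python 3.10+ signature of entry_points(); the group name "docling" matches
# the pyproject.toml table above, everything else here is illustrative.
from importlib.metadata import entry_points

for ep in entry_points(group="docling"):
    plugin = ep.load()  # e.g. the module registered as "docling_defaults"
    print(ep.name, plugin)
```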
View File

@ -1,4 +1,5 @@
<document>
<paragraph><location><page_1><loc_3><loc_75><loc_6><loc_80></location>2022</paragraph>
<subtitle-level-1><location><page_1><loc_16><loc_85><loc_82><loc_86></location>TableFormer: Table Structure Understanding with Transformers.</subtitle-level-1>
<subtitle-level-1><location><page_1><loc_23><loc_78><loc_74><loc_81></location>Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar IBM Research</subtitle-level-1>
<paragraph><location><page_1><loc_34><loc_77><loc_62><loc_78></location>{ ahn,nli,mly,taa @zurich.ibm.com }</paragraph>
@ -42,7 +43,7 @@
<paragraph><location><page_2><loc_10><loc_25><loc_47><loc_29></location>- · We present SynthTabNet a synthetically generated dataset, with various appearance styles and complexity.</paragraph>
<paragraph><location><page_2><loc_10><loc_19><loc_47><loc_24></location>- · An augmented dataset based on PubTabNet [37], FinTabNet [36], and TableBank [17] with generated ground-truth for reproducibility.</paragraph>
<paragraph><location><page_2><loc_8><loc_12><loc_47><loc_18></location>The paper is structured as follows. In Sec. 2, we give a brief overview of the current state-of-the-art. In Sec. 3, we describe the datasets on which we train. In Sec. 4, we introduce the TableFormer model-architecture and describe</paragraph>
<paragraph><location><page_2><loc_50><loc_86><loc_89><loc_90></location>its results & performance in Sec. 5. As a conclusion, we describe how this new model-architecture can be re-purposed for other tasks in the computer-vision community.</paragraph>
<paragraph><location><page_2><loc_50><loc_86><loc_89><loc_90></location>its results &performance in Sec. 5. As a conclusion, we describe how this new model-architecture can be re-purposed for other tasks in the computer-vision community.</paragraph>
<subtitle-level-1><location><page_2><loc_50><loc_83><loc_81><loc_85></location>2. Previous work and State of the Art</subtitle-level-1>
<paragraph><location><page_2><loc_50><loc_58><loc_89><loc_82></location>Identifying the structure of a table has been an outstanding problem in the document-parsing community, that motivates many organised public challenges [6, 4, 14]. The difficulty of the problem can be attributed to a number of factors. First, there is a large variety in the shapes and sizes of tables. Such large variety requires a flexible method. This is especially true for complex column- and row headers, which can be extremely intricate and demanding. A second factor of complexity is the lack of data with regard to table-structure. Until the publication of PubTabNet [37], there were no large datasets (i.e. > 100 K tables) that provided structure information. This happens primarily due to the fact that tables are notoriously time-consuming to annotate by hand. However, this has definitely changed in recent years with the deliverance of PubTabNet [37], FinTabNet [36], TableBank [17] etc.</paragraph>
<paragraph><location><page_2><loc_50><loc_43><loc_89><loc_58></location>Before the rising popularity of deep neural networks, the community relied heavily on heuristic and/or statistical methods to do table structure identification [3, 7, 11, 5, 13, 28]. Although such methods work well on constrained tables [12], a more data-driven approach can be applied due to the advent of convolutional neural networks (CNNs) and the availability of large datasets. To the best-of-our knowledge, there are currently two different types of network architecture that are being pursued for state-of-the-art tablestructure identification.</paragraph>
@ -58,7 +59,7 @@
<caption>Figure 2: Distribution of the tables across different table dimensions in PubTabNet + FinTabNet datasets</caption>
</figure>
<paragraph><location><page_3><loc_50><loc_59><loc_71><loc_60></location>balance in the previous datasets.</paragraph>
<paragraph><location><page_3><loc_50><loc_21><loc_89><loc_58></location>The PubTabNet dataset contains 509k tables delivered as annotated PNG images. The annotations consist of the table structure represented in HTML format, the tokenized text and its bounding boxes per table cell. Fig. 1 shows the appearance style of PubTabNet. Depending on its complexity, a table is characterized as 'simple' when it does not contain row spans or column spans, otherwise it is 'complex'. The dataset is divided into Train and Val splits (roughly 98% and 2%). The Train split consists of 54% simple and 46% complex tables and the Val split of 51% and 49% respectively. The FinTabNet dataset contains 112k tables delivered as single-page PDF documents with mixed table structures and text content. Similarly to the PubTabNet, the annotations of FinTabNet include the table structure in HTML, the tokenized text and the bounding boxes on a table cell basis. The dataset is divided into Train, Test and Val splits (81%, 9.5%, 9.5%), and each one is almost equally divided into simple and complex tables (Train: 48% simple, 52% complex, Test: 48% simple, 52% complex, Test: 53% simple, 47% complex). Finally the TableBank dataset consists of 145k tables provided as JPEG images. The latter has annotations for the table structure, but only few with bounding boxes of the table cells. The entire dataset consists of simple tables and it is divided into 90% Train, 3% Test and 7% Val splits.</paragraph>
<paragraph><location><page_3><loc_50><loc_21><loc_89><loc_58></location>The PubTabNet dataset contains 509k tables delivered as annotated PNGimages. The annotations consist of the table structure represented in HTML format, the tokenized text and its bounding boxes per table cell. Fig. 1 shows the appearance style of PubTabNet. Depending on its complexity, a table is characterized as 'simple' when it does not contain row spans or column spans, otherwise it is 'complex'. The dataset is divided into Train and Val splits (roughly 98%and 2%). The Train split consists of 54% simple and 46% complex tables and the Val split of 51% and 49% respectively. The FinTabNet dataset contains 112k tables delivered as single-page PDFdocuments with mixed table structures and text content. Similarly to the PubTabNet, the annotations of FinTabNet include the table structure in HTML, the tokenized text and the bounding boxes on a table cell basis. The dataset is divided into Train, Test and Val splits (81%, 9.5%, 9.5%), and each one is almost equally divided into simple and complex tables (Train: 48% simple, 52% complex, Test: 48% simple, 52% complex, Test: 53% simple, 47% complex). Finally the TableBank dataset consists of 145k tables provided as JPEG images. The latter has annotations for the table structure, but only few with bounding boxes of the table cells. The entire dataset consists of simple tables and it is divided into 90% Train, 3% Test and 7% Val splits.</paragraph>
<paragraph><location><page_3><loc_50><loc_10><loc_89><loc_20></location>Due to the heterogeneity across the dataset formats, it was necessary to combine all available data into one homogenized dataset before we could train our models for practical purposes. Given the size of PubTabNet, we adopted its annotation format and we extracted and converted all tables as PNG images with a resolution of 72 dpi. Additionally, we have filtered out tables with extreme sizes due to small</paragraph>
<paragraph><location><page_4><loc_8><loc_88><loc_47><loc_90></location>amount of such tables, and kept only those ones ranging between 1*1 and 20*10 (rows/columns).</paragraph>
<paragraph><location><page_4><loc_8><loc_60><loc_47><loc_87></location>The availability of the bounding boxes for all table cells is essential to train our models. In order to distinguish between empty and non-empty bounding boxes, we have introduced a binary class in the annotation. Unfortunately, the original datasets either omit the bounding boxes for whole tables (e.g. TableBank) or they narrow their scope only to non-empty cells. Therefore, it was imperative to introduce a data pre-processing procedure that generates the missing bounding boxes out of the annotation information. This procedure first parses the provided table structure and calculates the dimensions of the most fine-grained grid that covers the table structure. Notice that each table cell may occupy multiple grid squares due to row or column spans. In case of PubTabNet we had to compute missing bounding boxes for 48% of the simple and 69% of the complex tables. Regarding FinTabNet, 68% of the simple and 98% of the complex tables require the generation of bounding boxes.</paragraph>
@ -95,7 +96,7 @@
<paragraph><location><page_5><loc_50><loc_63><loc_89><loc_68></location>forming classification, and adding an adaptive pooling layer of size 28*28. ResNet by default downsamples the image resolution by 32 and then the encoded image is provided to both the Structure Decoder , and Cell BBox Decoder .</paragraph>
<paragraph><location><page_5><loc_50><loc_48><loc_89><loc_62></location>Structure Decoder. The transformer architecture of this component is based on the work proposed in [31]. After extensive experimentation, the Structure Decoder is modeled as a transformer encoder with two encoder layers and a transformer decoder made from a stack of 4 decoder layers that comprise mainly of multi-head attention and feed forward layers. This configuration uses fewer layers and heads in comparison to networks applied to other problems (e.g. 'Scene Understanding', 'Image Captioning'), something which we relate to the simplicity of table images.</paragraph>
<paragraph><location><page_5><loc_50><loc_31><loc_89><loc_47></location>The transformer encoder receives an encoded image from the CNN Backbone Network and refines it through a multi-head dot-product attention layer, followed by a Feed Forward Network. During training, the transformer decoder receives as input the output feature produced by the transformer encoder, and the tokenized input of the HTML ground-truth tags. Using a stack of multi-head attention layers, different aspects of the tag sequence could be inferred. This is achieved by each attention head on a layer operating in a different subspace, and then combining altogether their attention score.</paragraph>
<paragraph><location><page_5><loc_50><loc_18><loc_89><loc_31></location>Cell BBox Decoder. Our architecture allows to simultaneously predict HTML tags and bounding boxes for each table cell without the need of a separate object detector end to end. This approach is inspired by DETR [1] which employs a Transformer Encoder, and Decoder that looks for a specific number of object queries (potential object detections). As our model utilizes a transformer architecture, the hidden state of the < td > ' and ' < ' HTML structure tags become the object query.</paragraph>
<paragraph><location><page_5><loc_50><loc_18><loc_89><loc_31></location>Cell BBox Decoder. Our architecture allows to simultaneously predict HTML tags and bounding boxes for each table cell without the need of a separate object detector end to end. This approach is inspired by DETR [1] which employs a Transformer Encoder, and Decoder that looks for a specific number of object queries (potential object detections). As our model utilizes a transformer architecture, the hidden state of the < td > ' and ' < ' HTMLstructure tags become the object query.</paragraph>
<paragraph><location><page_5><loc_50><loc_10><loc_89><loc_17></location>The encoding generated by the CNN Backbone Network along with the features acquired for every data cell from the Transformer Decoder are then passed to the attention network. The attention network takes both inputs and learns to provide an attention weighted encoding. This weighted at-</paragraph>
<paragraph><location><page_6><loc_8><loc_80><loc_47><loc_90></location>tention encoding is then multiplied to the encoded image to produce a feature for each table cell. Notice that this is different than the typical object detection problem where imbalances between the number of detections and the amount of objects may exist. In our case, we know up front that the produced detections always match with the table cells in number and correspondence.</paragraph>
<paragraph><location><page_6><loc_8><loc_70><loc_47><loc_80></location>The output features for each table cell are then fed into the feed-forward network (FFN). The FFN consists of a Multi-Layer Perceptron (3 layers with ReLU activation function) that predicts the normalized coordinates for the bounding box of each table cell. Finally, the predicted bounding boxes are classified based on whether they are empty or not using a linear layer.</paragraph>
@ -203,7 +204,7 @@
<location><page_8><loc_63><loc_44><loc_89><loc_52></location>
</figure>
<subtitle-level-1><location><page_8><loc_8><loc_37><loc_27><loc_38></location>5.5. Qualitative Analysis</subtitle-level-1>
<subtitle-level-1><location><page_8><loc_50><loc_37><loc_75><loc_38></location>6. Future Work & Conclusion</subtitle-level-1>
<subtitle-level-1><location><page_8><loc_50><loc_37><loc_75><loc_38></location>6. Future Work &Conclusion</subtitle-level-1>
<paragraph><location><page_8><loc_8><loc_10><loc_47><loc_32></location>We showcase several visualizations for the different components of our network on various 'complex' tables within datasets presented in this work in Fig. 5 and Fig. 6 As it is shown, our model is able to predict bounding boxes for all table cells, even for the empty ones. Additionally, our post-processing techniques can extract the cell content by matching the predicted bounding boxes to the PDF cells based on their overlap and spatial proximity. The left part of Fig. 5 demonstrates also the adaptability of our method to any language, as it can successfully extract Japanese text, although the training set contains only English content. We provide more visualizations including the intermediate steps in the supplementary material. Overall these illustrations justify the versatility of our method across a diverse range of table appearances and content type.</paragraph>
<paragraph><location><page_8><loc_50><loc_18><loc_89><loc_35></location>In this paper, we presented TableFormer an end-to-end transformer based approach to predict table structures and bounding boxes of cells from an image. This approach enables us to recreate the table structure, and extract the cell content from PDF or OCR by using bounding boxes. Additionally, it provides the versatility required in real-world scenarios when dealing with various types of PDF documents, and languages. Furthermore, our method outperforms all state-of-the-arts with a wide margin. Finally, we introduce 'SynthTabNet' a challenging synthetically generated dataset that reinforces missing characteristics from other datasets.</paragraph>
<subtitle-level-1><location><page_8><loc_50><loc_14><loc_60><loc_15></location>References</subtitle-level-1>
@ -211,25 +212,25 @@
<paragraph><location><page_9><loc_11><loc_85><loc_47><loc_90></location>- end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 , pages 213-229, Cham, 2020. Springer International Publishing. 5</paragraph>
<paragraph><location><page_9><loc_9><loc_81><loc_47><loc_85></location>- [2] Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling Mao. Complicated table structure recognition. arXiv preprint arXiv:1908.04729 , 2019. 3</paragraph>
<paragraph><location><page_9><loc_9><loc_77><loc_47><loc_81></location>- [3] Bertrand Couasnon and Aurelie Lemaitre. Recognition of Tables and Forms , pages 647-677. Springer London, London, 2014. 2</paragraph>
<paragraph><location><page_9><loc_9><loc_71><loc_47><loc_76></location>- [4] Herv´ e D´ ejean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2</paragraph>
<paragraph><location><page_9><loc_9><loc_71><loc_47><loc_76></location>- [4] Herv´ e D´jean, e Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2</paragraph>
<paragraph><location><page_9><loc_9><loc_66><loc_47><loc_71></location>- [5] Basilios Gatos, Dimitrios Danatsas, Ioannis Pratikakis, and Stavros J Perantonis. Automatic table detection in document images. In International Conference on Pattern Recognition and Image Analysis , pages 609-618. Springer, 2005. 2</paragraph>
<paragraph><location><page_9><loc_9><loc_60><loc_47><loc_65></location>- [6] MaxG¨ obel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2</paragraph>
<paragraph><location><page_9><loc_9><loc_60><loc_47><loc_65></location>- [6] MaxG¨bel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. o Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2</paragraph>
<paragraph><location><page_9><loc_9><loc_56><loc_47><loc_60></location>- [7] EA Green and M Krishnamoorthy. Recognition of tables using table grammars. procs. In Symposium on Document Analysis and Recognition (SDAIR'95) , pages 261-277. 2</paragraph>
<paragraph><location><page_9><loc_9><loc_49><loc_47><loc_56></location>- [8] Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal. Castabdetectors: Cascade network for table detection in document images with recursive feature pyramid and switchable atrous convolution. Journal of Imaging , 7(10), 2021. 1</paragraph>
<paragraph><location><page_9><loc_9><loc_45><loc_47><loc_49></location>- [9] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) , Oct 2017. 1</paragraph>
<paragraph><location><page_9><loc_8><loc_39><loc_47><loc_44></location>- [10] Yelin He, X. Qi, Jiaquan Ye, Peng Gao, Yihao Chen, Bingcong Li, Xin Tang, and Rong Xiao. Pingan-vcgroup's solution for icdar 2021 competition on scientific table image recognition to latex. ArXiv , abs/2105.01846, 2021. 2</paragraph>
<paragraph><location><page_9><loc_8><loc_32><loc_47><loc_39></location>- [11] Jianying Hu, Ramanujan S Kashi, Daniel P Lopresti, and Gordon Wilfong. Medium-independent table detection. In Document Recognition and Retrieval VII , volume 3967, pages 291-302. International Society for Optics and Photonics, 1999. 2</paragraph>
<paragraph><location><page_9><loc_8><loc_25><loc_47><loc_32></location>- [12] Matthew Hurst. A constraint-based approach to table structure derivation. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2 , ICDAR '03, page 911, USA, 2003. IEEE Computer Society. 2</paragraph>
<paragraph><location><page_9><loc_8><loc_18><loc_47><loc_25></location>- [13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ ement Chatelain, and Thierry Paquet. Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2</paragraph>
<paragraph><location><page_9><loc_8><loc_18><loc_47><loc_25></location>- [13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ment Chatelain, and Thierry Paquet. e Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2</paragraph>
<paragraph><location><page_9><loc_8><loc_14><loc_47><loc_18></location>- [14] Pratik Kayal, Mrinal Anand, Harsh Desai, and Mayank Singh. Icdar 2021 competition on scientific table image recognition to latex, 2021. 2</paragraph>
<paragraph><location><page_9><loc_8><loc_10><loc_47><loc_14></location>- [15] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6</paragraph>
<paragraph><location><page_9><loc_8><loc_10><loc_47><loc_14></location>- [15] Harold WKuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6</paragraph>
<paragraph><location><page_9><loc_50><loc_82><loc_89><loc_90></location>- [16] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence , 35(12):2891-2903, 2013. 4</paragraph>
<paragraph><location><page_9><loc_50><loc_78><loc_89><loc_82></location>- [17] Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. Tablebank: A benchmark dataset for table detection and recognition, 2019. 2, 3</paragraph>
<paragraph><location><page_9><loc_50><loc_67><loc_89><loc_78></location>- [18] Yiren Li, Zheng Huang, Junchi Yan, Yi Zhou, Fan Ye, and Xianhui Liu. Gfte: Graph-based financial table extraction. In Alberto Del Bimbo, Rita Cucchiara, Stan Sclaroff, Giovanni Maria Farinella, Tao Mei, Marco Bertini, Hugo Jair Escalante, and Roberto Vezzani, editors, Pattern Recognition. ICPR International Workshops and Challenges , pages 644-658, Cham, 2021. Springer International Publishing. 2, 3</paragraph>
<paragraph><location><page_9><loc_50><loc_59><loc_89><loc_67></location>- [19] Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, and Peter Staar. Robust pdf document conversion using recurrent neural networks. Proceedings of the AAAI Conference on Artificial Intelligence , 35(17):15137-15145, May 2021. 1</paragraph>
<paragraph><location><page_9><loc_50><loc_53><loc_89><loc_58></location>- [20] Rujiao Long, Wen Wang, Nan Xue, Feiyu Gao, Zhibo Yang, Yongpan Wang, and Gui-Song Xia. Parsing table structures in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 944-952, 2021. 2</paragraph>
<paragraph><location><page_9><loc_50><loc_45><loc_89><loc_53></location>- [21] Shubham Singh Paliwal, D Vishwanath, Rohit Rahul, Monika Sharma, and Lovekesh Vig. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 128-133. IEEE, 2019. 1</paragraph>
<paragraph><location><page_9><loc_50><loc_30><loc_89><loc_44></location>- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6</paragraph>
<paragraph><location><page_9><loc_50><loc_30><loc_89><loc_44></location>- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´-Buc, E. e Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6</paragraph>
<paragraph><location><page_9><loc_50><loc_21><loc_89><loc_29></location>- [23] Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. Cascadetabnet: An approach for end to end table detection and structure recognition from image-based documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages 572-573, 2020. 1</paragraph>
<paragraph><location><page_9><loc_50><loc_16><loc_89><loc_21></location>- [24] Shah Rukh Qasim, Hassan Mahmood, and Faisal Shafait. Rethinking table recognition using graph neural networks. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 142-147. IEEE, 2019. 3</paragraph>
<paragraph><location><page_9><loc_50><loc_10><loc_89><loc_15></location>- [25] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on</paragraph>
@ -238,7 +239,7 @@
<paragraph><location><page_10><loc_8><loc_71><loc_47><loc_79></location>- [27] Sebastian Schreiber, Stefan Agne, Ivo Wolf, Andreas Dengel, and Sheraz Ahmed. Deepdesrt: Deep learning for detection and structure recognition of tables in document images. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR) , volume 1, pages 1162-1167. IEEE, 2017. 3</paragraph>
<paragraph><location><page_10><loc_8><loc_66><loc_47><loc_71></location>- [28] Faisal Shafait and Ray Smith. Table detection in heterogeneous documents. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems , pages 6572, 2010. 2</paragraph>
<paragraph><location><page_10><loc_8><loc_59><loc_47><loc_65></location>- [29] Shoaib Ahmed Siddiqui, Imran Ali Fateh, Syed Tahseen Raza Rizvi, Andreas Dengel, and Sheraz Ahmed. Deeptabstr: Deep learning based table structure recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 1403-1409. IEEE, 2019. 3</paragraph>
<paragraph><location><page_10><loc_8><loc_52><loc_47><loc_58></location>- [30] Peter W J Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD , KDD '18, pages 774-782, New York, NY, USA, 2018. ACM. 1</paragraph>
<paragraph><location><page_10><loc_8><loc_52><loc_47><loc_58></location>- [30] Peter WJ Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD , KDD '18, pages 774-782, New York, NY, USA, 2018. ACM. 1</paragraph>
<paragraph><location><page_10><loc_8><loc_42><loc_47><loc_51></location>- [31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30 , pages 5998-6008. Curran Associates, Inc., 2017. 5</paragraph>
<paragraph><location><page_10><loc_8><loc_37><loc_47><loc_42></location>- [32] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , June 2015. 2</paragraph>
<paragraph><location><page_10><loc_8><loc_31><loc_47><loc_36></location>- [33] Wenyuan Xue, Qingyong Li, and Dacheng Tao. Res2tim: reconstruct syntactic structures from table images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 749-755. IEEE, 2019. 3</paragraph>
@ -251,7 +252,7 @@
<subtitle-level-1><location><page_11><loc_22><loc_83><loc_76><loc_86></location>TableFormer: Table Structure Understanding with Transformers Supplementary Material</subtitle-level-1>
<subtitle-level-1><location><page_11><loc_8><loc_78><loc_29><loc_80></location>1. Details on the datasets</subtitle-level-1>
<subtitle-level-1><location><page_11><loc_8><loc_76><loc_25><loc_77></location>1.1. Data preparation</subtitle-level-1>
<paragraph><location><page_11><loc_8><loc_51><loc_47><loc_75></location>As a first step of our data preparation process, we have calculated statistics over the datasets across the following dimensions: (1) table size measured in the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes). A table is considered to be simple if it does not contain row spans or column spans. Additionally, a table has a strict HTML structure if every row has the same number of columns after taking into account any row or column spans. Therefore a strict HTML structure looks always rectangular. However, HTML is a lenient encoding format, i.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. These implicit rules leave room for ambiguity, which we want to avoid. As such, we prefer to have 'strict' tables, i.e. tables where every row has exactly the same length.</paragraph>
<paragraph><location><page_11><loc_8><loc_51><loc_47><loc_75></location>As a first step of our data preparation process, we have calculated statistics over the datasets across the following dimensions: (1) table size measured in the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes). Atable is considered to be simple if it does not contain row spans or column spans. Additionally, a table has a strict HTMLstructure if every row has the same number of columns after taking into account any row or column spans. Therefore a strict HTML structure looks always rectangular. However, HTML is a lenient encoding format, i.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. These implicit rules leave room for ambiguity, which we want to avoid. As such, we prefer to have 'strict' tables, i.e. tables where every row has exactly the same length.</paragraph>
<paragraph><location><page_11><loc_8><loc_21><loc_47><loc_51></location>We have developed a technique that tries to derive a missing bounding box out of its neighbors. As a first step, we use the annotation data to generate the most fine-grained grid that covers the table structure. In case of strict HTML tables, all grid squares are associated with some table cell and in the presence of table spans a cell extends across multiple grid squares. When enough bounding boxes are known for a rectangular table, it is possible to compute the geometrical border lines between the grid rows and columns. Eventually this information is used to generate the missing bounding boxes. Additionally, the existence of unused grid squares indicates that the table rows have unequal number of columns and the overall structure is non-strict. The generation of missing bounding boxes for non-strict HTML tables is ambiguous and therefore quite challenging. Thus, we have decided to simply discard those tables. In case of PubTabNet we have computed missing bounding boxes for 48% of the simple and 69% of the complex tables. Regarding FinTabNet, 68% of the simple and 98% of the complex tables require the generation of bounding boxes.</paragraph>
<paragraph><location><page_11><loc_8><loc_18><loc_47><loc_20></location>Figure 7 illustrates the distribution of the tables across different dimensions per dataset.</paragraph>
<subtitle-level-1><location><page_11><loc_8><loc_15><loc_25><loc_16></location>1.2. Synthetic datasets</subtitle-level-1>

File diff suppressed because one or more lines are too long

View File

@ -1,3 +1,5 @@
2022
## TableFormer: Table Structure Understanding with Transformers.
## Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar IBM Research
@ -54,7 +56,7 @@ To meet the design criteria listed above, we developed a new model called TableF
The paper is structured as follows. In Sec. 2, we give a brief overview of the current state-of-the-art. In Sec. 3, we describe the datasets on which we train. In Sec. 4, we introduce the TableFormer model-architecture and describe
its results & performance in Sec. 5. As a conclusion, we describe how this new model-architecture can be re-purposed for other tasks in the computer-vision community.
its results &performance in Sec. 5. As a conclusion, we describe how this new model-architecture can be re-purposed for other tasks in the computer-vision community.
## 2. Previous work and State of the Art
@ -81,7 +83,7 @@ Figure 2: Distribution of the tables across different table dimensions in PubTab
balance in the previous datasets.
The PubTabNet dataset contains 509k tables delivered as annotated PNG images. The annotations consist of the table structure represented in HTML format, the tokenized text and its bounding boxes per table cell. Fig. 1 shows the appearance style of PubTabNet. Depending on its complexity, a table is characterized as 'simple' when it does not contain row spans or column spans, otherwise it is 'complex'. The dataset is divided into Train and Val splits (roughly 98% and 2%). The Train split consists of 54% simple and 46% complex tables and the Val split of 51% and 49% respectively. The FinTabNet dataset contains 112k tables delivered as single-page PDF documents with mixed table structures and text content. Similarly to the PubTabNet, the annotations of FinTabNet include the table structure in HTML, the tokenized text and the bounding boxes on a table cell basis. The dataset is divided into Train, Test and Val splits (81%, 9.5%, 9.5%), and each one is almost equally divided into simple and complex tables (Train: 48% simple, 52% complex, Test: 48% simple, 52% complex, Test: 53% simple, 47% complex). Finally the TableBank dataset consists of 145k tables provided as JPEG images. The latter has annotations for the table structure, but only few with bounding boxes of the table cells. The entire dataset consists of simple tables and it is divided into 90% Train, 3% Test and 7% Val splits.
The PubTabNet dataset contains 509k tables delivered as annotated PNGimages. The annotations consist of the table structure represented in HTML format, the tokenized text and its bounding boxes per table cell. Fig. 1 shows the appearance style of PubTabNet. Depending on its complexity, a table is characterized as 'simple' when it does not contain row spans or column spans, otherwise it is 'complex'. The dataset is divided into Train and Val splits (roughly 98%and 2%). The Train split consists of 54% simple and 46% complex tables and the Val split of 51% and 49% respectively. The FinTabNet dataset contains 112k tables delivered as single-page PDFdocuments with mixed table structures and text content. Similarly to the PubTabNet, the annotations of FinTabNet include the table structure in HTML, the tokenized text and the bounding boxes on a table cell basis. The dataset is divided into Train, Test and Val splits (81%, 9.5%, 9.5%), and each one is almost equally divided into simple and complex tables (Train: 48% simple, 52% complex, Test: 48% simple, 52% complex, Test: 53% simple, 47% complex). Finally the TableBank dataset consists of 145k tables provided as JPEG images. The latter has annotations for the table structure, but only few with bounding boxes of the table cells. The entire dataset consists of simple tables and it is divided into 90% Train, 3% Test and 7% Val splits.
Due to the heterogeneity across the dataset formats, it was necessary to combine all available data into one homogenized dataset before we could train our models for practical purposes. Given the size of PubTabNet, we adopted its annotation format and we extracted and converted all tables as PNG images with a resolution of 72 dpi. Additionally, we have filtered out tables with extreme sizes due to small
@ -132,7 +134,7 @@ Structure Decoder. The transformer architecture of this component is based on th
The transformer encoder receives an encoded image from the CNN Backbone Network and refines it through a multi-head dot-product attention layer, followed by a Feed Forward Network. During training, the transformer decoder receives as input the output feature produced by the transformer encoder, and the tokenized input of the HTML ground-truth tags. Using a stack of multi-head attention layers, different aspects of the tag sequence could be inferred. This is achieved by each attention head on a layer operating in a different subspace, and then combining altogether their attention score.
Cell BBox Decoder. Our architecture allows to simultaneously predict HTML tags and bounding boxes for each table cell without the need of a separate object detector end to end. This approach is inspired by DETR [1] which employs a Transformer Encoder, and Decoder that looks for a specific number of object queries (potential object detections). As our model utilizes a transformer architecture, the hidden state of the < td > ' and ' < ' HTML structure tags become the object query.
Cell BBox Decoder. Our architecture allows to simultaneously predict HTML tags and bounding boxes for each table cell without the need of a separate object detector end to end. This approach is inspired by DETR [1] which employs a Transformer Encoder, and Decoder that looks for a specific number of object queries (potential object detections). As our model utilizes a transformer architecture, the hidden state of the < td > ' and ' < ' HTMLstructure tags become the object query.
The encoding generated by the CNN Backbone Network along with the features acquired for every data cell from the Transformer Decoder are then passed to the attention network. The attention network takes both inputs and learns to provide an attention weighted encoding. This weighted at-
@ -260,7 +262,7 @@ Figure 5: One of the benefits of TableFormer is that it is language agnostic, as
## 5.5. Qualitative Analysis
## 6. Future Work & Conclusion
## 6. Future Work &Conclusion
We showcase several visualizations for the different components of our network on various 'complex' tables within datasets presented in this work in Fig. 5 and Fig. 6 As it is shown, our model is able to predict bounding boxes for all table cells, even for the empty ones. Additionally, our post-processing techniques can extract the cell content by matching the predicted bounding boxes to the PDF cells based on their overlap and spatial proximity. The left part of Fig. 5 demonstrates also the adaptability of our method to any language, as it can successfully extract Japanese text, although the training set contains only English content. We provide more visualizations including the intermediate steps in the supplementary material. Overall these illustrations justify the versatility of our method across a diverse range of table appearances and content type.
@ -276,11 +278,11 @@ In this paper, we presented TableFormer an end-to-end transformer based approach
- [3] Bertrand Couasnon and Aurelie Lemaitre. Recognition of Tables and Forms , pages 647-677. Springer London, London, 2014. 2
- [4] Herv´ e D´ ejean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2
- [4] Herv´ e D´jean, e Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2
- [5] Basilios Gatos, Dimitrios Danatsas, Ioannis Pratikakis, and Stavros J Perantonis. Automatic table detection in document images. In International Conference on Pattern Recognition and Image Analysis , pages 609-618. Springer, 2005. 2
- [6] MaxG¨ obel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2
- [6] MaxG¨bel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. o Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2
- [7] EA Green and M Krishnamoorthy. Recognition of tables using table grammars. procs. In Symposium on Document Analysis and Recognition (SDAIR'95) , pages 261-277. 2
@ -294,11 +296,11 @@ In this paper, we presented TableFormer an end-to-end transformer based approach
- [12] Matthew Hurst. A constraint-based approach to table structure derivation. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2 , ICDAR '03, page 911, USA, 2003. IEEE Computer Society. 2
- [13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ ement Chatelain, and Thierry Paquet. Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2
- [13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ment Chatelain, and Thierry Paquet. e Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2
- [14] Pratik Kayal, Mrinal Anand, Harsh Desai, and Mayank Singh. Icdar 2021 competition on scientific table image recognition to latex, 2021. 2
- [15] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6
- [15] Harold WKuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6
- [16] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence , 35(12):2891-2903, 2013. 4
@ -312,7 +314,7 @@ In this paper, we presented TableFormer an end-to-end transformer based approach
- [21] Shubham Singh Paliwal, D Vishwanath, Rohit Rahul, Monika Sharma, and Lovekesh Vig. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 128-133. IEEE, 2019. 1
- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6
- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´-Buc, E. e Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6
- [23] Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. Cascadetabnet: An approach for end to end table detection and structure recognition from image-based documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages 572-573, 2020. 1
@ -330,7 +332,7 @@ Computer Vision and Pattern Recognition , pages 658-666, 2019. 6
- [29] Shoaib Ahmed Siddiqui, Imran Ali Fateh, Syed Tahseen Raza Rizvi, Andreas Dengel, and Sheraz Ahmed. Deeptabstr: Deep learning based table structure recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 1403-1409. IEEE, 2019. 3
- [30] Peter W J Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD , KDD '18, pages 774-782, New York, NY, USA, 2018. ACM. 1
- [30] Peter WJ Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD , KDD '18, pages 774-782, New York, NY, USA, 2018. ACM. 1
- [31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30 , pages 5998-6008. Curran Associates, Inc., 2017. 5
@ -356,7 +358,7 @@ Computer Vision and Pattern Recognition , pages 658-666, 2019. 6
## 1.1. Data preparation
As a first step of our data preparation process, we have calculated statistics over the datasets across the following dimensions: (1) table size measured in the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes). A table is considered to be simple if it does not contain row spans or column spans. Additionally, a table has a strict HTML structure if every row has the same number of columns after taking into account any row or column spans. Therefore a strict HTML structure looks always rectangular. However, HTML is a lenient encoding format, i.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. These implicit rules leave room for ambiguity, which we want to avoid. As such, we prefer to have 'strict' tables, i.e. tables where every row has exactly the same length.
As a first step of our data preparation process, we have calculated statistics over the datasets across the following dimensions: (1) table size measured in the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes). Atable is considered to be simple if it does not contain row spans or column spans. Additionally, a table has a strict HTMLstructure if every row has the same number of columns after taking into account any row or column spans. Therefore a strict HTML structure looks always rectangular. However, HTML is a lenient encoding format, i.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. These implicit rules leave room for ambiguity, which we want to avoid. As such, we prefer to have 'strict' tables, i.e. tables where every row has exactly the same length.
We have developed a technique that tries to derive a missing bounding box out of its neighbors. As a first step, we use the annotation data to generate the most fine-grained grid that covers the table structure. In case of strict HTML tables, all grid squares are associated with some table cell and in the presence of table spans a cell extends across multiple grid squares. When enough bounding boxes are known for a rectangular table, it is possible to compute the geometrical border lines between the grid rows and columns. Eventually this information is used to generate the missing bounding boxes. Additionally, the existence of unused grid squares indicates that the table rows have unequal number of columns and the overall structure is non-strict. The generation of missing bounding boxes for non-strict HTML tables is ambiguous and therefore quite challenging. Thus, we have decided to simply discard those tables. In case of PubTabNet we have computed missing bounding boxes for 48% of the simple and 69% of the complex tables. Regarding FinTabNet, 68% of the simple and 98% of the complex tables require the generation of bounding boxes.

File diff suppressed because one or more lines are too long

View File

@ -1,4 +1,5 @@
<document>
<paragraph><location><page_1><loc_3><loc_74><loc_6><loc_79></location>2022</paragraph>
<subtitle-level-1><location><page_1><loc_18><loc_85><loc_83><loc_89></location>DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis</subtitle-level-1>
<paragraph><location><page_1><loc_15><loc_77><loc_32><loc_83></location>Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com</paragraph>
<paragraph><location><page_1><loc_42><loc_77><loc_58><loc_83></location>Christoph Auer IBM Research Rueschlikon, Switzerland cau@zurich.ibm.com</paragraph>
@ -24,7 +25,7 @@
<paragraph><location><page_1><loc_52><loc_11><loc_91><loc_18></location>Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar. 2022. DocLayNet: A Large Human-Annotated Dataset for DocumentLayout Analysis. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), August 14-18, 2022, Washington, DC, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/ 3534678.3539043</paragraph>
<subtitle-level-1><location><page_2><loc_9><loc_88><loc_26><loc_89></location>1 INTRODUCTION</subtitle-level-1>
<paragraph><location><page_2><loc_9><loc_71><loc_50><loc_86></location>Despite the substantial improvements achieved with machine-learning (ML) approaches and deep neural networks in recent years, document conversion remains a challenging problem, as demonstrated by the numerous public competitions held on this topic [1-4]. The challenge originates from the huge variability in PDF documents regarding layout, language and formats (scanned, programmatic or a combination of both). Engineering a single ML model that can be applied on all types of documents and provides high-quality layout segmentation remains to this day extremely challenging [5]. To highlight the variability in document layouts, we show a few example documents from the DocLayNet dataset in Figure 1.</paragraph>
<paragraph><location><page_2><loc_9><loc_37><loc_48><loc_71></location>Akeyproblem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or L A T E X sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank is very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.</paragraph>
<paragraph><location><page_2><loc_9><loc_37><loc_48><loc_71></location>Akeyproblem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or LT E X A sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank is very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.</paragraph>
<paragraph><location><page_2><loc_9><loc_27><loc_48><loc_36></location>In this paper, we present the DocLayNet dataset. It provides pageby-page layout annotation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique document pages, of which a fraction carry double- or triple-annotations. DocLayNet is similar in spirit to PubLayNet and DocBank and will likewise be made available to the public 1 in order to stimulate the document-layout analysis community. It distinguishes itself in the following aspects:</paragraph>
<paragraph><location><page_2><loc_11><loc_22><loc_48><loc_26></location>- (1) Human Annotation : In contrast to PubLayNet and DocBank, we relied on human annotation instead of automation approaches to generate the data set.</paragraph>
<paragraph><location><page_2><loc_11><loc_20><loc_48><loc_22></location>- (2) Large Layout Variability : We include diverse and complex layouts from a large variety of public sources.</paragraph>
@ -35,10 +36,10 @@
<paragraph><location><page_2><loc_52><loc_72><loc_91><loc_79></location>All aspects outlined above are detailed in Section 3. In Section 4, we will elaborate on how we designed and executed this large-scale human annotation campaign. We will also share key insights and lessons learned that might prove helpful for other parties planning to set up annotation campaigns.</paragraph>
<paragraph><location><page_2><loc_52><loc_61><loc_91><loc_72></location>In Section 5, we will present baseline accuracy numbers for a variety of object detection methods (Faster R-CNN, Mask R-CNN and YOLOv5) trained on DocLayNet. We further show how the model performance is impacted by varying the DocLayNet dataset size, reducing the label set and modifying the train/test-split. Last but not least, we compare the performance of models trained on PubLayNet, DocBank and DocLayNet and demonstrate that a model trained on DocLayNet provides overall more robust layout recovery.</paragraph>
<subtitle-level-1><location><page_2><loc_52><loc_58><loc_69><loc_59></location>2 RELATED WORK</subtitle-level-1>
<paragraph><location><page_2><loc_52><loc_41><loc_91><loc_56></location>While early approaches in document-layout analysis used rulebased algorithms and heuristics [8], the problem is lately addressed with deep learning methods. The most common approach is to leverage object detection models [9-15]. In the last decade, the accuracy and speed of these models has increased dramatically. Furthermore, most state-of-the-art object detection methods can be trained and applied with very little work, thanks to a standardisation effort of the ground-truth data format [16] and common deep-learning frameworks [17]. Reference data sets such as PubLayNet [6] and DocBank provide their data in the commonly accepted COCO format [16].</paragraph>
<paragraph><location><page_2><loc_52><loc_41><loc_91><loc_56></location>While early approaches in document-layout analysis used rulebased algorithms and heuristics [8], the problem is lately addressed with deep learning methods. The most commonapproach is to leverage object detection models [9-15]. In the last decade, the accuracy and speed of these models has increased dramatically. Furthermore, most state-of-the-art object detection methods can be trained and applied with very little work, thanks to a standardisation effort of the ground-truth data format [16] and common deep-learning frameworks [17]. Reference data sets such as PubLayNet [6] and DocBank provide their data in the commonly accepted COCO format [16].</paragraph>
<paragraph><location><page_2><loc_52><loc_30><loc_91><loc_41></location>Lately, new types of ML models for document-layout analysis have emerged in the community [18-21]. These models do not approach the problem of layout analysis purely based on an image representation of the page, as computer vision methods do. Instead, they combine the text tokens and image representation of a page in order to obtain a segmentation. While the reported accuracies appear to be promising, a broadly accepted data format which links geometric and textual features has yet to establish.</paragraph>
<subtitle-level-1><location><page_2><loc_52><loc_27><loc_78><loc_29></location>3 THE DOCLAYNET DATASET</subtitle-level-1>
<paragraph><location><page_2><loc_52><loc_15><loc_91><loc_25></location>DocLayNet contains 80863 PDF pages. Among these, 7059 carry two instances of human annotations, and 1591 carry three. This amounts to 91104 total annotation instances. The annotations provide layout information in the shape of labeled, rectangular boundingboxes. We define 11 distinct labels for layout features, namely Caption , Footnote , Formula List-item , , Page-footer , Page-header , Picture , Section-header , Table , Text , and Title . Our reasoning for picking this particular label set is detailed in Section 4.</paragraph>
<paragraph><location><page_2><loc_52><loc_15><loc_91><loc_25></location>DocLayNet contains 80863 PDF pages. Amongthese, 7059 carry two instances of humanannotations, and 1591 carry three. This amounts to 91104 total annotation instances. The annotations provide layout information in the shape of labeled, rectangular boundingboxes. We define 11 distinct labels for layout features, namely Caption , Footnote , Formula List-item , , Page-footer , Page-header , Picture , Section-header , Table , Text , and Title . Our reasoning for picking this particular label set is detailed in Section 4.</paragraph>
<paragraph><location><page_2><loc_52><loc_11><loc_91><loc_14></location>In addition to open intellectual property constraints for the source documents, we required that the documents in DocLayNet adhere to a few conditions. Firstly, we kept scanned documents</paragraph>
<figure>
<location><page_3><loc_14><loc_72><loc_43><loc_88></location>
@ -56,7 +57,7 @@
<table>
<location><page_4><loc_16><loc_63><loc_84><loc_83></location>
<caption>Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row 'Total') in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.</caption>
<row_0><col_0><body></col_0><col_1><body></col_1><col_2><col_header>% of Total</col_2><col_3><col_header>% of Total</col_3><col_4><col_header>% of Total</col_4><col_5><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_5><col_6><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_6><col_7><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_7><col_8><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_8><col_9><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_9><col_10><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_10><col_11><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_11></row_0>
<row_0><col_0><body></col_0><col_1><body></col_1><col_2><col_header>% of Total</col_2><col_3><col_header>% of Total</col_3><col_4><col_header>% of Total</col_4><col_5><col_header>triple inter-annotator mAP @0.5-0.95 (%)</col_5><col_6><col_header>triple inter-annotator mAP @0.5-0.95 (%)</col_6><col_7><col_header>triple inter-annotator mAP @0.5-0.95 (%)</col_7><col_8><col_header>triple inter-annotator mAP @0.5-0.95 (%)</col_8><col_9><col_header>triple inter-annotator mAP @0.5-0.95 (%)</col_9><col_10><col_header>triple inter-annotator mAP @0.5-0.95 (%)</col_10><col_11><col_header>triple inter-annotator mAP @0.5-0.95 (%)</col_11></row_0>
<row_1><col_0><col_header>class label</col_0><col_1><col_header>Count</col_1><col_2><col_header>Train</col_2><col_3><col_header>Test</col_3><col_4><col_header>Val</col_4><col_5><col_header>All</col_5><col_6><col_header>Fin</col_6><col_7><col_header>Man</col_7><col_8><col_header>Sci</col_8><col_9><col_header>Law</col_9><col_10><col_header>Pat</col_10><col_11><col_header>Ten</col_11></row_1>
<row_2><col_0><row_header>Caption</col_0><col_1><body>22524</col_1><col_2><body>2.04</col_2><col_3><body>1.77</col_3><col_4><body>2.32</col_4><col_5><body>84-89</col_5><col_6><body>40-61</col_6><col_7><body>86-92</col_7><col_8><body>94-99</col_8><col_9><body>95-99</col_9><col_10><body>69-78</col_10><col_11><body>n/a</col_11></row_2>
<row_3><col_0><row_header>Footnote</col_0><col_1><body>6318</col_1><col_2><body>0.60</col_2><col_3><body>0.31</col_3><col_4><body>0.58</col_4><col_5><body>83-91</col_5><col_6><body>n/a</col_6><col_7><body>100</col_7><col_8><body>62-88</col_8><col_9><body>85-94</col_9><col_10><body>n/a</col_10><col_11><body>82-97</col_11></row_3>
@ -77,9 +78,9 @@
<caption>Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.</caption>
</figure>
<paragraph><location><page_4><loc_9><loc_15><loc_48><loc_20></location>we distributed the annotation workload and performed continuous quality controls. Phase one and two required a small team of experts only. For phases three and four, a group of 40 dedicated annotators were assembled and supervised.</paragraph>
<paragraph><location><page_4><loc_9><loc_11><loc_48><loc_14></location><location><page_4><loc_9><loc_11><loc_48><loc_14></location>Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. A large effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv 3 , government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.</paragraph>
<paragraph><location><page_4><loc_9><loc_11><loc_48><loc_14></location><location><page_4><loc_9><loc_11><loc_48><loc_14></location>Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. Alarge effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv 3 , government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.</paragraph>
<paragraph><location><page_4><loc_52><loc_36><loc_91><loc_52></location>Preparation work included uploading and parsing the sourced PDF documents in the Corpus Conversion Service (CCS) [22], a cloud-native platform which provides a visual annotation interface and allows for dataset inspection and analysis. The annotation interface of CCS is shown in Figure 3. The desired balance of pages between the different document categories was achieved by selective subsampling of pages with certain desired properties. For example, we made sure to include the title page of each document and bias the remaining page selection to those with figures or tables. The latter was achieved by leveraging pre-trained object detection models from PubLayNet, which helped us estimate how many figures and tables a given page contains.</paragraph>
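The selective subsampling described above can be pictured with a rough sketch; the field names (`is_title_page`, `predicted_figures_tables`) are hypothetical stand-ins for whatever page metadata a PubLayNet-pretrained detector would yield, and this is not the actual selection pipeline.

```python
# Sketch of biased page selection: always keep title pages, then draw the
# remaining pages with weights proportional to the predicted number of
# figures/tables (weighted sampling without replacement, Efraimidis-Spirakis).

import random

def select_pages(pages, n_samples, seed=0):
    """pages: list of dicts with 'is_title_page' (bool) and
    'predicted_figures_tables' (int)."""
    rng = random.Random(seed)
    selected = [p for p in pages if p["is_title_page"]]
    pool = [p for p in pages if not p["is_title_page"]]
    k = max(0, n_samples - len(selected))
    keyed = sorted(
        pool,
        key=lambda p: rng.random() ** (1.0 / (1 + p["predicted_figures_tables"])),
        reverse=True,
    )
    selected += keyed[:k]
    return selected
```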
<paragraph><location><page_4><loc_52><loc_13><loc_91><loc_36></location>Phase 2: Label selection and guideline. We reviewed the collected documents and identified the most common structural features they exhibit. This was achieved by identifying recurrent layout elements and lead us to the definition of 11 distinct class labels. These 11 class labels are Caption , Footnote , Formula List-item , , Pagefooter , Page-header , Picture , Section-header , Table , Text , and Title . Critical factors that were considered for the choice of these class labels were (1) the overall occurrence of the label, (2) the specificity of the label, (3) recognisability on a single page (i.e. no need for context from previous or next page) and (4) overall coverage of the page. Specificity ensures that the choice of label is not ambiguous, while coverage ensures that all meaningful items on a page can be annotated. We refrained from class labels that are very specific to a document category, such as Abstract in the Scientific Articles category. We also avoided class labels that are tightly linked to the semantics of the text. Labels such as Author and Affiliation , as seen in DocBank, are often only distinguishable by discriminating on</paragraph>
<paragraph><location><page_4><loc_52><loc_13><loc_91><loc_36></location>Phase 2: Label selection and guideline. We reviewed the collected documents and identified the most common structural features they exhibit. This wasachieved by identifying recurrent layout elements and lead us to the definition of 11 distinct class labels. These 11 class labels are Caption , Footnote , Formula List-item , , Pagefooter , Page-header , Picture , Section-header , Table , Text , and Title . Critical factors that were considered for the choice of these class labels were (1) the overall occurrence of the label, (2) the specificity of the label, (3) recognisability on a single page (i.e. no need for context from previous or next page) and (4) overall coverage of the page. Specificity ensures that the choice of label is not ambiguous, while coverage ensures that all meaningful items on a page can be annotated. We refrained from class labels that are very specific to a document category, such as Abstract in the Scientific Articles category. We also avoided class labels that are tightly linked to the semantics of the text. Labels such as Author and Affiliation , as seen in DocBank, are often only distinguishable by discriminating on</paragraph>
<paragraph><location><page_5><loc_9><loc_87><loc_48><loc_89></location>the textual content of an element, which goes beyond visual layout recognition, in particular outside the Scientific Articles category.</paragraph>
<paragraph><location><page_5><loc_9><loc_69><loc_48><loc_86></location>At first sight, the task of visual document-layout interpretation appears intuitive enough to obtain plausible annotations in most cases. However, during early trial-runs in the core team, we observed many cases in which annotators use different annotation styles, especially for documents with challenging layouts. For example, if a figure is presented with subfigures, one annotator might draw a single figure bounding-box, while another might annotate each subfigure separately. The same applies for lists, where one might annotate all list items in one block or each list item separately. In essence, we observed that challenging layouts would be annotated in different but plausible ways. To illustrate this, we show in Figure 4 multiple examples of plausible but inconsistent annotations on the same pages.</paragraph>
<paragraph><location><page_5><loc_9><loc_57><loc_48><loc_68></location>Obviously, this inconsistency in annotations is not desirable for datasets which are intended to be used for model training. To minimise these inconsistencies, we created a detailed annotation guideline. While perfect consistency across 40 annotation staff members is clearly not possible to achieve, we saw a huge improvement in annotation consistency after the introduction of our annotation guideline. A few selected, non-trivial highlights of the guideline are:</paragraph>
@ -97,8 +98,8 @@
<paragraph><location><page_5><loc_65><loc_42><loc_78><loc_42></location>05237a14f2524e3f53c8454b074409d05078038a6a36b770fcc8ec7e540deae0</paragraph>
<caption><location><page_5><loc_52><loc_36><loc_91><loc_40></location>Figure 4: Examples of plausible annotation alternatives for the same page. Criteria in our annotation guideline can resolve cases A to C, while the case D remains ambiguous.</caption>
<paragraph><location><page_5><loc_52><loc_31><loc_91><loc_34></location>were carried out over a timeframe of 12 weeks, after which 8 of the 40 initially allocated annotators did not pass the bar.</paragraph>
<paragraph><location><page_5><loc_52><loc_10><loc_91><loc_31></location>Phase 4: Production annotation. The previously selected 80K pages were annotated with the defined 11 class labels by 32 annotators. This production phase took around three months to complete. All annotations were created online through CCS, which visualises the programmatic PDF text-cells as an overlay on the page. The page annotation are obtained by drawing rectangular bounding-boxes, as shown in Figure 3. With regard to the annotation practices, we implemented a few constraints and capabilities on the tooling level. First, we only allow non-overlapping, vertically oriented, rectangular boxes. For the large majority of documents, this constraint was sufficient and it speeds up the annotation considerably in comparison with arbitrary segmentation shapes. Second, annotator staff were not able to see each other's annotations. This was enforced by design to avoid any bias in the annotation, which could skew the numbers of the inter-annotator agreement (see Table 1). We wanted</paragraph>
<paragraph><location><page_6><loc_9><loc_77><loc_48><loc_89></location>Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on DocLayNet test set. The MRCNN (Mask R-CNN) and FRCNN (Faster R-CNN) models with ResNet-50 or ResNet-101 backbone were trained based on the network architectures from the detectron2 model zoo (Mask R-CNN R50, R101-FPN 3x, Faster R-CNN R101-FPN 3x), with default configurations. The YOLO implementation utilized was YOLOv5x6 [13]. All models were initialised using pre-trained weights from the COCO 2017 dataset.</paragraph>
<paragraph><location><page_5><loc_52><loc_10><loc_91><loc_31></location>Phase 4: Production annotation. The previously selected 80K pages were annotated with the defined 11 class labels by 32 annotators. This production phase took around three months to complete. All annotations were created online through CCS, which visualises the programmatic PDFtext-cells as an overlay on the page. The page annotation are obtained by drawing rectangular bounding-boxes, as shown in Figure 3. With regard to the annotation practices, we implemented a few constraints and capabilities on the tooling level. First, we only allow non-overlapping, vertically oriented, rectangular boxes. For the large majority of documents, this constraint was sufficient and it speeds up the annotation considerably in comparison with arbitrary segmentation shapes. Second, annotator staff were not able to see each other's annotations. This was enforced by design to avoid any bias in the annotation, which could skew the numbers of the inter-annotator agreement (see Table 1). Wewanted</paragraph>
<paragraph><location><page_6><loc_9><loc_77><loc_48><loc_89></location>Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on DocLayNet test set. The MRCNN (Mask R-CNN) and FRCNN (Faster R-CNN) models with ResNet-50 or ResNet-101 backbone were trained based on the network architectures from the detectron2 model zoo (Mask R-CNN R50, R101-FPN 3x, Faster R-CNN R101-FPN 3x), with default configurations. The YOLOimplementation utilized was YOLOv5x6 [13]. All models were initialised using pre-trained weights from the COCO 2017 dataset.</paragraph>
<table>
<location><page_6><loc_10><loc_56><loc_47><loc_75></location>
<row_0><col_0><body></col_0><col_1><col_header>human</col_1><col_2><col_header>MRCNN</col_2><col_3><col_header>MRCNN</col_3><col_4><col_header>FRCNN</col_4><col_5><col_header>YOLO</col_5></row_0>
@ -118,10 +119,10 @@
</table>
<paragraph><location><page_6><loc_9><loc_27><loc_48><loc_53></location>to avoid this at any cost in order to have clear, unbiased baseline numbers for human document-layout annotation. Third, we introduced the feature of snapping boxes around text segments to obtain a pixel-accurate annotation and again reduce time and effort. The CCS annotation tool automatically shrinks every user-drawn box to the minimum bounding-box around the enclosed text-cells for all purely text-based segments, which excludes only Table and Picture . For the latter, we instructed annotation staff to minimise inclusion of surrounding whitespace while including all graphical lines. A downside of snapping boxes to enclosed text cells is that some wrongly parsed PDF pages cannot be annotated correctly and need to be skipped. Fourth, we established a way to flag pages as rejected for cases where no valid annotation according to the label guidelines could be achieved. Example cases for this would be PDF pages that render incorrectly or contain layouts that are impossible to capture with non-overlapping rectangles. Such rejected pages are not contained in the final dataset. With all these measures in place, experienced annotation staff managed to annotate a single page in a typical timeframe of 20s to 60s, depending on its complexity.</paragraph>
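The box-snapping behaviour mentioned above amounts to shrinking a user-drawn rectangle to the tightest box around the text cells it fully encloses. The following is a minimal sketch under an assumed (x0, y0, x1, y1) rectangle format; it is not the CCS tool's implementation.

```python
# Shrink a drawn annotation box to the minimum bounding box around the
# programmatic text cells that lie entirely inside it.

def snap_to_cells(drawn_box, text_cells):
    x0, y0, x1, y1 = drawn_box
    enclosed = [
        (cx0, cy0, cx1, cy1)
        for (cx0, cy0, cx1, cy1) in text_cells
        if cx0 >= x0 and cy0 >= y0 and cx1 <= x1 and cy1 <= y1
    ]
    if not enclosed:
        return None  # nothing to snap to; the drawn box is kept or rejected
    return (
        min(c[0] for c in enclosed),
        min(c[1] for c in enclosed),
        max(c[2] for c in enclosed),
        max(c[3] for c in enclosed),
    )
```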
<subtitle-level-1><location><page_6><loc_9><loc_24><loc_24><loc_26></location>5 EXPERIMENTS</subtitle-level-1>
<paragraph><location><page_6><loc_9><loc_10><loc_48><loc_23></location>The primary goal of DocLayNet is to obtain high-quality ML models capable of accurate document-layout analysis on a wide variety of challenging layouts. As discussed in Section 2, object detection models are currently the easiest to use, due to the standardisation of ground-truth data in COCO format [16] and the availability of general frameworks such as detectron2 [17]. Furthermore, baseline numbers in PubLayNet and DocBank were obtained using standard object detection models such as Mask R-CNN and Faster R-CNN. As such, we will relate to these object detection methods in this</paragraph>
<paragraph><location><page_6><loc_9><loc_10><loc_48><loc_23></location>The primary goal of DocLayNet is to obtain high-quality MLmodels capable of accurate document-layout analysis on a wide variety of challenging layouts. As discussed in Section 2, object detection models are currently the easiest to use, due to the standardisation of ground-truth data in COCO format [16] and the availability of general frameworks such as detectron2 [17]. Furthermore, baseline numbers in PubLayNet and DocBank were obtained using standard object detection models such as Mask R-CNN and Faster R-CNN. As such, we will relate to these object detection methods in this</paragraph>
<figure>
<location><page_6><loc_53><loc_67><loc_90><loc_89></location>
<caption>Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask R-CNNnetworkwithResNet50backbonetrainedonincreasing fractions of the DocLayNet dataset. The learning curve flattens around the 80% mark, indicating that increasing the size of the DocLayNet dataset with similar data will not yield significantly better predictions.</caption>
<caption>Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask R-CNNnetworkwithResNet50backbonetrainedonincreasing fractions of the DocLayNet dataset. The learning curve flattens around the 80% mark, indicating that increasing the size of the DocLayNetdataset with similar data will not yield significantly better predictions.</caption>
</figure>
<paragraph><location><page_6><loc_52><loc_49><loc_91><loc_51></location>paper and leave the detailed evaluation of more recent methods mentioned in Section 2 for future work.</paragraph>
<paragraph><location><page_6><loc_52><loc_39><loc_91><loc_49></location>In this section, we will present several aspects related to the performance of object detection models on DocLayNet. Similarly as in PubLayNet, we will evaluate the quality of their predictions using mean average precision (mAP) with 10 overlaps that range from 0.5 to 0.95 in steps of 0.05 (mAP@0.5-0.95). These scores are computed by leveraging the evaluation code provided by the COCO API [16].</paragraph>
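The mAP@0.5-0.95 metric described above can in principle be computed with the COCO API (pycocotools), roughly as sketched below; the file names are placeholders for COCO-format ground truth and detections.

```python
# Sketch of mAP@0.5-0.95 evaluation with pycocotools.

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("doclaynet_test_ground_truth.json")   # placeholder path, COCO format
coco_dt = coco_gt.loadRes("model_predictions.json")  # placeholder path, detections

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

# evaluator.stats[0] is AP averaged over IoU thresholds 0.50:0.05:0.95,
# i.e. the mAP@0.5-0.95 number reported in the tables.
print(evaluator.stats[0])
```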
@ -168,10 +169,10 @@
</table>
<paragraph><location><page_7><loc_52><loc_47><loc_91><loc_58></location>lists in PubLayNet (grouped list-items) versus DocLayNet (separate list-items), the label set of size 4 is the closest to PubLayNet, in the assumption that the List is down-mapped to Text in PubLayNet. The results in Table 3 show that the prediction accuracy on the remaining class labels does not change significantly when other classes are merged into them. The overall macro-average improves by around 5%, in particular when Page-footer and Page-header are excluded.</paragraph>
<subtitle-level-1><location><page_7><loc_52><loc_45><loc_90><loc_46></location>Impact of Document Split in Train and Test Set</subtitle-level-1>
<paragraph><location><page_7><loc_52><loc_25><loc_91><loc_44></location>Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: pagewise splitting gains ˜ 0% in mAP over the document-wise splitting. 1 Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.</paragraph>
<paragraph><location><page_7><loc_52><loc_25><loc_91><loc_44></location>Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: pagewise splitting gains ˜ 10% in mAP over the document-wise splitting. Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.</paragraph>
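A minimal sketch of the two splitting strategies being compared, assuming a simple mapping from page ids to document ids: document-wise splitting keeps every page of a document in a single subset, while page-wise splitting draws pages independently and lets document styling leak across train and test.

```python
# Contrast document-wise and page-wise dataset splitting.

import random

def document_wise_split(pages, doc_id_of, test_fraction=0.1, seed=0):
    """pages: list of page ids; doc_id_of: dict page_id -> document id."""
    rng = random.Random(seed)
    doc_ids = sorted({doc_id_of[p] for p in pages})
    rng.shuffle(doc_ids)
    n_test_docs = max(1, int(len(doc_ids) * test_fraction))
    test_docs = set(doc_ids[:n_test_docs])
    test = [p for p in pages if doc_id_of[p] in test_docs]
    train = [p for p in pages if doc_id_of[p] not in test_docs]
    return train, test

def page_wise_split(pages, test_fraction=0.1, seed=0):
    rng = random.Random(seed)
    shuffled = pages[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```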
<subtitle-level-1><location><page_7><loc_52><loc_22><loc_68><loc_23></location>Dataset Comparison</subtitle-level-1>
<paragraph><location><page_7><loc_52><loc_11><loc_91><loc_21></location>Throughout this paper, we claim that DocLayNet's wider variety of document layouts leads to more robust layout detection models. In Table 5, we provide evidence for that. We trained models on each of the available datasets (PubLayNet, DocBank and DocLayNet) and evaluated them on the test sets of the other datasets. Due to the different label sets and annotation styles, a direct comparison is not possible. Hence, we focussed on the common labels among the datasets. Between PubLayNet and DocLayNet, these are Picture ,</paragraph>
<paragraph><location><page_8><loc_9><loc_81><loc_48><loc_89></location>Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network across the PubLayNet, DocBank & DocLayNet data-sets. By evaluating on common label classes of each dataset, we observe that the DocLayNet-trained model has much less pronounced variations in performance across all datasets.</paragraph>
<paragraph><location><page_8><loc_9><loc_81><loc_48><loc_89></location>Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network across the PubLayNet, DocBank & DocLayNet data-sets. Byevaluating on commonlabel classes of each dataset, we observe that the DocLayNet-trained model has muchless pronounced variations in performance across all datasets.</paragraph>
<table>
<location><page_8><loc_12><loc_57><loc_45><loc_78></location>
<row_0><col_0><body></col_0><col_1><body></col_1><col_2><col_header>Testing on</col_2><col_3><col_header>Testing on</col_3><col_4><col_header>Testing on</col_4></row_0>
@ -191,7 +192,7 @@
<row_14><col_0><body></col_0><col_1><row_header>total</col_1><col_2><body>59</col_2><col_3><body>47</col_3><col_4><body>78</col_4></row_14>
</table>
<paragraph><location><page_8><loc_9><loc_44><loc_48><loc_51></location>Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .</paragraph>
<paragraph><location><page_8><loc_9><loc_26><loc_48><loc_44></location>For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. We had to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.</paragraph>
<paragraph><location><page_8><loc_9><loc_26><loc_48><loc_44></location>For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. Wehad to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.</paragraph>
<subtitle-level-1><location><page_8><loc_9><loc_22><loc_25><loc_23></location>Example Predictions</subtitle-level-1>
<paragraph><location><page_8><loc_9><loc_11><loc_48><loc_22></location>To conclude this section, we illustrate the quality of layout predictions one can expect from DocLayNet-trained models by providing a selection of examples without any further post-processing applied. Figure 6 shows selected layout predictions on pages from the test-set of DocLayNet. Results look decent in general across document categories, however one can also observe mistakes such as overlapping clusters of different classes, or entirely missing boxes due to low confidence.</paragraph>
<subtitle-level-1><location><page_8><loc_52><loc_88><loc_66><loc_89></location>6 CONCLUSION</subtitle-level-1>
@ -202,7 +203,7 @@
<paragraph><location><page_8><loc_52><loc_53><loc_91><loc_56></location>- [1] Max Göbel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013.</paragraph>
<paragraph><location><page_8><loc_52><loc_49><loc_91><loc_53></location>- [2] Christian Clausner, Apostolos Antonacopoulos, and Stefan Pletschacher. Icdar2017 competition on recognition of documents with complex layouts rdcl2017. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) , volume 01, pages 1404-1410, 2017.</paragraph>
<paragraph><location><page_8><loc_52><loc_46><loc_91><loc_49></location>- [3] Hervé Déjean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), April 2019. http://sac.founderit.com/.</paragraph>
<paragraph><location><page_8><loc_52><loc_42><loc_91><loc_46></location>- [4] Antonio Jimeno Yepes, Peter Zhong, and Douglas Burdick. Competition on scientific literature parsing. In Proceedings of the International Conference on Document Analysis and Recognition , ICDAR, pages 605-617. LNCS 12824, SpringerVerlag, sep 2021.</paragraph>
<paragraph><location><page_8><loc_52><loc_42><loc_91><loc_46></location>- [4] Antonio Jimeno Yepes, Peter Zhong, and Douglas Burdick. Competition on scientific literature parsing. In Proceedings of the International Conference on Document Analysis and Recognition , ICDAR, pages 605-617. LNCS12824, SpringerVerlag, sep 2021.</paragraph>
<paragraph><location><page_8><loc_52><loc_38><loc_91><loc_42></location>- [5] Logan Markewich, Hao Zhang, Yubin Xing, Navid Lambert-Shirzad, Jiang Zhexin, Roy Lee, Zhi Li, and Seok-Bum Ko. Segmentation for document layout analysis: not dead yet. International Journal on Document Analysis and Recognition (IJDAR) , pages 1-11, 01 2022.</paragraph>
<paragraph><location><page_8><loc_52><loc_35><loc_91><loc_38></location>- [6] Xu Zhong, Jianbin Tang, and Antonio Jimeno-Yepes. Publaynet: Largest dataset ever for document layout analysis. In Proceedings of the International Conference on Document Analysis and Recognition , ICDAR, pages 1015-1022, sep 2019.</paragraph>
<paragraph><location><page_8><loc_52><loc_30><loc_91><loc_35></location>- [7] Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. Docbank: A benchmark dataset for document layout analysis. In Proceedings of the 28th International Conference on Computational Linguistics , COLING, pages 949-960. International Committee on Computational Linguistics, dec 2020.</paragraph>

File diff suppressed because one or more lines are too long


@ -1,3 +1,5 @@
2022
## DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com
@ -43,7 +45,7 @@ Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staa
Despite the substantial improvements achieved with machine-learning (ML) approaches and deep neural networks in recent years, document conversion remains a challenging problem, as demonstrated by the numerous public competitions held on this topic [1-4]. The challenge originates from the huge variability in PDF documents regarding layout, language and formats (scanned, programmatic or a combination of both). Engineering a single ML model that can be applied on all types of documents and provides high-quality layout segmentation remains to this day extremely challenging [5]. To highlight the variability in document layouts, we show a few example documents from the DocLayNet dataset in Figure 1.
Akeyproblem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or L A T E X sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank is very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.
Akeyproblem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or LT E X A sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank is very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.
In this paper, we present the DocLayNet dataset. It provides pageby-page layout annotation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique document pages, of which a fraction carry double- or triple-annotations. DocLayNet is similar in spirit to PubLayNet and DocBank and will likewise be made available to the public 1 in order to stimulate the document-layout analysis community. It distinguishes itself in the following aspects:
@ -65,13 +67,13 @@ In Section 5, we will present baseline accuracy numbers for a variety of object
## 2 RELATED WORK
While early approaches in document-layout analysis used rulebased algorithms and heuristics [8], the problem is lately addressed with deep learning methods. The most common approach is to leverage object detection models [9-15]. In the last decade, the accuracy and speed of these models has increased dramatically. Furthermore, most state-of-the-art object detection methods can be trained and applied with very little work, thanks to a standardisation effort of the ground-truth data format [16] and common deep-learning frameworks [17]. Reference data sets such as PubLayNet [6] and DocBank provide their data in the commonly accepted COCO format [16].
While early approaches in document-layout analysis used rulebased algorithms and heuristics [8], the problem is lately addressed with deep learning methods. The most commonapproach is to leverage object detection models [9-15]. In the last decade, the accuracy and speed of these models has increased dramatically. Furthermore, most state-of-the-art object detection methods can be trained and applied with very little work, thanks to a standardisation effort of the ground-truth data format [16] and common deep-learning frameworks [17]. Reference data sets such as PubLayNet [6] and DocBank provide their data in the commonly accepted COCO format [16].
Lately, new types of ML models for document-layout analysis have emerged in the community [18-21]. These models do not approach the problem of layout analysis purely based on an image representation of the page, as computer vision methods do. Instead, they combine the text tokens and image representation of a page in order to obtain a segmentation. While the reported accuracies appear to be promising, a broadly accepted data format which links geometric and textual features has yet to establish.
## 3 THE DOCLAYNET DATASET
DocLayNet contains 80863 PDF pages. Among these, 7059 carry two instances of human annotations, and 1591 carry three. This amounts to 91104 total annotation instances. The annotations provide layout information in the shape of labeled, rectangular boundingboxes. We define 11 distinct labels for layout features, namely Caption , Footnote , Formula List-item , , Page-footer , Page-header , Picture , Section-header , Table , Text , and Title . Our reasoning for picking this particular label set is detailed in Section 4.
DocLayNet contains 80863 PDF pages. Amongthese, 7059 carry two instances of humanannotations, and 1591 carry three. This amounts to 91104 total annotation instances. The annotations provide layout information in the shape of labeled, rectangular boundingboxes. We define 11 distinct labels for layout features, namely Caption , Footnote , Formula List-item , , Page-footer , Page-header , Picture , Section-header , Table , Text , and Title . Our reasoning for picking this particular label set is detailed in Section 4.
In addition to open intellectual property constraints for the source documents, we required that the documents in DocLayNet adhere to a few conditions. Firstly, we kept scanned documents
@ -98,32 +100,32 @@ The annotation campaign was carried out in four phases. In phase one, we identif
Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row 'Total') in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.
| | | % of Total | % of Total | % of Total | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) | triple inter-annotator mAP @ 0.5-0.95 (%) |
|----------------|---------|--------------|--------------|--------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
| class label | Count | Train | Test | Val | All | Fin | Man | Sci | Law | Pat | Ten |
| Caption | 22524 | 2.04 | 1.77 | 2.32 | 84-89 | 40-61 | 86-92 | 94-99 | 95-99 | 69-78 | n/a |
| Footnote | 6318 | 0.60 | 0.31 | 0.58 | 83-91 | n/a | 100 | 62-88 | 85-94 | n/a | 82-97 |
| Formula | 25027 | 2.25 | 1.90 | 2.96 | 83-85 | n/a | n/a | 84-87 | 86-96 | n/a | n/a |
| List-item | 185660 | 17.19 | 13.34 | 15.82 | 87-88 | 74-83 | 90-92 | 97-97 | 81-85 | 75-88 | 93-95 |
| Page-footer | 70878 | 6.51 | 5.58 | 6.00 | 93-94 | 88-90 | 95-96 | 100 | 92-97 | 100 | 96-98 |
| Page-header | 58022 | 5.10 | 6.70 | 5.06 | 85-89 | 66-76 | 90-94 | 98-100 | 91-92 | 97-99 | 81-86 |
| Picture | 45976 | 4.21 | 2.78 | 5.31 | 69-71 | 56-59 | 82-86 | 69-82 | 80-95 | 66-71 | 59-76 |
| Section-header | 142884 | 12.60 | 15.77 | 12.85 | 83-84 | 76-81 | 90-92 | 94-95 | 87-94 | 69-73 | 78-86 |
| Table | 34733 | 3.20 | 2.27 | 3.60 | 77-81 | 75-80 | 83-86 | 98-99 | 58-80 | 79-84 | 70-85 |
| Text | 510377 | 45.82 | 49.28 | 45.00 | 84-86 | 81-86 | 88-93 | 89-93 | 87-92 | 71-79 | 87-95 |
| Title | 5071 | 0.47 | 0.30 | 0.50 | 60-72 | 24-63 | 50-63 | 94-100 | 82-96 | 68-79 | 24-56 |
| Total | 1107470 | 941123 | 99816 | 66531 | 82-83 | 71-74 | 79-81 | 89-94 | 86-91 | 71-76 | 68-85 |
| | | % of Total | % of Total | % of Total | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) |
|----------------|---------|--------------|--------------|--------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|
| class label | Count | Train | Test | Val | All | Fin | Man | Sci | Law | Pat | Ten |
| Caption | 22524 | 2.04 | 1.77 | 2.32 | 84-89 | 40-61 | 86-92 | 94-99 | 95-99 | 69-78 | n/a |
| Footnote | 6318 | 0.60 | 0.31 | 0.58 | 83-91 | n/a | 100 | 62-88 | 85-94 | n/a | 82-97 |
| Formula | 25027 | 2.25 | 1.90 | 2.96 | 83-85 | n/a | n/a | 84-87 | 86-96 | n/a | n/a |
| List-item | 185660 | 17.19 | 13.34 | 15.82 | 87-88 | 74-83 | 90-92 | 97-97 | 81-85 | 75-88 | 93-95 |
| Page-footer | 70878 | 6.51 | 5.58 | 6.00 | 93-94 | 88-90 | 95-96 | 100 | 92-97 | 100 | 96-98 |
| Page-header | 58022 | 5.10 | 6.70 | 5.06 | 85-89 | 66-76 | 90-94 | 98-100 | 91-92 | 97-99 | 81-86 |
| Picture | 45976 | 4.21 | 2.78 | 5.31 | 69-71 | 56-59 | 82-86 | 69-82 | 80-95 | 66-71 | 59-76 |
| Section-header | 142884 | 12.60 | 15.77 | 12.85 | 83-84 | 76-81 | 90-92 | 94-95 | 87-94 | 69-73 | 78-86 |
| Table | 34733 | 3.20 | 2.27 | 3.60 | 77-81 | 75-80 | 83-86 | 98-99 | 58-80 | 79-84 | 70-85 |
| Text | 510377 | 45.82 | 49.28 | 45.00 | 84-86 | 81-86 | 88-93 | 89-93 | 87-92 | 71-79 | 87-95 |
| Title | 5071 | 0.47 | 0.30 | 0.50 | 60-72 | 24-63 | 50-63 | 94-100 | 82-96 | 68-79 | 24-56 |
| Total | 1107470 | 941123 | 99816 | 66531 | 82-83 | 71-74 | 79-81 | 89-94 | 86-91 | 71-76 | 68-85 |
Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.
<!-- image -->
we distributed the annotation workload and performed continuous quality controls. Phase one and two required a small team of experts only. For phases three and four, a group of 40 dedicated annotators were assembled and supervised.
Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. A large effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv 3 , government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.
Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. Alarge effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv 3 , government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.
Preparation work included uploading and parsing the sourced PDF documents in the Corpus Conversion Service (CCS) [22], a cloud-native platform which provides a visual annotation interface and allows for dataset inspection and analysis. The annotation interface of CCS is shown in Figure 3. The desired balance of pages between the different document categories was achieved by selective subsampling of pages with certain desired properties. For example, we made sure to include the title page of each document and bias the remaining page selection to those with figures or tables. The latter was achieved by leveraging pre-trained object detection models from PubLayNet, which helped us estimate how many figures and tables a given page contains.
Phase 2: Label selection and guideline. We reviewed the collected documents and identified the most common structural features they exhibit. This was achieved by identifying recurrent layout elements and lead us to the definition of 11 distinct class labels. These 11 class labels are Caption , Footnote , Formula List-item , , Pagefooter , Page-header , Picture , Section-header , Table , Text , and Title . Critical factors that were considered for the choice of these class labels were (1) the overall occurrence of the label, (2) the specificity of the label, (3) recognisability on a single page (i.e. no need for context from previous or next page) and (4) overall coverage of the page. Specificity ensures that the choice of label is not ambiguous, while coverage ensures that all meaningful items on a page can be annotated. We refrained from class labels that are very specific to a document category, such as Abstract in the Scientific Articles category. We also avoided class labels that are tightly linked to the semantics of the text. Labels such as Author and Affiliation , as seen in DocBank, are often only distinguishable by discriminating on
Phase 2: Label selection and guideline. We reviewed the collected documents and identified the most common structural features they exhibit. This wasachieved by identifying recurrent layout elements and lead us to the definition of 11 distinct class labels. These 11 class labels are Caption , Footnote , Formula List-item , , Pagefooter , Page-header , Picture , Section-header , Table , Text , and Title . Critical factors that were considered for the choice of these class labels were (1) the overall occurrence of the label, (2) the specificity of the label, (3) recognisability on a single page (i.e. no need for context from previous or next page) and (4) overall coverage of the page. Specificity ensures that the choice of label is not ambiguous, while coverage ensures that all meaningful items on a page can be annotated. We refrained from class labels that are very specific to a document category, such as Abstract in the Scientific Articles category. We also avoided class labels that are tightly linked to the semantics of the text. Labels such as Author and Affiliation , as seen in DocBank, are often only distinguishable by discriminating on
the textual content of an element, which goes beyond visual layout recognition, in particular outside the Scientific Articles category.
@ -155,9 +157,9 @@ Figure 4: Examples of plausible annotation alternatives for the same page. Crite
were carried out over a timeframe of 12 weeks, after which 8 of the 40 initially allocated annotators did not pass the bar.
Phase 4: Production annotation. The previously selected 80K pages were annotated with the defined 11 class labels by 32 annotators. This production phase took around three months to complete. All annotations were created online through CCS, which visualises the programmatic PDF text-cells as an overlay on the page. The page annotation are obtained by drawing rectangular bounding-boxes, as shown in Figure 3. With regard to the annotation practices, we implemented a few constraints and capabilities on the tooling level. First, we only allow non-overlapping, vertically oriented, rectangular boxes. For the large majority of documents, this constraint was sufficient and it speeds up the annotation considerably in comparison with arbitrary segmentation shapes. Second, annotator staff were not able to see each other's annotations. This was enforced by design to avoid any bias in the annotation, which could skew the numbers of the inter-annotator agreement (see Table 1). We wanted
Phase 4: Production annotation. The previously selected 80K pages were annotated with the defined 11 class labels by 32 annotators. This production phase took around three months to complete. All annotations were created online through CCS, which visualises the programmatic PDFtext-cells as an overlay on the page. The page annotation are obtained by drawing rectangular bounding-boxes, as shown in Figure 3. With regard to the annotation practices, we implemented a few constraints and capabilities on the tooling level. First, we only allow non-overlapping, vertically oriented, rectangular boxes. For the large majority of documents, this constraint was sufficient and it speeds up the annotation considerably in comparison with arbitrary segmentation shapes. Second, annotator staff were not able to see each other's annotations. This was enforced by design to avoid any bias in the annotation, which could skew the numbers of the inter-annotator agreement (see Table 1). Wewanted
Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on DocLayNet test set. The MRCNN (Mask R-CNN) and FRCNN (Faster R-CNN) models with ResNet-50 or ResNet-101 backbone were trained based on the network architectures from the detectron2 model zoo (Mask R-CNN R50, R101-FPN 3x, Faster R-CNN R101-FPN 3x), with default configurations. The YOLO implementation utilized was YOLOv5x6 [13]. All models were initialised using pre-trained weights from the COCO 2017 dataset.
Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on DocLayNet test set. The MRCNN (Mask R-CNN) and FRCNN (Faster R-CNN) models with ResNet-50 or ResNet-101 backbone were trained based on the network architectures from the detectron2 model zoo (Mask R-CNN R50, R101-FPN 3x, Faster R-CNN R101-FPN 3x), with default configurations. The YOLOimplementation utilized was YOLOv5x6 [13]. All models were initialised using pre-trained weights from the COCO 2017 dataset.
| | human | MRCNN | MRCNN | FRCNN | YOLO |
|----------------|---------|---------|---------|---------|--------|
@ -179,9 +181,9 @@ to avoid this at any cost in order to have clear, unbiased baseline numbers for
## 5 EXPERIMENTS
The primary goal of DocLayNet is to obtain high-quality ML models capable of accurate document-layout analysis on a wide variety of challenging layouts. As discussed in Section 2, object detection models are currently the easiest to use, due to the standardisation of ground-truth data in COCO format [16] and the availability of general frameworks such as detectron2 [17]. Furthermore, baseline numbers in PubLayNet and DocBank were obtained using standard object detection models such as Mask R-CNN and Faster R-CNN. As such, we will relate to these object detection methods in this
The primary goal of DocLayNet is to obtain high-quality MLmodels capable of accurate document-layout analysis on a wide variety of challenging layouts. As discussed in Section 2, object detection models are currently the easiest to use, due to the standardisation of ground-truth data in COCO format [16] and the availability of general frameworks such as detectron2 [17]. Furthermore, baseline numbers in PubLayNet and DocBank were obtained using standard object detection models such as Mask R-CNN and Faster R-CNN. As such, we will relate to these object detection methods in this
Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask R-CNN network with ResNet50 backbone trained on increasing fractions of the DocLayNet dataset. The learning curve flattens around the 80% mark, indicating that increasing the size of the DocLayNet dataset with similar data will not yield significantly better predictions.
Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask R-CNN network with ResNet50 backbone trained on increasing fractions of the DocLayNet dataset. The learning curve flattens around the 80% mark, indicating that increasing the size of the DocLayNetdataset with similar data will not yield significantly better predictions.
<!-- image -->
paper and leave the detailed evaluation of more recent methods mentioned in Section 2 for future work.
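Since the ground truth is distributed in COCO format, baselines of this kind can be reproduced with standard tooling. The snippet below is a hedged sketch of fine-tuning a COCO-pre-trained Mask R-CNN R50-FPN model with detectron2; the dataset names, annotation paths and solver settings are placeholder assumptions, not values from the paper.

```python
# Sketch: fine-tuning a Mask R-CNN R50-FPN baseline on a COCO-format layout dataset
# with detectron2. Dataset names, paths and solver settings are placeholders.
import os
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register train/test splits that follow the COCO annotation format.
register_coco_instances("layout_train", {}, "annotations/train.json", "images/train")
register_coco_instances("layout_test", {}, "annotations/test.json", "images/test")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")  # COCO 2017 pre-trained weights
cfg.DATASETS.TRAIN = ("layout_train",)
cfg.DATASETS.TEST = ("layout_test",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 11  # the 11 class labels defined above
cfg.SOLVER.MAX_ITER = 60000           # placeholder; tune to the dataset size

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```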
@ -239,13 +241,13 @@ lists in PubLayNet (grouped list-items) versus DocLayNet (separate list-items),
## Impact of Document Split in Train and Test Set
Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: pagewise splitting gains ˜ 0% in mAP over the document-wise splitting. 1 Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.
Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: pagewise splitting gains ˜ 10% in mAP over the document-wise splitting. Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.
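A document-wise split of this kind can be reproduced by grouping pages by their source document before drawing the random split. The following is a minimal sketch under the assumption that each page record carries a doc_id field; field names and split ratios are illustrative.

```python
# Sketch: split pages into train/val/test on document boundaries, so every
# document contributes pages to only one set. Field names and ratios are illustrative.
import random
from collections import defaultdict

def split_by_document(pages, ratios=(0.8, 0.1, 0.1), seed=42):
    """pages: list of dicts, each with at least a 'doc_id' key."""
    by_doc = defaultdict(list)
    for page in pages:
        by_doc[page["doc_id"]].append(page)

    doc_ids = sorted(by_doc)
    random.Random(seed).shuffle(doc_ids)

    n_train = int(ratios[0] * len(doc_ids))
    n_val = int(ratios[1] * len(doc_ids))
    buckets = {
        "train": doc_ids[:n_train],
        "val": doc_ids[n_train:n_train + n_val],
        "test": doc_ids[n_train + n_val:],
    }
    # All pages of a document end up in exactly one split.
    return {name: [p for d in ids for p in by_doc[d]] for name, ids in buckets.items()}
```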
## Dataset Comparison
Throughout this paper, we claim that DocLayNet's wider variety of document layouts leads to more robust layout detection models. In Table 5, we provide evidence for that. We trained models on each of the available datasets (PubLayNet, DocBank and DocLayNet) and evaluated them on the test sets of the other datasets. Due to the different label sets and annotation styles, a direct comparison is not possible. Hence, we focussed on the common labels among the datasets. Between PubLayNet and DocLayNet, these are Picture ,
Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network across the PubLayNet, DocBank & DocLayNet data-sets. By evaluating on common label classes of each dataset, we observe that the DocLayNet-trained model has much less pronounced variations in performance across all datasets.
Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network across the PubLayNet, DocBank & DocLayNet data-sets. Byevaluating on commonlabel classes of each dataset, we observe that the DocLayNet-trained model has muchless pronounced variations in performance across all datasets.
| | | Testing on | Testing on | Testing on |
|-----------------|------------|--------------|--------------|--------------|
@ -266,7 +268,7 @@ Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network acros
Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in Table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .
For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. We had to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.
For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. Wehad to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.
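For the cross-dataset evaluation described here, ground truth and predictions first have to be reduced to the shared label set. The sketch below illustrates that reduction for the PubLayNet/DocLayNet case named in the text (common labels Picture, Section-header, Table, Text; PubLayNet's List remapped to Text); it is an illustrative simplification, not the exact Table 3 mapping.

```python
# Sketch: restrict annotations to the common label set before cross-dataset
# evaluation. The mapping below is an illustrative simplification.
COMMON_LABELS = {"Picture", "Section-header", "Table", "Text"}
LABEL_MAP = {"List": "Text"}  # PubLayNet's List is remapped to Text

def to_common_labels(annotations):
    """annotations: iterable of dicts with a 'label' key; other keys pass through."""
    reduced = []
    for ann in annotations:
        label = LABEL_MAP.get(ann["label"], ann["label"])
        if label in COMMON_LABELS:
            reduced.append({**ann, "label": label})
    return reduced
```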
## Example Predictions
@ -288,7 +290,7 @@ To date, there is still a significant gap between human and ML accuracy on the l
- [3] Hervé Déjean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), April 2019. http://sac.founderit.com/.
- [4] Antonio Jimeno Yepes, Peter Zhong, and Douglas Burdick. Competition on scientific literature parsing. In Proceedings of the International Conference on Document Analysis and Recognition , ICDAR, pages 605-617. LNCS 12824, SpringerVerlag, sep 2021.
- [4] Antonio Jimeno Yepes, Peter Zhong, and Douglas Burdick. Competition on scientific literature parsing. In Proceedings of the International Conference on Document Analysis and Recognition , ICDAR, pages 605-617. LNCS12824, SpringerVerlag, sep 2021.
- [5] Logan Markewich, Hao Zhang, Yubin Xing, Navid Lambert-Shirzad, Jiang Zhexin, Roy Lee, Zhi Li, and Seok-Bum Ko. Segmentation for document layout analysis: not dead yet. International Journal on Document Analysis and Recognition (IJDAR) , pages 1-11, 01 2022.

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@ -1,4 +1,5 @@
<document>
<paragraph><location><page_1><loc_3><loc_74><loc_6><loc_79></location>2023</paragraph>
<subtitle-level-1><location><page_1><loc_22><loc_82><loc_79><loc_85></location>Optimized Table Tokenization for Table Structure Recognition</subtitle-level-1>
<paragraph><location><page_1><loc_23><loc_75><loc_78><loc_79></location>Maksym Lysak [0000 - 0002 - 3723 - 6960] , Ahmed Nassar [0000 - 0002 - 9468 - 0822] , Nikolaos Livathinos [0000 - 0001 - 8513 - 3491] , Christoph Auer [0000 - 0001 - 5761 - 0422] , [0000 - 0002 - 8088 - 0823]</paragraph>
<paragraph><location><page_1><loc_38><loc_74><loc_49><loc_75></location>and Peter Staar</paragraph>
@ -21,7 +22,7 @@
<subtitle-level-1><location><page_3><loc_22><loc_40><loc_39><loc_42></location>2 Related Work</subtitle-level-1>
<paragraph><location><page_3><loc_22><loc_16><loc_79><loc_38></location>Approaches to formalize the logical structure and layout of tables in electronic documents date back more than two decades [16]. In the recent past, a wide variety of computer vision methods have been explored to tackle the problem of table structure recognition, i.e. the correct identification of columns, rows and spanning cells in a given table. Broadly speaking, the current deeplearning based approaches fall into three categories: object detection (OD) methods, Graph-Neural-Network (GNN) methods and Image-to-Markup-Sequence (Im2Seq) methods. Object-detection based methods [11,12,13,14,21] rely on tablestructure annotation using (overlapping) bounding boxes for training, and produce bounding-box predictions to define table cells, rows, and columns on a table image. Graph Neural Network (GNN) based methods [3,6,17,18], as the name suggests, represent tables as graph structures. The graph nodes represent the content of each table cell, an embedding vector from the table image, or geometric coordinates of the table cell. The edges of the graph define the relationship between the nodes, e.g. if they belong to the same column, row, or table cell.</paragraph>
<paragraph><location><page_4><loc_22><loc_67><loc_79><loc_85></location>Other work [20] aims at predicting a grid for each table and deciding which cells must be merged using an attention network. Im2Seq methods cast the problem as a sequence generation task [4,5,9,22], and therefore need an internal tablestructure representation language, which is often implemented with standard markup languages (e.g. HTML, LaTeX, Markdown). In theory, Im2Seq methods have a natural advantage over the OD and GNN methods by virtue of directly predicting the table-structure. As such, no post-processing or rules are needed in order to obtain the table-structure, which is necessary with OD and GNN approaches. In practice, this is not entirely true, because a predicted sequence of table-structure markup does not necessarily have to be syntactically correct. Hence, depending on the quality of the predicted sequence, some post-processing needs to be performed to ensure a syntactically valid (let alone correct) sequence.</paragraph>
<paragraph><location><page_4><loc_22><loc_39><loc_79><loc_67></location>Within the Im2Seq method, we find several popular models, namely the encoder-dual-decoder model (EDD) [22], TableFormer [9], Tabsplitter[2] and Ye et. al. [19]. EDD uses two consecutive long short-term memory (LSTM) decoders to predict a table in HTML representation. The tag decoder predicts a sequence of HTML tags. For each decoded table cell ( <td> ), the attention is passed to the cell decoder to predict the content with an embedded OCR approach. The latter makes it susceptible to transcription errors in the cell content of the table. TableFormer address this reliance on OCR and uses two transformer decoders for HTML structure and cell bounding box prediction in an end-to-end architecture. The predicted cell bounding box is then used to extract text tokens from an originating (digital) PDF page, circumventing any need for OCR. TabSplitter [2] proposes a compact double-matrix representation of table rows and columns to do error detection and error correction of HTML structure sequences based on predictions from [19]. This compact double-matrix representation can not be used directly by the Img2seq model training, so the model uses HTML as an intermediate form. Chi et. al. [4] introduce a data set and a baseline method using bidirectional LSTMs to predict LaTeX code. Kayal [5] introduces Gated ResNet transformers to predict LaTeX code, and a separate OCR module to extract content.</paragraph>
<paragraph><location><page_4><loc_22><loc_39><loc_79><loc_67></location>Within the Im2Seq method, we find several popular models, namely the encoder-dual-decoder model (EDD) [22], TableFormer [9], Tabsplitter[2] and Ye et. al. [19]. EDD uses two consecutive long short-term memory (LSTM) decoders to predict a table in HTML representation. The tag decoder predicts a sequence of HTML tags. For each decoded table cell ( <td> ), the attention is passed to the cell decoder to predict the content with an embedded OCR approach. The latter makes it susceptible to transcription errors in the cell content of the table. TableFormer address this reliance on OCRand uses two transformer decoders for HTML structure and cell bounding box prediction in an end-to-end architecture. The predicted cell bounding box is then used to extract text tokens from an originating (digital) PDF page, circumventing any need for OCR. TabSplitter [2] proposes a compact double-matrix representation of table rows and columns to do error detection and error correction of HTML structure sequences based on predictions from [19]. This compact double-matrix representation can not be used directly by the Img2seq model training, so the model uses HTML as an intermediate form. Chi et. al. [4] introduce a data set and a baseline method using bidirectional LSTMs to predict LaTeX code. Kayal [5] introduces Gated ResNet transformers to predict LaTeX code, and a separate OCR module to extract content.</paragraph>
<paragraph><location><page_4><loc_22><loc_26><loc_79><loc_38></location>Im2Seq approaches have shown to be well-suited for the TSR task and allow a full end-to-end network design that can output the final table structure without pre- or post-processing logic. Furthermore, Im2Seq models have demonstrated to deliver state-of-the-art prediction accuracy [9]. This motivated the authors to investigate if the performance (both in accuracy and inference time) can be further improved by optimising the table structure representation language. We believe this is a necessary step before further improving neural network architectures for this task.</paragraph>
<subtitle-level-1><location><page_4><loc_22><loc_22><loc_44><loc_24></location>3 Problem Statement</subtitle-level-1>
<paragraph><location><page_4><loc_22><loc_16><loc_79><loc_20></location>All known Im2Seq based models for TSR fundamentally work in similar ways. Given an image of a table, the Im2Seq model predicts the structure of the table by generating a sequence of tokens. These tokens originate from a finite vocab-</paragraph>

File diff suppressed because one or more lines are too long

View File

@ -1,3 +1,5 @@
2023
## Optimized Table Tokenization for Table Structure Recognition
Maksym Lysak [0000 - 0002 - 3723 - 6960] , Ahmed Nassar [0000 - 0002 - 9468 - 0822] , Nikolaos Livathinos [0000 - 0001 - 8513 - 3491] , Christoph Auer [0000 - 0001 - 5761 - 0422] , [0000 - 0002 - 8088 - 0823]
@ -37,7 +39,7 @@ Approaches to formalize the logical structure and layout of tables in electronic
Other work [20] aims at predicting a grid for each table and deciding which cells must be merged using an attention network. Im2Seq methods cast the problem as a sequence generation task [4,5,9,22], and therefore need an internal tablestructure representation language, which is often implemented with standard markup languages (e.g. HTML, LaTeX, Markdown). In theory, Im2Seq methods have a natural advantage over the OD and GNN methods by virtue of directly predicting the table-structure. As such, no post-processing or rules are needed in order to obtain the table-structure, which is necessary with OD and GNN approaches. In practice, this is not entirely true, because a predicted sequence of table-structure markup does not necessarily have to be syntactically correct. Hence, depending on the quality of the predicted sequence, some post-processing needs to be performed to ensure a syntactically valid (let alone correct) sequence.
Within the Im2Seq method, we find several popular models, namely the encoder-dual-decoder model (EDD) [22], TableFormer [9], Tabsplitter[2] and Ye et. al. [19]. EDD uses two consecutive long short-term memory (LSTM) decoders to predict a table in HTML representation. The tag decoder predicts a sequence of HTML tags. For each decoded table cell ( <td> ), the attention is passed to the cell decoder to predict the content with an embedded OCR approach. The latter makes it susceptible to transcription errors in the cell content of the table. TableFormer address this reliance on OCR and uses two transformer decoders for HTML structure and cell bounding box prediction in an end-to-end architecture. The predicted cell bounding box is then used to extract text tokens from an originating (digital) PDF page, circumventing any need for OCR. TabSplitter [2] proposes a compact double-matrix representation of table rows and columns to do error detection and error correction of HTML structure sequences based on predictions from [19]. This compact double-matrix representation can not be used directly by the Img2seq model training, so the model uses HTML as an intermediate form. Chi et. al. [4] introduce a data set and a baseline method using bidirectional LSTMs to predict LaTeX code. Kayal [5] introduces Gated ResNet transformers to predict LaTeX code, and a separate OCR module to extract content.
Within the Im2Seq method, we find several popular models, namely the encoder-dual-decoder model (EDD) [22], TableFormer [9], Tabsplitter[2] and Ye et. al. [19]. EDD uses two consecutive long short-term memory (LSTM) decoders to predict a table in HTML representation. The tag decoder predicts a sequence of HTML tags. For each decoded table cell ( <td> ), the attention is passed to the cell decoder to predict the content with an embedded OCR approach. The latter makes it susceptible to transcription errors in the cell content of the table. TableFormer address this reliance on OCRand uses two transformer decoders for HTML structure and cell bounding box prediction in an end-to-end architecture. The predicted cell bounding box is then used to extract text tokens from an originating (digital) PDF page, circumventing any need for OCR. TabSplitter [2] proposes a compact double-matrix representation of table rows and columns to do error detection and error correction of HTML structure sequences based on predictions from [19]. This compact double-matrix representation can not be used directly by the Img2seq model training, so the model uses HTML as an intermediate form. Chi et. al. [4] introduce a data set and a baseline method using bidirectional LSTMs to predict LaTeX code. Kayal [5] introduces Gated ResNet transformers to predict LaTeX code, and a separate OCR module to extract content.
Im2Seq approaches have shown to be well-suited for the TSR task and allow a full end-to-end network design that can output the final table structure without pre- or post-processing logic. Furthermore, Im2Seq models have demonstrated to deliver state-of-the-art prediction accuracy [9]. This motivated the authors to investigate if the performance (both in accuracy and inference time) can be further improved by optimising the table structure representation language. We believe this is a necessary step before further improving neural network architectures for this task.
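As a concrete illustration of the Im2Seq formulation discussed above, the target a sequence model is trained to emit for a simple 2 x 2 table can be written as a flat sequence of HTML structure tags drawn from a small, finite vocabulary. The toy example below shows only the generic HTML tagging mentioned here; it is not the optimized tokenization that the paper itself proposes.

```python
# Toy illustration of Im2Seq targets: table structure as a flat token sequence
# over a finite vocabulary of HTML structure tags (generic tagging, not the
# paper's optimized tokenization).
HTML_STRUCTURE_VOCAB = ["<table>", "</table>", "<tr>", "</tr>", "<td>", "</td>"]

# Target sequence for a 2 x 2 table with empty cells.
two_by_two = [
    "<table>",
    "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>",
    "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>",
    "</table>",
]
assert all(token in HTML_STRUCTURE_VOCAB for token in two_by_two)
```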

File diff suppressed because one or more lines are too long

View File

@ -1,21 +1,21 @@
<document>
<paragraph><location><page_1><loc_12><loc_88><loc_52><loc_94></location>pulleys, provided the inner race of the bearing is clamped to the supporting structure by the nut and bolt. Plates must be attached to the structure in a positive manner to eliminate rotation or misalignment when tightening the bolts or screws.</paragraph>
<paragraph><location><page_1><loc_12><loc_77><loc_52><loc_86></location>The two general types of self-locking nuts currently in use are the all-metal type and the fiber lock type. For the sake of simplicity, only three typical kinds of self-locking nuts are considered in this handbook: the Boots self-locking and the stainless steel self-locking nuts, representing the all-metal types; and the elastic stop nut, representing the fiber insert type.</paragraph>
<paragraph><location><page_1><loc_12><loc_77><loc_52><loc_86></location>The two general types of self-locking nuts currently in use are the all-metal type and the fiber lock type. For the sake of simplicity, only three typical kinds of self-locking nuts are considered in this handbook: the Boots self-locking and the stainless steel self-locking nuts, representing the all-metal types; andthe elastic stop nut, representing the fiber insert type.</paragraph>
<subtitle-level-1><location><page_1><loc_12><loc_74><loc_28><loc_75></location>Boots Self-Locking Nut</subtitle-level-1>
<paragraph><location><page_1><loc_12><loc_64><loc_52><loc_73></location>nut is of one piece, all-metal The Boots self-locking construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.</paragraph>
<paragraph><location><page_1><loc_12><loc_64><loc_52><loc_73></location>The Boots self-locking nut is of one piece, all-metal construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.</paragraph>
<paragraph><location><page_1><loc_12><loc_52><loc_52><loc_61></location>The spring keeps the locking and load-carrying sections such a distance apart that the two sets of threads are out of phase or spaced so that a bolt, which has been screwed through the load-carrying section, must push the locking section outward against the force of the spring to engage the threads of the locking section properly.</paragraph>
<paragraph><location><page_1><loc_12><loc_38><loc_52><loc_50></location>The spring, through the medium of the locking section, exerts a constant locking force on the bolt in the same direction as a force that would tighten the nut. In this nut, the load-carrying section has the thread strength of a standard nut of comparable size, while the locking section presses against the threads of the bolt and locks the nut firmly in position. Only a wrench applied to the nut loosens it. The nut can be removed and reused without impairing its efficiency.</paragraph>
<paragraph><location><page_1><loc_12><loc_38><loc_52><loc_50></location>The spring, through the mediumofthe locking section, exerts a constant locking force on the bolt in the same direction as a force that would tighten the nut. In this nut, the load-carrying section has the thread strength of a standard nut of comparable size, while the locking section presses against the threads of the bolt and locks the nut firmly in position. Only a wrench applied to the nut loosens it. The nut can be removed and reused without impairing its efficiency.</paragraph>
<paragraph><location><page_1><loc_12><loc_33><loc_52><loc_36></location>Boots self-locking nuts are made with three different spring styles and in various shapes and sizes. The wing type that is</paragraph>
<figure>
<location><page_1><loc_12><loc_10><loc_52><loc_31></location>
<caption>Figure 7-26. Self-locking nuts.</caption>
</figure>
<paragraph><location><page_1><loc_54><loc_85><loc_94><loc_94></location>the most common ranges in size for No. 6 up to 1 4 inch, the / Rol-top ranges from 1 4 inch to / 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.</paragraph>
<paragraph><location><page_1><loc_54><loc_83><loc_55><loc_84></location>.</paragraph>
<paragraph><location><page_1><loc_54><loc_85><loc_94><loc_94></location>the most common ranges in size for No. 6 up to 1 / 4 inch, the Rol-top ranges from 1 / 4 inch to 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.</paragraph>
<paragraph><location><page_1><loc_54><loc_83><loc_54><loc_84></location>.</paragraph>
<subtitle-level-1><location><page_1><loc_54><loc_82><loc_76><loc_83></location>Stainless Steel Self-Locking Nut</subtitle-level-1>
<paragraph><location><page_1><loc_54><loc_54><loc_94><loc_81></location>The stainless steel self-locking nut may be spun on and off by hand as its locking action takes places only when the nut is seated against a solid surface and tightened. The nut consists of two parts: a case with a beveled locking shoulder and key and a thread insert with a locking shoulder and slotted keyway. Until the nut is tightened, it spins on the bolt easily, because the threaded insert is the proper size for the bolt. However, when the nut is seated against a solid surface and tightened, the locking shoulder of the insert is pulled downward and wedged against the locking shoulder of the case. This action compresses the threaded insert and causes it to clench the bolt tightly. The cross-sectional view in Figure 7-27 shows how the key of the case fits into the slotted keyway of the insert so that when the case is turned, the threaded insert is turned with it. Note that the slot is wider than the key. This permits the slot to be narrowed and the insert to be compressed when the nut is tightened.</paragraph>
<paragraph><location><page_1><loc_54><loc_54><loc_94><loc_81></location>The stainless steel self-locking nut may be spun on and off by hand as its locking action takes places only when the nut is seated against a solid surface and tightened. The nut consists of two parts: a case with a beveled locking shoulder and key and a thread insert with a locking shoulder and slotted keyway. Until the nut is tightened, it spins on the bolt easily, because the threaded insert is the proper size for the bolt. However, when the nut is seated against a solid surface and tightened, the locking shoulder of the insert is pulled downward and wedged against the locking shoulder of the case. This action compresses the threaded insert and causes it to clench the bolt tightly. The cross-sectional view in Figure 7-27 shows how the key of the case fits into the slotted keyway of the insert so that when the case is turned, the threaded insert is turned with it. Note that the slot is wider than the key. This permits the slot to be narrowed and the insert to be compressed when the nut is tightened.</paragraph>
<subtitle-level-1><location><page_1><loc_54><loc_51><loc_65><loc_52></location>Elastic Stop Nut</subtitle-level-1>
<paragraph><location><page_1><loc_54><loc_47><loc_94><loc_50></location>The elastic stop nut is a standard nut with the height increased to accommodate a fiber locking collar. This</paragraph>
<paragraph><location><page_1><loc_54><loc_47><loc_94><loc_50></location>The elastic stop nut is a standard nut with the height increased to accommodate a fiber locking collar. This</paragraph>
<figure>
<location><page_1><loc_54><loc_11><loc_94><loc_46></location>
<caption>Figure 7-27. Stainless steel self-locking nut.</caption>

File diff suppressed because one or more lines are too long

View File

@ -1,31 +1,31 @@
pulleys, provided the inner race of the bearing is clamped to the supporting structure by the nut and bolt. Plates must be attached to the structure in a positive manner to eliminate rotation or misalignment when tightening the bolts or screws.
The two general types of self-locking nuts currently in use are the all-metal type and the fiber lock type. For the sake of simplicity, only three typical kinds of self-locking nuts are considered in this handbook: the Boots self-locking and the stainless steel self-locking nuts, representing the all-metal types; and the elastic stop nut, representing the fiber insert type.
The two general types of self-locking nuts currently in use are the all-metal type and the fiber lock type. For the sake of simplicity, only three typical kinds of self-locking nuts are considered in this handbook: the Boots self-locking and the stainless steel self-locking nuts, representing the all-metal types; andthe elastic stop nut, representing the fiber insert type.
## Boots Self-Locking Nut
nut is of one piece, all-metal The Boots self-locking construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.
The Boots self-locking nut is of one piece, all-metal construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.
The spring keeps the locking and load-carrying sections such a distance apart that the two sets of threads are out of phase or spaced so that a bolt, which has been screwed through the load-carrying section, must push the locking section outward against the force of the spring to engage the threads of the locking section properly.
The spring, through the medium of the locking section, exerts a constant locking force on the bolt in the same direction as a force that would tighten the nut. In this nut, the load-carrying section has the thread strength of a standard nut of comparable size, while the locking section presses against the threads of the bolt and locks the nut firmly in position. Only a wrench applied to the nut loosens it. The nut can be removed and reused without impairing its efficiency.
The spring, through the mediumofthe locking section, exerts a constant locking force on the bolt in the same direction as a force that would tighten the nut. In this nut, the load-carrying section has the thread strength of a standard nut of comparable size, while the locking section presses against the threads of the bolt and locks the nut firmly in position. Only a wrench applied to the nut loosens it. The nut can be removed and reused without impairing its efficiency.
Boots self-locking nuts are made with three different spring styles and in various shapes and sizes. The wing type that is
Figure 7-26. Self-locking nuts.
<!-- image -->
the most common ranges in size for No. 6 up to 1 4 inch, the / Rol-top ranges from 1 4 inch to / 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.
the most common ranges in size for No. 6 up to 1 / 4 inch, the Rol-top ranges from 1 / 4 inch to 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.
.
## Stainless Steel Self-Locking Nut
The stainless steel self-locking nut may be spun on and off by hand as its locking action takes places only when the nut is seated against a solid surface and tightened. The nut consists of two parts: a case with a beveled locking shoulder and key and a thread insert with a locking shoulder and slotted keyway. Until the nut is tightened, it spins on the bolt easily, because the threaded insert is the proper size for the bolt. However, when the nut is seated against a solid surface and tightened, the locking shoulder of the insert is pulled downward and wedged against the locking shoulder of the case. This action compresses the threaded insert and causes it to clench the bolt tightly. The cross-sectional view in Figure 7-27 shows how the key of the case fits into the slotted keyway of the insert so that when the case is turned, the threaded insert is turned with it. Note that the slot is wider than the key. This permits the slot to be narrowed and the insert to be compressed when the nut is tightened.
The stainless steel self-locking nut may be spun on and off by hand as its locking action takes places only when the nut is seated against a solid surface and tightened. The nut consists of two parts: a case with a beveled locking shoulder and key and a thread insert with a locking shoulder and slotted keyway. Until the nut is tightened, it spins on the bolt easily, because the threaded insert is the proper size for the bolt. However, when the nut is seated against a solid surface and tightened, the locking shoulder of the insert is pulled downward and wedged against the locking shoulder of the case. This action compresses the threaded insert and causes it to clench the bolt tightly. The cross-sectional view in Figure 7-27 shows how the key of the case fits into the slotted keyway of the insert so that when the case is turned, the threaded insert is turned with it. Note that the slot is wider than the key. This permits the slot to be narrowed and the insert to be compressed when the nut is tightened.
## Elastic Stop Nut
The elastic stop nut is a standard nut with the height increased to accommodate a fiber locking collar. This
The elastic stop nut is a standard nut with the height increased to accommodate a fiber locking collar. This
Figure 7-27. Stainless steel self-locking nut.
<!-- image -->

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@ -3,7 +3,7 @@
<figure>
<location><page_1><loc_84><loc_93><loc_96><loc_97></location>
</figure>
<subtitle-level-1><location><page_1><loc_6><loc_79><loc_96><loc_89></location>Row and Column Access Control Support in IBM DB2 for i</subtitle-level-1>
<subtitle-level-1><location><page_1><loc_6><loc_79><loc_94><loc_89></location>RowandColumnAccessControl Support in IBM DB2 for i</subtitle-level-1>
<figure>
<location><page_1><loc_5><loc_11><loc_96><loc_63></location>
</figure>
@ -16,19 +16,19 @@
<figure>
<location><page_3><loc_23><loc_64><loc_29><loc_66></location>
</figure>
<subtitle-level-1><location><page_3><loc_24><loc_57><loc_31><loc_59></location>Highlights</subtitle-level-1>
<subtitle-level-1><location><page_3><loc_24><loc_57><loc_30><loc_59></location>Highlights</subtitle-level-1>
<paragraph><location><page_3><loc_24><loc_55><loc_40><loc_57></location>- /g115/g3 /g40/g81/g75/g68/g81/g70/g72/g3 /g87/g75/g72/g3 /g83/g72/g85/g73/g82/g85/g80/g68/g81/g70/g72/g3 /g82/g73/g3 /g92/g82/g88/g85/g3 /g71/g68/g87/g68/g69/g68/g86/g72/g3 /g82/g83/g72/g85/g68/g87/g76/g82/g81/g86</paragraph>
<paragraph><location><page_3><loc_24><loc_51><loc_42><loc_54></location>- /g115/g3 /g40/g68/g85/g81/g3 /g74/g85/g72/g68/g87/g72/g85/g3 /g85 /g72/g87/g88/g85/g81/g3 /g82/g81/g3 /g44/g55/g3 /g83/g85 /g82/g77/g72/g70/g87/g86/g3 /g87/g75/g85 /g82/g88/g74/g75/g3 /g80/g82/g71/g72/g85/g81/g76/g93/g68/g87/g76/g82/g81/g3 /g82/g73/g3 /g71/g68/g87/g68/g69/g68/g86/g72/g3 /g68/g81/g71/g3 /g68/g83/g83/g79/g76/g70/g68/g87/g76/g82/g81/g86</paragraph>
<paragraph><location><page_3><loc_24><loc_51><loc_42><loc_54></location>- /g115/g3 /g40/g68/g85/g81/g3 /g74/g85/g72/g68/g87/g72/g85/g3 /g85 /g72/g87/g88/g85/g81/g3 /g82/g81/g3 /g44/g55/g3 /g83/g85/g82/g77/g72/g70/g87/g86/g3 /g87/g75/g85 /g82/g88/g74/g75/g3 /g80/g82/g71/g72/g85/g81/g76/g93/g68/g87/g76/g82/g81/g3 /g82/g73/g3 /g71/g68/g87/g68/g69/g68/g86/g72/g3 /g68/g81/g71/g3 /g68/g83/g83/g79/g76/g70/g68/g87/g76/g82/g81/g86</paragraph>
<paragraph><location><page_3><loc_24><loc_48><loc_41><loc_50></location>- /g115/g3 /g53/g72/g79/g92/g3 /g82/g81/g3 /g44/g37/g48/g3 /g72/g91/g83/g72/g85/g87/g3 /g70/g82/g81/g86/g88/g79/g87/g76/g81/g74/g15/g3 /g86/g78/g76/g79/g79/g86/g3 /g86/g75/g68/g85/g76/g81/g74/g3 /g68/g81/g71/g3 /g85/g72/g81/g82/g90/g81/g3 /g86/g72/g85/g89/g76/g70/g72/g86</paragraph>
<paragraph><location><page_3><loc_24><loc_45><loc_38><loc_47></location>- /g115/g3 /g55 /g68/g78/g72/g3 /g68/g71/g89/g68/g81/g87/g68/g74/g72/g3 /g82/g73/g3 /g68/g70/g70/g72/g86/g86/g3 /g87/g82/g3 /g68/g3 /g90/g82/g85/g79/g71/g90/g76/g71/g72/g3 /g86/g82/g88/g85/g70/g72/g3 /g82/g73/g3 /g72/g91/g83/g72/g85/g87/g76/g86/g72</paragraph>
<figure>
<location><page_3><loc_10><loc_13><loc_42><loc_24></location>
</figure>
<paragraph><location><page_3><loc_75><loc_82><loc_83><loc_83></location>Power Services</paragraph>
<subtitle-level-1><location><page_3><loc_46><loc_65><loc_76><loc_71></location>DB2 for i Center of Excellence</subtitle-level-1>
<subtitle-level-1><location><page_3><loc_46><loc_65><loc_75><loc_71></location>DB2 for i Center of Excellence</subtitle-level-1>
<paragraph><location><page_3><loc_46><loc_64><loc_79><loc_65></location>Expert help to achieve your business requirements</paragraph>
<subtitle-level-1><location><page_3><loc_46><loc_59><loc_72><loc_60></location>We build confident, satisfied clients</subtitle-level-1>
<paragraph><location><page_3><loc_46><loc_56><loc_80><loc_59></location>No one else has the vast consulting experiences, skills sharing and renown service offerings to do what we can do for you.</paragraph>
<paragraph><location><page_3><loc_46><loc_56><loc_79><loc_59></location>No one else has the vast consulting experiences, skills sharing and renown service offerings to do what we can do for you.</paragraph>
<paragraph><location><page_3><loc_46><loc_54><loc_60><loc_55></location>Because no one else is IBM.</paragraph>
<paragraph><location><page_3><loc_46><loc_46><loc_82><loc_52></location>With combined experiences and direct access to development groups, we're the experts in IBM DB2® for i. The DB2 for i Center of Excellence (CoE) can help you achieve-perhaps reexamine and exceed-your business requirements and gain more confidence and satisfaction in IBM product data management products and solutions.</paragraph>
<subtitle-level-1><location><page_3><loc_46><loc_44><loc_71><loc_45></location>Who we are, some of what we do</subtitle-level-1>
@ -36,7 +36,7 @@
<paragraph><location><page_3><loc_46><loc_40><loc_66><loc_41></location>- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> Database performance and scalability</paragraph>
<paragraph><location><page_3><loc_46><loc_39><loc_69><loc_39></location>- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> Advanced SQL knowledge and skills transfer</paragraph>
<paragraph><location><page_3><loc_46><loc_37><loc_64><loc_38></location>- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> Business intelligence and analytics</paragraph>
<paragraph><location><page_3><loc_46><loc_36><loc_56><loc_37></location>- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> DB2 Web Query</paragraph>
<paragraph><location><page_3><loc_46><loc_36><loc_56><loc_37></location>- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> DB2Web Query</paragraph>
<paragraph><location><page_3><loc_46><loc_35><loc_82><loc_36></location>- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> Query/400 modernization for better reporting and analysis capabilities</paragraph>
<paragraph><location><page_3><loc_46><loc_33><loc_69><loc_34></location>- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> Database modernization and re-engineering</paragraph>
<paragraph><location><page_3><loc_46><loc_32><loc_65><loc_33></location>- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> Data-centric architecture and design</paragraph>
@ -49,7 +49,7 @@
<figure>
<location><page_4><loc_23><loc_36><loc_41><loc_53></location>
</figure>
<paragraph><location><page_4><loc_43><loc_35><loc_89><loc_53></location>Jim Bainbridge is a senior DB2 consultant on the DB2 for i Center of Excellence team in the IBM Lab Services and Training organization. His primary role is training and implementation services for IBM DB2 Web Query for i and business analytics. Jim began his career with IBM 30 years ago in the IBM Rochester Development Lab, where he developed cooperative processing products that paired IBM PCs with IBM S/36 and AS/.400 systems. In the years since, Jim has held numerous technical roles, including independent software vendors technical support on a broad range of IBM technologies and products, and supporting customers in the IBM Executive Briefing Center and IBM Project Office.</paragraph>
<paragraph><location><page_4><loc_43><loc_35><loc_88><loc_53></location>Jim Bainbridge is a senior DB2 consultant on the DB2 for i Center of Excellence team in the IBM Lab Services and Training organization. His primary role is training and implementation services for IBM DB2 Web Query for i and business analytics. Jim began his career with IBM 30 years ago in the IBM Rochester Development Lab, where he developed cooperative processing products that paired IBM PCs with IBM S/36 and AS/.400 systems. In the years since, Jim has held numerous technical roles, including independent software vendors technical support on a broad range of IBM technologies and products, and supporting customers in the IBM Executive Briefing Center and IBM Project Office.</paragraph>
<figure>
<location><page_4><loc_24><loc_20><loc_41><loc_33></location>
</figure>
@ -60,77 +60,81 @@
</figure>
<paragraph><location><page_5><loc_82><loc_84><loc_85><loc_88></location>1</paragraph>
<paragraph><location><page_5><loc_13><loc_65><loc_19><loc_66></location>Chapter 1.</paragraph>
<subtitle-level-1><location><page_5><loc_22><loc_61><loc_90><loc_68></location>Securing and protecting IBM DB2 data</subtitle-level-1>
<paragraph><location><page_5><loc_22><loc_46><loc_89><loc_56></location>Recent news headlines are filled with reports of data breaches and cyber-attacks impacting global businesses of all sizes. The Identity Theft Resource Center 1 reports that almost 5000 data breaches have occurred since 2005, exposing over 600 million records of data. The financial cost of these data breaches is skyrocketing. Studies from the Ponemon Institute 2 revealed that the average cost of a data breach increased in 2013 by 15% globally and resulted in a brand equity loss of $9.4 million per attack. The average cost that is incurred for each lost record containing sensitive information increased more than 9% to $145 per record.</paragraph>
<subtitle-level-1><location><page_5><loc_22><loc_61><loc_89><loc_68></location>Securing and protecting IBM DB2 data</subtitle-level-1>
<paragraph><location><page_5><loc_22><loc_46><loc_89><loc_56></location>Recent news headlines are filled with reports of data breaches and cyber-attacks impacting global businesses of all sizes. The Identity Theft Resource Center 1 reports that almost 5000 data breaches have occurred since 2005, exposing over 600 million records of data. The financial cost of these data breaches is skyrocketing. Studies from the Ponemon Institute 2 revealed that the average cost of a data breach increased in 2013 by 15% globally and resulted in a brand equity loss of $9.4 million per attack. The average cost that is incurred for each lost record containing sensitive information increased more than 9% to $145 per record.</paragraph>
<paragraph><location><page_5><loc_22><loc_38><loc_86><loc_44></location>Businesses must make a serious effort to secure their data and recognize that securing information assets is a cost of doing business. In many parts of the world and in many industries, securing the data is required by law and subject to audits. Data security is no longer an option; it is a requirement.</paragraph>
<paragraph><location><page_5><loc_22><loc_34><loc_89><loc_37></location>This chapter describes how you can secure and protect data in DB2 for i. The following topics are covered in this chapter:</paragraph>
<paragraph><location><page_5><loc_22><loc_32><loc_41><loc_33></location>- /SM590000 Security fundamentals</paragraph>
<paragraph><location><page_5><loc_22><loc_30><loc_46><loc_32></location>- /SM590000 Current state of IBM i security</paragraph>
<paragraph><location><page_5><loc_22><loc_29><loc_43><loc_30></location>- /SM590000 DB2 for i security controls</paragraph>
<subtitle-level-1><location><page_6><loc_11><loc_89><loc_44><loc_91></location>1.1 Security fundamentals</subtitle-level-1>
<subtitle-level-1><location><page_6><loc_11><loc_89><loc_44><loc_91></location>1.1 Security fundamentals</subtitle-level-1>
<paragraph><location><page_6><loc_22><loc_84><loc_89><loc_87></location>Before reviewing database security techniques, there are two fundamental steps in securing information assets that must be described:</paragraph>
<paragraph><location><page_6><loc_22><loc_77><loc_89><loc_83></location>- /SM590000 First, and most important, is the definition of a company's security policy . Without a security policy, there is no definition of what are acceptable practices for using, accessing, and storing information by who, what, when, where, and how. A security policy should minimally address three things: confidentiality, integrity, and availability.</paragraph>
<paragraph><location><page_6><loc_25><loc_66><loc_89><loc_76></location>- The monitoring and assessment of adherence to the security policy determines whether your security strategy is working. Often, IBM security consultants are asked to perform security assessments for companies without regard to the security policy. Although these assessments can be useful for observing how the system is defined and how data is being accessed, they cannot determine the level of security without a security policy. Without a security policy, it really is not an assessment as much as it is a baseline for monitoring the changes in the security settings that are captured.</paragraph>
<paragraph><location><page_6><loc_25><loc_64><loc_89><loc_65></location>A security policy is what defines whether the system and its settings are secure (or not).</paragraph>
<paragraph><location><page_6><loc_25><loc_64><loc_88><loc_65></location>A security policy is what defines whether the system and its settings are secure (or not).</paragraph>
<paragraph><location><page_6><loc_22><loc_53><loc_89><loc_63></location>- /SM590000 The second fundamental in securing data assets is the use of resource security . If implemented properly, resource security prevents data breaches from both internal and external intrusions. Resource security controls are closely tied to the part of the security policy that defines who should have access to what information resources. A hacker might be good enough to get through your company firewalls and sift his way through to your system, but if they do not have explicit access to your database, the hacker cannot compromise your information assets.</paragraph>
<paragraph><location><page_6><loc_22><loc_48><loc_87><loc_51></location>With your eyes now open to the importance of securing information assets, the rest of this chapter reviews the methods that are available for securing database resources on IBM i.</paragraph>
<subtitle-level-1><location><page_6><loc_11><loc_43><loc_53><loc_45></location>1.2 Current state of IBM i security</subtitle-level-1>
<subtitle-level-1><location><page_6><loc_11><loc_43><loc_53><loc_45></location>1.2 Current state of IBM i security</subtitle-level-1>
<paragraph><location><page_6><loc_22><loc_35><loc_89><loc_41></location>Because of the inherently secure nature of IBM i, many clients rely on the default system settings to protect their business data that is stored in DB2 for i. In most cases, this means no data protection because the default setting for the Create default public authority (QCRTAUT) system value is *CHANGE.</paragraph>
<paragraph><location><page_6><loc_22><loc_26><loc_89><loc_33></location>Even more disturbing is that many IBM i clients remain in this state, despite the news headlines and the significant costs that are involved with databases being compromised. This default security configuration makes it quite challenging to implement basic security policies. A tighter implementation is required if you really want to protect one of your company's most valuable assets, which is the data.</paragraph>
<paragraph><location><page_6><loc_22><loc_14><loc_89><loc_24></location>Traditionally, IBM i applications have employed menu-based security to counteract this default configuration that gives all users access to the data. The theory is that data is protected by the menu options controlling what database operations that the user can perform. This approach is ineffective, even if the user profile is restricted from running interactive commands. The reason is that in today's connected world there are a multitude of interfaces into the system, from web browsers to PC clients, that bypass application menus. If there are no object-level controls, users of these newer interfaces have an open door to your data.</paragraph>
<paragraph><location><page_7><loc_22><loc_81><loc_89><loc_91></location>Many businesses are trying to limit data access to a need-to-know basis. This security goal means that users should be given access only to the minimum set of data that is required to perform their job. Often, users with object-level access are given access to row and column values that are beyond what their business task requires because that object-level security provides an all-or-nothing solution. For example, object-level controls allow a manager to access data about all employees. Most security policies limit a manager to accessing data only for the employees that they manage.</paragraph>
<subtitle-level-1><location><page_7><loc_11><loc_77><loc_49><loc_78></location>1.3.1 Existing row and column control</subtitle-level-1>
<paragraph><location><page_7><loc_22><loc_81><loc_88><loc_91></location>Many businesses are trying to limit data access to a need-to-know basis. This security goal means that users should be given access only to the minimum set of data that is required to perform their job. Often, users with object-level access are given access to row and column values that are beyond what their business task requires because that object-level security provides an all-or-nothing solution. For example, object-level controls allow a manager to access data about all employees. Most security policies limit a manager to accessing data only for the employees that they manage.</paragraph>
<subtitle-level-1><location><page_7><loc_11><loc_77><loc_49><loc_78></location>1.3.1 Existing row and column control</subtitle-level-1>
<paragraph><location><page_7><loc_22><loc_68><loc_88><loc_75></location>Some IBM i clients have tried augmenting the all-or-nothing object-level security with SQL views (or logical files) and application logic, as shown in Figure 1-2. However, application-based logic is easy to bypass with all of the different data access interfaces that are provided by the IBM i operating system, such as Open Database Connectivity (ODBC) and System i Navigator.</paragraph>
<paragraph><location><page_7><loc_22><loc_60><loc_89><loc_66></location>Using SQL views to limit access to a subset of the data in a table also has its own set of challenges. First, there is the complexity of managing all of the SQL view objects that are used for securing data access. Second, scaling a view-based security solution can be difficult as the amount of data grows and the number of users increases.</paragraph>
<paragraph><location><page_7><loc_22><loc_54><loc_89><loc_59></location>Even if you are willing to live with these performance and management issues, a user with *ALLOBJ access still can directly access all of the data in the underlying DB2 table and easily bypass the security controls that are built into an SQL view.</paragraph>
<figure>
<location><page_7><loc_22><loc_13><loc_89><loc_53></location>
<caption>Figure 1-2 Existing row and column controls</caption>
<caption>Figure 1-2 Existing row and column controls</caption>
</figure>
<subtitle-level-1><location><page_8><loc_11><loc_89><loc_55><loc_91></location>2.1.6 Change Function Usage CL command</subtitle-level-1>
<subtitle-level-1><location><page_8><loc_11><loc_89><loc_55><loc_91></location>2.1.6 Change Function Usage CL command</subtitle-level-1>
<paragraph><location><page_8><loc_22><loc_87><loc_89><loc_88></location>The following CL commands can be used to work with, display, or change function usage IDs:</paragraph>
<paragraph><location><page_8><loc_22><loc_84><loc_49><loc_86></location>- /SM590000 Work Function Usage ( WRKFCNUSG )</paragraph>
<paragraph><location><page_8><loc_22><loc_83><loc_51><loc_84></location>- /SM590000 Change Function Usage ( CHGFCNUSG )</paragraph>
<paragraph><location><page_8><loc_22><loc_81><loc_51><loc_83></location>- /SM590000 Display Function Usage ( DSPFCNUSG )</paragraph>
<paragraph><location><page_8><loc_22><loc_77><loc_84><loc_80></location>For example, the following CHGFCNUSG command shows granting authorization to user HBEDOYA to administer and manage RCAC rules:</paragraph>
<paragraph><location><page_8><loc_22><loc_77><loc_83><loc_80></location>For example, the following CHGFCNUSG command shows granting authorization to user HBEDOYA to administer and manage RCAC rules:</paragraph>
<paragraph><location><page_8><loc_22><loc_75><loc_72><loc_76></location>CHGFCNUSG FCNID(QIBM_DB_SECADM) USER(HBEDOYA) USAGE(*ALLOWED)</paragraph>
<subtitle-level-1><location><page_8><loc_11><loc_71><loc_89><loc_72></location>2.1.7 Verifying function usage IDs for RCAC with the FUNCTION_USAGE view</subtitle-level-1>
<paragraph><location><page_8><loc_22><loc_66><loc_85><loc_69></location>The FUNCTION_USAGE view contains function usage configuration details. Table 2-1 describes the columns in the FUNCTION_USAGE view.</paragraph>
<subtitle-level-1><location><page_8><loc_11><loc_71><loc_89><loc_72></location>2.1.7 Verifying function usage IDs for RCAC with the FUNCTION_USAGE view</subtitle-level-1>
<paragraph><location><page_8><loc_22><loc_66><loc_84><loc_69></location>The FUNCTION_USAGE view contains function usage configuration details. Table 2-1 describes the columns in the FUNCTION_USAGE view.</paragraph>
<table>
<location><page_8><loc_22><loc_44><loc_89><loc_63></location>
<caption>Table 2-1 FUNCTION_USAGE view</caption>
<caption>Table 2-1 FUNCTION_USAGE view</caption>
<row_0><col_0><col_header>Column name</col_0><col_1><col_header>Data type</col_1><col_2><col_header>Description</col_2></row_0>
<row_1><col_0><body>FUNCTION_ID</col_0><col_1><body>VARCHAR(30)</col_1><col_2><body>ID of the function.</col_2></row_1>
<row_2><col_0><body>USER_NAME</col_0><col_1><body>VARCHAR(10)</col_1><col_2><body>Name of the user profile that has a usage setting for this function.</col_2></row_2>
<row_3><col_0><body>USAGE</col_0><col_1><body>VARCHAR(7)</col_1><col_2><body>Usage setting: /SM590000 ALLOWED: The user profile is allowed to use the function. /SM590000 DENIED: The user profile is not allowed to use the function.</col_2></row_3>
<row_4><col_0><body>USER_TYPE</col_0><col_1><body>VARCHAR(5)</col_1><col_2><body>Type of user profile: /SM590000 USER: The user profile is a user. /SM590000 GROUP: The user profile is a group.</col_2></row_4>
</table>
<caption><location><page_8><loc_22><loc_64><loc_46><loc_65></location>Table 2-1 FUNCTION_USAGE view</caption>
<paragraph><location><page_8><loc_22><loc_40><loc_89><loc_43></location>To discover who has authorization to define and manage RCAC, you can use the query that is shown in Example 2-1.</paragraph>
<caption><location><page_8><loc_22><loc_38><loc_76><loc_39></location>Example 2-1 Query to determine who has authority to define and manage RCAC</caption>
<paragraph><location><page_8><loc_22><loc_35><loc_41><loc_36></location>SELECT function_id,</paragraph>
<paragraph><location><page_8><loc_22><loc_34><loc_39><loc_35></location>user_name,</paragraph>
<paragraph><location><page_8><loc_22><loc_32><loc_36><loc_33></location>usage,</paragraph>
<paragraph><location><page_8><loc_22><loc_31><loc_39><loc_32></location>user_type</paragraph>
<paragraph><location><page_8><loc_22><loc_29><loc_43><loc_30></location>FROM function_usage</paragraph>
<paragraph><location><page_8><loc_22><loc_28><loc_54><loc_29></location>WHERE function_id='QIBM_DB_SECADM'</paragraph>
<paragraph><location><page_8><loc_22><loc_26><loc_39><loc_27></location>ORDER BY user_name;</paragraph>
<subtitle-level-1><location><page_8><loc_11><loc_20><loc_41><loc_22></location>2.2 Separation of duties</subtitle-level-1>
<caption><location><page_8><loc_22><loc_38><loc_76><loc_39></location>Example 2-1 Query to determine who has authority to define and manage RCAC</caption>
<paragraph><location><page_8><loc_22><loc_35><loc_27><loc_36></location>SELECT</paragraph>
<paragraph><location><page_8><loc_31><loc_35><loc_41><loc_36></location>function_id,</paragraph>
<paragraph><location><page_8><loc_31><loc_34><loc_39><loc_35></location>user_name,</paragraph>
<paragraph><location><page_8><loc_31><loc_32><loc_36><loc_33></location>usage,</paragraph>
<paragraph><location><page_8><loc_31><loc_31><loc_39><loc_32></location>user_type</paragraph>
<paragraph><location><page_8><loc_22><loc_29><loc_26><loc_30></location>FROM</paragraph>
<paragraph><location><page_8><loc_31><loc_29><loc_43><loc_30></location>function_usage</paragraph>
<paragraph><location><page_8><loc_22><loc_28><loc_26><loc_29></location>WHERE</paragraph>
<paragraph><location><page_8><loc_31><loc_28><loc_54><loc_29></location>function_id='QIBM_DB_SECADM'</paragraph>
<paragraph><location><page_8><loc_22><loc_26><loc_29><loc_27></location>ORDER BY</paragraph>
<paragraph><location><page_8><loc_31><loc_26><loc_39><loc_27></location>user_name;</paragraph>
<subtitle-level-1><location><page_8><loc_11><loc_20><loc_41><loc_22></location>2.2 Separation of duties</subtitle-level-1>
<paragraph><location><page_8><loc_22><loc_10><loc_89><loc_18></location>Separation of duties helps businesses comply with industry regulations or organizational requirements and simplifies the management of authorities. Separation of duties is commonly used to prevent fraudulent activities or errors by a single person. It provides the ability for administrative functions to be divided across individuals without overlapping responsibilities, so that one user does not possess unlimited authority, such as with the *ALLOBJ authority.</paragraph>
<paragraph><location><page_9><loc_22><loc_82><loc_89><loc_91></location>For example, assume that a business has assigned the duty to manage security on IBM i to Theresa. Before release IBM i 7.2, to grant privileges, Theresa had to have the same privileges Theresa was granting to others. Therefore, to grant *USE privileges to the PAYROLL table, Theresa had to have *OBJMGT and *USE authority (or a higher level of authority, such as *ALLOBJ). This requirement allowed Theresa to access the data in the PAYROLL table even though Theresa's job description was only to manage its security.</paragraph>
<paragraph><location><page_9><loc_22><loc_75><loc_89><loc_81></location>In IBM i 7.2, the QIBM_DB_SECADM function usage grants authorities, revokes authorities, changes ownership, or changes the primary group without giving access to the object or, in the case of a database table, to the data that is in the table or allowing other operations on the table.</paragraph>
<paragraph><location><page_9><loc_22><loc_71><loc_88><loc_73></location>QIBM_DB_SECADM function usage can be granted only by a user with *SECADM special authority and can be given to a user or a group.</paragraph>
<paragraph><location><page_9><loc_22><loc_65><loc_89><loc_69></location>QIBM_DB_SECADM also is responsible for administering RCAC, which restricts which rows a user is allowed to access in a table and whether a user is allowed to see information in certain columns of a table.</paragraph>
<paragraph><location><page_9><loc_22><loc_57><loc_88><loc_63></location>A preferred practice is that the RCAC administrator has the QIBM_DB_SECADM function usage ID, but absolutely no other data privileges. The result is that the RCAC administrator can deploy and maintain the RCAC constructs, but cannot grant themselves unauthorized access to data itself.</paragraph>
<paragraph><location><page_9><loc_22><loc_53><loc_89><loc_56></location>Table 2-2 shows a comparison of the different function usage IDs and *JOBCTL authority to the different CL commands and DB2 for i tools.</paragraph>
<table>
<location><page_9><loc_11><loc_9><loc_89><loc_50></location>
<caption>Table 2-2 Comparison of the different function usage IDs and *JOBCTL authority</caption>
<row_0><col_0><body>User action</col_0><col_1><col_header>*JOBCTL</col_1><col_2><col_header>QIBM_DB_SECADM</col_2><col_3><col_header>QIBM_DB_SQLADM</col_3><col_4><col_header>QIBM_DB_SYSMON</col_4><col_5><col_header>No Authority</col_5></row_0>
<row_1><col_0><row_header>SET CURRENT DEGREE (SQL statement)</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_1>
<row_2><col_0><row_header>CHGQRYA command targeting a different user's job</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_2>
<row_3><col_0><row_header>STRDBMON or ENDDBMON commands targeting a different user's job</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_3>
<row_4><col_0><row_header>STRDBMON or ENDDBMON commands targeting a job that matches the current user</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body>X</col_4><col_5><body>X</col_5></row_4>
<row_5><col_0><row_header>QUSRJOBI() API format 900 or System i Navigator's SQL Details for Job</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body>X</col_4><col_5><body></col_5></row_5>
<row_6><col_0><row_header>Visual Explain within Run SQL scripts</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body>X</col_4><col_5><body>X</col_5></row_6>
<row_7><col_0><row_header>Visual Explain outside of Run SQL scripts</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_7>
@ -140,24 +144,24 @@
<row_11><col_0><row_header>MODIFY PLAN CACHE PROPERTIES procedure (currently does not check authority)</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_11>
<row_12><col_0><row_header>CHANGE PLAN CACHE SIZE procedure (currently does not check authority)</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_12>
</table>
<caption><location><page_9><loc_11><loc_51><loc_64><loc_52></location>Table 2-2 Comparison of the different function usage IDs and *JOBCTL authority</caption>
<caption><location><page_10><loc_22><loc_88><loc_86><loc_91></location>The SQL CREATE PERMISSION statement that is shown in Figure 3-1 is used to define and initially enable or disable the row access rules.</caption>
<figure>
<location><page_10><loc_22><loc_48><loc_89><loc_86></location>
<caption>Figure 3-1 CREATE PERMISSION SQL statement</caption>
</figure>
<subtitle-level-1><location><page_10><loc_22><loc_43><loc_35><loc_44></location>Column mask</subtitle-level-1>
<paragraph><location><page_10><loc_22><loc_37><loc_89><loc_43></location>A column mask is a database object that manifests a column value access control rule for a specific column in a specific table. It uses a CASE expression that describes what you see when you access the column. For example, a teller can see only the last four digits of a tax identification number.</paragraph>
<caption><location><page_11><loc_22><loc_90><loc_67><loc_91></location>Table 3-1 summarizes these special registers and their values.</caption>
<table>
<location><page_11><loc_22><loc_74><loc_89><loc_87></location>
<caption>Table 3-1 Special registers and their corresponding values</caption>
<row_0><col_0><col_header>Special register</col_0><col_1><col_header>Corresponding value</col_1></row_0>
<row_1><col_0><body>USER or SESSION_USER</col_0><col_1><body>The effective user of the thread excluding adopted authority.</col_1></row_1>
<row_2><col_0><body>CURRENT_USER</col_0><col_1><body>The effective user of the thread including adopted authority. When no adopted authority is present, this has the same value as USER.</col_1></row_2>
<row_3><col_0><body>SYSTEM_USER</col_0><col_1><body>The authorization ID that initiated the connection.</col_1></row_3>
</table>
<caption><location><page_11><loc_22><loc_87><loc_61><loc_88></location>Table 3-1 Special registers and their corresponding values</caption>
<paragraph><location><page_11><loc_22><loc_70><loc_88><loc_73></location>Figure 3-5 shows the difference in the special register values when an adopted authority is used:</paragraph>
<paragraph><location><page_11><loc_22><loc_68><loc_67><loc_69></location>- /SM590000 A user connects to the server using the user profile ALICE.</paragraph>
<paragraph><location><page_11><loc_22><loc_66><loc_74><loc_67></location>- /SM590000 USER and CURRENT USER initially have the same value of ALICE.</paragraph>
@ -166,15 +170,15 @@
<paragraph><location><page_11><loc_22><loc_53><loc_89><loc_56></location>- /SM590000 When proc1 ends, the session reverts to its original state with both USER and CURRENT USER having the value of ALICE.</paragraph>
<figure>
<location><page_11><loc_22><loc_25><loc_49><loc_51></location>
<caption>Figure 3-5 Special registers and adopted authority</caption>
</figure>
<subtitle-level-1><location><page_11><loc_11><loc_20><loc_40><loc_21></location>3.2.2 Built-in global variables</subtitle-level-1>
<paragraph><location><page_11><loc_22><loc_15><loc_85><loc_18></location>Built-in global variables are provided with the database manager and are used in SQL statements to retrieve scalar values that are associated with the variables.</paragraph>
<paragraph><location><page_11><loc_22><loc_9><loc_87><loc_13></location>IBM DB2 for i supports nine different built-in global variables that are read only and maintained by the system. These global variables can be used to identify attributes of the database connection and used as part of the RCAC logic.</paragraph>
<paragraph><location><page_12><loc_22><loc_90><loc_56><loc_91></location>Table 3-2 lists the nine built-in global variables.</paragraph>
<table>
<location><page_12><loc_10><loc_63><loc_90><loc_87></location>
<caption>Table 3-2 Built-in global variables</caption>
<row_0><col_0><col_header>Global variable</col_0><col_1><col_header>Type</col_1><col_2><col_header>Description</col_2></row_0>
<row_1><col_0><body>CLIENT_HOST</col_0><col_1><body>VARCHAR(255)</col_1><col_2><body>Host name of the current client as returned by the system</col_2></row_1>
<row_2><col_0><body>CLIENT_IPADDR</col_0><col_1><body>VARCHAR(128)</col_1><col_2><body>IP address of the current client as returned by the system</col_2></row_2>
@ -186,63 +190,70 @@
<row_8><col_0><body>ROUTINE_SPECIFIC_NAME</col_0><col_1><body>VARCHAR(128)</col_1><col_2><body>Name of the currently running routine</col_2></row_8>
<row_9><col_0><body>ROUTINE_TYPE</col_0><col_1><body>CHAR(1)</col_1><col_2><body>Type of the currently running routine</col_2></row_9>
</table>
<caption><location><page_12><loc_11><loc_87><loc_33><loc_88></location>Table 3-2 Built-in global variables</caption>
<subtitle-level-1><location><page_12><loc_11><loc_57><loc_63><loc_59></location>3.3 VERIFY_GROUP_FOR_USER function</subtitle-level-1>
<paragraph><location><page_12><loc_22><loc_45><loc_89><loc_55></location>The VERIFY_GROUP_FOR_USER function was added in IBM i 7.2. Although it is primarily intended for use with RCAC permissions and masks, it can be used in other SQL statements. The first parameter must be one of these three special registers: SESSION_USER, USER, or CURRENT_USER. The second and subsequent parameters are a list of user or group profiles. Each of these values must be 1 - 10 characters in length. These values are not validated for their existence, which means that you can specify the names of user profiles that do not exist without receiving any kind of error.</paragraph>
<paragraph><location><page_12><loc_22><loc_39><loc_89><loc_43></location>If a special register value is in the list of user profiles or it is a member of a group profile included in the list, the function returns a long integer value of 1. Otherwise, it returns a value of 0. It never returns the null value.</paragraph>
<paragraph><location><page_12><loc_22><loc_36><loc_75><loc_38></location>Here is an example of using the VERIFY_GROUP_FOR_USER function:</paragraph>
<paragraph><location><page_12><loc_22><loc_34><loc_66><loc_35></location>- 1. There are user profiles for MGR, JANE, JUDY, and TONY.</paragraph>
<paragraph><location><page_12><loc_22><loc_32><loc_65><loc_33></location>- 2. The user profile JANE specifies a group profile of MGR.</paragraph>
<paragraph><location><page_12><loc_22><loc_28><loc_88><loc_31></location>- 3. If a user is connected to the server using user profile JANE, all of the following function invocations return a value of 1:</paragraph>
<paragraph><location><page_12><loc_25><loc_19><loc_74><loc_27></location>VERIFY_GROUP_FOR_USER (CURRENT_USER, 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR', 'STEVE') The following function invocation returns a value of 0: VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JUDY', 'TONY')</paragraph>
<paragraph><location><page_12><loc_25><loc_19><loc_67><loc_27></location>VERIFY_GROUP_FOR_USER (CURRENT_USER, 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR', The following function invocation returns a value of 0: VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JUDY', 'TONY')</paragraph>
<paragraph><location><page_12><loc_67><loc_23><loc_74><loc_24></location>'STEVE')</paragraph>
<paragraph><location><page_13><loc_22><loc_90><loc_27><loc_91></location>RETURN</paragraph>
<paragraph><location><page_13><loc_22><loc_88><loc_26><loc_89></location>CASE</paragraph>
<paragraph><location><page_13><loc_22><loc_67><loc_85><loc_88></location>WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR', 'EMP' ) = 1 THEN EMPLOYEES . DATE_OF_BIRTH WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER = EMPLOYEES . USER_ID THEN EMPLOYEES . DATE_OF_BIRTH WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER <> EMPLOYEES . USER_ID THEN ( 9999 || '-' || MONTH ( EMPLOYEES . DATE_OF_BIRTH ) || '-' || DAY (EMPLOYEES.DATE_OF_BIRTH )) ELSE NULL END ENABLE ;</paragraph>
<paragraph><location><page_13><loc_22><loc_63><loc_89><loc_65></location>- 2. The other column to mask in this example is the TAX_ID information. In this example, the rules to enforce include the following ones:</paragraph>
<paragraph><location><page_13><loc_25><loc_60><loc_77><loc_62></location>- -Human Resources can see the unmasked TAX_ID of the employees.</paragraph>
<paragraph><location><page_13><loc_25><loc_58><loc_66><loc_59></location>- -Employees can see only their own unmasked TAX_ID.</paragraph>
<paragraph><location><page_13><loc_25><loc_55><loc_89><loc_57></location>- -Managers see a masked version of TAX_ID with the first five characters replaced with the X character (for example, XXX-XX-1234).</paragraph>
<paragraph><location><page_13><loc_25><loc_52><loc_87><loc_54></location>- -Any other person sees the entire TAX_ID as masked, for example, XXX-XX-XXXX.</paragraph>
<paragraph><location><page_13><loc_25><loc_50><loc_87><loc_51></location>- To implement this column mask, run the SQL statement that is shown in Example 3-9.</paragraph>
<paragraph><location><page_13><loc_22><loc_14><loc_86><loc_46></location>CREATE MASK HR_SCHEMA.MASK_TAX_ID_ON_EMPLOYEES ON HR_SCHEMA.EMPLOYEES AS EMPLOYEES FOR COLUMN TAX_ID RETURN CASE WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR' ) = 1 THEN EMPLOYEES . TAX_ID WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER = EMPLOYEES . USER_ID THEN EMPLOYEES . TAX_ID WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER <> EMPLOYEES . USER_ID THEN ( 'XXX-XX-' CONCAT QSYS2 . SUBSTR ( EMPLOYEES . TAX_ID , 8 , 4 ) ) WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'EMP' ) = 1 THEN EMPLOYEES . TAX_ID ELSE 'XXX-XX-XXXX' END ENABLE ;</paragraph>
<caption><location><page_13><loc_22><loc_48><loc_58><loc_49></location>Example 3-9 Creating a mask on the TAX_ID column</caption>
<paragraph><location><page_14><loc_22><loc_90><loc_74><loc_91></location>- 3. Figure 3-10 shows the masks that are created in the HR_SCHEMA.</paragraph>
<figure>
<location><page_14><loc_10><loc_79><loc_89><loc_88></location>
<caption>Figure 3-10 Column masks shown in System i Navigator</caption>
</figure>
<subtitle-level-1><location><page_14><loc_11><loc_73><loc_33><loc_74></location>3.6.6 Activating RCAC</subtitle-level-1>
<paragraph><location><page_14><loc_22><loc_67><loc_89><loc_71></location>Now that you have created the row permission and the two column masks, RCAC must be activated. The row permission and the two column masks are enabled (last clause in the scripts), but now you must activate RCAC on the table. To do so, complete the following steps:</paragraph>
<paragraph><location><page_14><loc_22><loc_65><loc_67><loc_66></location>- 1. Run the SQL statements that are shown in Example 3-10.</paragraph>
<subtitle-level-1><location><page_14><loc_22><loc_62><loc_61><loc_63></location>Example 3-10 Activating RCAC on the EMPLOYEES table</subtitle-level-1>
<paragraph><location><page_14><loc_22><loc_60><loc_62><loc_61></location>- /* Active Row Access Control (permissions) */</paragraph>
<paragraph><location><page_14><loc_22><loc_58><loc_62><loc_59></location>- /* Active Column Access Control (masks) */</paragraph>
<paragraph><location><page_14><loc_22><loc_57><loc_48><loc_58></location>ALTER TABLE HR_SCHEMA.EMPLOYEES</paragraph>
<paragraph><location><page_14><loc_22><loc_55><loc_44><loc_56></location>ACTIVATE ROW ACCESS CONTROL</paragraph>
<subtitle-level-1><location><page_14><loc_22><loc_62><loc_61><loc_63></location>Example 3-10 Activating RCAC on the EMPLOYEES table</subtitle-level-1>
<paragraph><location><page_14><loc_22><loc_60><loc_58><loc_61></location>- /* Active Row Access Control (permissions)</paragraph>
<paragraph><location><page_14><loc_60><loc_60><loc_62><loc_61></location>*/</paragraph>
<paragraph><location><page_14><loc_22><loc_58><loc_56><loc_59></location>- /* Active Column Access Control (masks)</paragraph>
<paragraph><location><page_14><loc_22><loc_57><loc_26><loc_58></location>ALTER</paragraph>
<paragraph><location><page_14><loc_27><loc_57><loc_48><loc_58></location>TABLE HR_SCHEMA.EMPLOYEES</paragraph>
<paragraph><location><page_14><loc_22><loc_55><loc_32><loc_56></location>ACTIVATE ROW</paragraph>
<paragraph><location><page_14><loc_33><loc_55><loc_38><loc_56></location>ACCESS</paragraph>
<paragraph><location><page_14><loc_39><loc_55><loc_44><loc_56></location>CONTROL</paragraph>
<paragraph><location><page_14><loc_22><loc_54><loc_48><loc_55></location>ACTIVATE COLUMN ACCESS CONTROL;</paragraph>
<paragraph><location><page_14><loc_22><loc_48><loc_88><loc_52></location>- 2. Look at the definition of the EMPLOYEE table, as shown in Figure 3-11. To do this, from the main navigation pane of System i Navigator, click Schemas  HR_SCHEMA  Tables , right-click the EMPLOYEES table, and click Definition .</paragraph>
<figure>
<location><page_14><loc_10><loc_18><loc_87><loc_46></location>
<caption>Figure 3-11 Selecting the EMPLOYEES table from System i Navigator</caption>
</figure>
<paragraph><location><page_14><loc_60><loc_58><loc_62><loc_59></location>*/</paragraph>
<paragraph><location><page_15><loc_22><loc_87><loc_84><loc_91></location>- 2. Figure 4-68 shows the Visual Explain of the same SQL statement, but with RCAC enabled. It is clear that the implementation of the SQL statement is more complex because the row permission rule becomes part of the WHERE clause.</paragraph>
<paragraph><location><page_15><loc_22><loc_32><loc_89><loc_36></location>- 3. Compare the advised indexes that are provided by the Optimizer without RCAC and with RCAC enabled. Figure 4-69 shows the index advice for the SQL statement without RCAC enabled. The index being advised is for the ORDER BY clause.</paragraph>
<figure>
<location><page_15><loc_22><loc_40><loc_89><loc_85></location>
<caption>Figure 4-68 Visual Explain with RCAC enabled</caption>
</figure>
<figure>
<location><page_15><loc_11><loc_16><loc_83><loc_30></location>
<caption>Figure 4-69 Index advice with no RCAC</caption>
</figure>
<paragraph><location><page_16><loc_11><loc_11><loc_82><loc_91></location>THEN C . CUSTOMER_TAX_ID WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'TELLER' ) = 1 THEN ( 'XXX-XX-' CONCAT QSYS2 . SUBSTR ( C . CUSTOMER_TAX_ID , 8 , 4 ) ) WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_TAX_ID ELSE 'XXX-XX-XXXX' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_DRIVERS_LICENSE_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_DRIVERS_LICENSE_NUMBER RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'TELLER' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER ELSE '*************' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_LOGIN_ID_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_LOGIN_ID RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_LOGIN_ID WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_LOGIN_ID ELSE '*****' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_SECURITY_QUESTION_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_SECURITY_QUESTION RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION ELSE '*****' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_SECURITY_QUESTION_ANSWER_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_SECURITY_QUESTION_ANSWER RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION_ANSWER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION_ANSWER ELSE '*****' END ENABLE ; ALTER TABLE BANK_SCHEMA.CUSTOMERS ACTIVATE ROW ACCESS CONTROL ACTIVATE COLUMN ACCESS CONTROL ;</paragraph>
<paragraph><location><page_16><loc_11><loc_11><loc_80><loc_91></location>THEN C . CUSTOMER_TAX_ID WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'TELLER' ) = 1 THEN ( 'XXX-XX-' CONCAT QSYS2 . SUBSTR ( C . CUSTOMER_TAX_ID , 8 , 4 ) ) WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_TAX_ID ELSE 'XXX-XX-XXXX' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_DRIVERS_LICENSE_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_DRIVERS_LICENSE_NUMBER RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'TELLER' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER ELSE '*************' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_LOGIN_ID_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_LOGIN_ID RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_LOGIN_ID WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_LOGIN_ID ELSE '*****' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_SECURITY_QUESTION_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_SECURITY_QUESTION RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION ELSE '*****' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_SECURITY_QUESTION_ANSWER_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS FOR COLUMN CUSTOMER_SECURITY_QUESTION_ANSWER RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION_ANSWER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION_ANSWER ELSE '*****' END ENABLE ; ALTER TABLE BANK_SCHEMA.CUSTOMERS ACTIVATE ROW ACCESS CONTROL ACTIVATE COLUMN ACCESS CONTROL ;</paragraph>
<paragraph><location><page_16><loc_80><loc_29><loc_81><loc_30></location>C</paragraph>
<paragraph><location><page_18><loc_47><loc_94><loc_68><loc_96></location>Back cover</paragraph>
<subtitle-level-1><location><page_18><loc_4><loc_82><loc_73><loc_91></location>Row and Column Access Control Support in IBM DB2 for i</subtitle-level-1>
<paragraph><location><page_18><loc_4><loc_66><loc_21><loc_69></location>Implement roles and separation of duties</paragraph>
<paragraph><location><page_18><loc_4><loc_59><loc_20><loc_64></location>Leverage row permissions on the database</paragraph>
<paragraph><location><page_18><loc_25><loc_59><loc_68><loc_69></location>This IBM Redpaper publication provides information about the IBM i 7.2 feature of IBM DB2 for i Row and Column Access Control (RCAC). It offers a broad description of the function and advantages of controlling access to data in a comprehensive and transparent way. This publication helps you understand the capabilities of RCAC and provides examples of defining, creating, and implementing the row permissions and column masks in a relational database environment.</paragraph>
<paragraph><location><page_18><loc_4><loc_52><loc_20><loc_57></location>Protect columns by defining column masks</paragraph>
<paragraph><location><page_18><loc_25><loc_51><loc_68><loc_58></location>This paper is intended for database engineers, data-centric application developers, and security officers who want to design and implement RCAC as a part of their data control and governance policy. A solid background in IBM i object level security, DB2 for i relational database concepts, and SQL is assumed.</paragraph>
<subtitle-level-1><location><page_18><loc_4><loc_82><loc_72><loc_91></location>RowandColumnAccessControl Support in IBM DB2 for i</subtitle-level-1>
<figure>
<location><page_18><loc_79><loc_93><loc_93><loc_97></location>
</figure>

File diff suppressed because one or more lines are too long
@ -2,7 +2,7 @@ Front cover
<!-- image -->
## Row and Column Access Control Support in IBM DB2 for i
## RowandColumnAccessControl Support in IBM DB2 for i
<!-- image -->
@ -20,7 +20,7 @@ Solution Brief IBM Systems Lab Services and Training
- /g115/g3 /g40/g81/g75/g68/g81/g70/g72/g3 /g87/g75/g72/g3 /g83/g72/g85/g73/g82/g85/g80/g68/g81/g70/g72/g3 /g82/g73/g3 /g92/g82/g88/g85/g3 /g71/g68/g87/g68/g69/g68/g86/g72/g3 /g82/g83/g72/g85/g68/g87/g76/g82/g81/g86
- /g115/g3 /g40/g68/g85/g81/g3 /g74/g85/g72/g68/g87/g72/g85/g3 /g85 /g72/g87/g88/g85/g81/g3 /g82/g81/g3 /g44/g55/g3 /g83/g85 /g82/g77/g72/g70/g87/g86/g3 /g87/g75/g85 /g82/g88/g74/g75/g3 /g80/g82/g71/g72/g85/g81/g76/g93/g68/g87/g76/g82/g81/g3 /g82/g73/g3 /g71/g68/g87/g68/g69/g68/g86/g72/g3 /g68/g81/g71/g3 /g68/g83/g83/g79/g76/g70/g68/g87/g76/g82/g81/g86
- /g115/g3 /g40/g68/g85/g81/g3 /g74/g85/g72/g68/g87/g72/g85/g3 /g85 /g72/g87/g88/g85/g81/g3 /g82/g81/g3 /g44/g55/g3 /g83/g85/g82/g77/g72/g70/g87/g86/g3 /g87/g75/g85 /g82/g88/g74/g75/g3 /g80/g82/g71/g72/g85/g81/g76/g93/g68/g87/g76/g82/g81/g3 /g82/g73/g3 /g71/g68/g87/g68/g69/g68/g86/g72/g3 /g68/g81/g71/g3 /g68/g83/g83/g79/g76/g70/g68/g87/g76/g82/g81/g86
- /g115/g3 /g53/g72/g79/g92/g3 /g82/g81/g3 /g44/g37/g48/g3 /g72/g91/g83/g72/g85/g87/g3 /g70/g82/g81/g86/g88/g79/g87/g76/g81/g74/g15/g3 /g86/g78/g76/g79/g79/g86/g3 /g86/g75/g68/g85/g76/g81/g74/g3 /g68/g81/g71/g3 /g85/g72/g81/g82/g90/g81/g3 /g86/g72/g85/g89/g76/g70/g72/g86
@ -52,7 +52,7 @@ Global CoE engagements cover topics including:
- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> Business intelligence and analytics
- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> DB2 Web Query
- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> DB2Web Query
- rglyph<c=1,font=/NKDKKL+JansonTextLTStd-Roman> Query/400 modernization for better reporting and analysis capabilities
@ -90,7 +90,7 @@ Chapter 1.
## Securing and protecting IBM DB2 data
Recent news headlines are filled with reports of data breaches and cyber-attacks impacting global businesses of all sizes. The Identity Theft Resource Center 1 reports that almost 5000 data breaches have occurred since 2005, exposing over 600 million records of data. The financial cost of these data breaches is skyrocketing. Studies from the Ponemon Institute 2 revealed that the average cost of a data breach increased in 2013 by 15% globally and resulted in a brand equity loss of $9.4 million per attack. The average cost that is incurred for each lost record containing sensitive information increased more than 9% to $145 per record.
Businesses must make a serious effort to secure their data and recognize that securing information assets is a cost of doing business. In many parts of the world and in many industries, securing the data is required by law and subject to audits. Data security is no longer an option; it is a requirement.
@ -102,7 +102,7 @@ This chapter describes how you can secure and protect data in DB2 for i. The fol
- /SM590000 DB2 for i security controls
## 1.1 Security fundamentals
Before reviewing database security techniques, there are two fundamental steps in securing information assets that must be described:
@ -116,7 +116,7 @@ A security policy is what defines whether the system and its settings are secure
With your eyes now open to the importance of securing information assets, the rest of this chapter reviews the methods that are available for securing database resources on IBM i.
## 1.2 Current state of IBM i security
Because of the inherently secure nature of IBM i, many clients rely on the default system settings to protect their business data that is stored in DB2 for i. In most cases, this means no data protection because the default setting for the Create default public authority (QCRTAUT) system value is *CHANGE.
@ -126,7 +126,7 @@ Traditionally, IBM i applications have employed menu-based security to counterac
Many businesses are trying to limit data access to a need-to-know basis. This security goal means that users should be given access only to the minimum set of data that is required to perform their job. Often, users with object-level access are given access to row and column values that are beyond what their business task requires because that object-level security provides an all-or-nothing solution. For example, object-level controls allow a manager to access data about all employees. Most security policies limit a manager to accessing data only for the employees that they manage.
## 1.3.1 Existing row and column control
Some IBM i clients have tried augmenting the all-or-nothing object-level security with SQL views (or logical files) and application logic, as shown in Figure 1-2. However, application-based logic is easy to bypass with all of the different data access interfaces that are provided by the IBM i operating system, such as Open Database Connectivity (ODBC) and System i Navigator.
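To make the pre-RCAC approach concrete, a view-based row filter of the kind described here typically looks like the sketch below. HR_SCHEMA.EMPLOYEES is the example table used later in this diff; the view name and the MANAGER_USER_ID column are purely illustrative and not taken from the paper.

```sql
-- Illustrative only: the classic view-based row filter described in 1.3.1.
-- MANAGER_USER_ID is a hypothetical column identifying the row's manager.
CREATE VIEW HR_SCHEMA.MY_EMPLOYEES AS
    SELECT *
      FROM HR_SCHEMA.EMPLOYEES
     WHERE MANAGER_USER_ID = USER;
```

As the surrounding text notes, a user holding *ALLOBJ can still read the base table directly, which is exactly the gap that RCAC closes.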
@ -134,10 +134,10 @@ Using SQL views to limit access to a subset of the data in a table also has its
Even if you are willing to live with these performance and management issues, a user with *ALLOBJ access still can directly access all of the data in the underlying DB2 table and easily bypass the security controls that are built into an SQL view.
Figure 1-2 Existing row and column controls
<!-- image -->
## 2.1.6 Change Function Usage CL command
The following CL commands can be used to work with, display, or change function usage IDs:
@ -151,24 +151,26 @@ For example, the following CHGFCNUSG command shows granting authorization to use
CHGFCNUSG FCNID(QIBM_DB_SECADM) USER(HBEDOYA) USAGE(*ALLOWED)
## 2.1.7 Verifying function usage IDs for RCAC with the FUNCTION_USAGE view
The FUNCTION_USAGE view contains function usage configuration details. Table 2-1 describes the columns in the FUNCTION_USAGE view.
Table 2-1 FUNCTION_USAGE view
| Column name | Data type | Description |
|---------------|-------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| FUNCTION_ID | VARCHAR(30) | ID of the function. |
| USER_NAME | VARCHAR(10) | Name of the user profile that has a usage setting for this function. |
| USAGE | VARCHAR(7) | Usage setting: /SM590000 ALLOWED: The user profile is allowed to use the function. /SM590000 DENIED: The user profile is not allowed to use the function. |
| USER_TYPE | VARCHAR(5) | Type of user profile: /SM590000 USER: The user profile is a user. /SM590000 GROUP: The user profile is a group. |
To discover who has authorization to define and manage RCAC, you can use the query that is shown in Example 2-1.
Example 2-1 Query to determine who has authority to define and manage RCAC
SELECT function_id,
SELECT
function_id,
user_name,
@ -176,13 +178,19 @@ usage,
user_type
FROM function_usage
FROM
WHERE function_id='QIBM_DB_SECADM'
function_usage
ORDER BY user_name;
WHERE
## 2.2 Separation of duties
function_id='QIBM_DB_SECADM'
ORDER BY
user_name;
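Because the diff above reflows Example 2-1 across several fragments, a small variation is sketched below in one piece: it lists the settings for every database-related function usage ID named in this chapter (QIBM_DB_SECADM, QIBM_DB_SQLADM, QIBM_DB_SYSMON), using only the FUNCTION_USAGE columns documented in Table 2-1.

```sql
-- Sketch: broaden Example 2-1 from QIBM_DB_SECADM to every QIBM_DB_* function
-- usage ID, using only the columns described in Table 2-1.
SELECT function_id,
       user_name,
       usage,
       user_type
  FROM function_usage
 WHERE function_id LIKE 'QIBM_DB_%'
 ORDER BY function_id, user_name;
```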
## 2.2 Separation of duties
Separation of duties helps businesses comply with industry regulations or organizational requirements and simplifies the management of authorities. Separation of duties is commonly used to prevent fraudulent activities or errors by a single person. It provides the ability for administrative functions to be divided across individuals without overlapping responsibilities, so that one user does not possess unlimited authority, such as with the *ALLOBJ authority.
@ -198,26 +206,26 @@ A preferred practice is that the RCAC administrator has the QIBM_DB_SECADM funct
Table 2-2 shows a comparison of the different function usage IDs and *JOBCTL authority to the different CL commands and DB2 for i tools.
Table 2-2 Comparison of the different function usage IDs and *JOBCTL authority
| User action | *JOBCTL | QIBM_DB_SECADM | QIBM_DB_SQLADM | QIBM_DB_SYSMON | No Authority |
|--------------------------------------------------------------------------------|-----------|------------------|------------------|------------------|----------------|
| SET CURRENT DEGREE (SQL statement) | X | | X | | |
| CHGQRYA command targeting a different user's job | X | | X | | |
| STRDBMON or ENDDBMON commands targeting a different user's job | X | | X | | |
| STRDBMON or ENDDBMON commands targeting a job that matches the current user | X | | X | X | X |
| QUSRJOBI() API format 900 or System i Navigator's SQL Details for Job | X | | X | X | |
| Visual Explain within Run SQL scripts | X | | X | X | X |
| Visual Explain outside of Run SQL scripts | X | | X | | |
| ANALYZE PLAN CACHE procedure | X | | X | | |
| DUMP PLAN CACHE procedure | X | | X | | |
| MODIFY PLAN CACHE procedure | X | | X | | |
| MODIFY PLAN CACHE PROPERTIES procedure (currently does not check authority) | X | | X | | |
| CHANGE PLAN CACHE SIZE procedure (currently does not check authority) | X | | X | | |
The SQL CREATE PERMISSION statement that is shown in Figure 3-1 is used to define and initially enable or disable the row access rules.
Figure 3-1 CREATE PERMISSION SQL statement
<!-- image -->
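Figure 3-1 itself is an image, so as a rough textual sketch only, a row permission generally takes the form below. The permission name and predicate are illustrative and are not copied from the figure; the USER_ID column follows the later HR_SCHEMA examples.

```sql
-- Illustrative CREATE PERMISSION skeleton (not reproduced from Figure 3-1).
-- HR sees every row; other users see only their own row.
CREATE PERMISSION HR_SCHEMA.PERMISSION1_ON_EMPLOYEES
    ON HR_SCHEMA.EMPLOYEES AS EMPLOYEES
    FOR ROWS
    WHERE VERIFY_GROUP_FOR_USER(SESSION_USER, 'HR') = 1
       OR SESSION_USER = EMPLOYEES.USER_ID
    ENFORCED FOR ALL ACCESS
    ENABLE;
```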
## Column mask
@ -226,13 +234,13 @@ A column mask is a database object that manifests a column value access control
Table 3-1 summarizes these special registers and their values.
Table 3-1 Special registers and their corresponding values
| Special register | Corresponding value |
|----------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| USER or SESSION_USER | The effective user of the thread excluding adopted authority. |
| CURRENT_USER | The effective user of the thread including adopted authority. When no adopted authority is present, this has the same value as USER. |
| SYSTEM_USER | The authorization ID that initiated the connection. |
Figure 3-5 shows the difference in the special register values when an adopted authority is used:
@ -246,10 +254,10 @@ Figure 3-5 shows the difference in the special register values when an adopted a
- /SM590000 When proc1 ends, the session reverts to its original state with both USER and CURRENT USER having the value of ALICE.
Figure 3-5 Special registers and adopted authority
<!-- image -->
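A quick way to observe the three registers from Table 3-1 is a one-row query such as the sketch below; SYSIBM.SYSDUMMY1 is the standard one-row table, and the register spellings follow Table 3-1.

```sql
-- Inspect the special registers discussed above from any SQL interface.
SELECT USER, CURRENT_USER, SYSTEM_USER
  FROM SYSIBM.SYSDUMMY1;
```

Run inside a program that adopts authority, CURRENT_USER should differ from USER, as the Figure 3-5 walkthrough illustrates.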
## 3.2.2 Built-in global variables
Built-in global variables are provided with the database manager and are used in SQL statements to retrieve scalar values that are associated with the variables.
@ -257,7 +265,7 @@ IBM DB2 for i supports nine different built-in global variables that are read on
Table 3-2 lists the nine built-in global variables.
Table 3-2 Built-in global variables
| Global variable | Type | Description |
|-----------------------|--------------|----------------------------------------------------------------|
@ -271,7 +279,7 @@ Table 3-2 Built-in global variables
| ROUTINE_SPECIFIC_NAME | VARCHAR(128) | Name of the currently running routine |
| ROUTINE_TYPE | CHAR(1) | Type of the currently running routine |
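The built-in global variables can be read like columns. The sketch below pulls two of the connection-related variables listed in Table 3-2; it assumes the default SQL path resolves their schema, and the same references can appear inside a permission or mask predicate as part of the RCAC logic.

```sql
-- Sketch: read two of the built-in global variables from Table 3-2.
SELECT CLIENT_HOST, CLIENT_IPADDR
  FROM SYSIBM.SYSDUMMY1;
```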
## 3.3 VERIFY_GROUP_FOR_USER function
The VERIFY_GROUP_FOR_USER function was added in IBM i 7.2. Although it is primarily intended for use with RCAC permissions and masks, it can be used in other SQL statements. The first parameter must be one of these three special registers: SESSION_USER, USER, or CURRENT_USER. The second and subsequent parameters are a list of user or group profiles. Each of these values must be 1 - 10 characters in length. These values are not validated for their existence, which means that you can specify the names of user profiles that do not exist without receiving any kind of error.
@ -285,13 +293,15 @@ Here is an example of using the VERIFY_GROUP_FOR_USER function:
- 3. If a user is connected to the server using user profile JANE, all of the following function invocations return a value of 1:
VERIFY_GROUP_FOR_USER (CURRENT_USER, 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR', 'STEVE') The following function invocation returns a value of 0: VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JUDY', 'TONY')
VERIFY_GROUP_FOR_USER (CURRENT_USER, 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR', The following function invocation returns a value of 0: VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JUDY', 'TONY')
'STEVE')
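Because the function can also be used outside of RCAC rules, a standalone use looks like the sketch below; the MGR group profile follows the example above, and the query itself is hypothetical.

```sql
-- Hypothetical standalone use of VERIFY_GROUP_FOR_USER (group name from the
-- MGR/JANE example above): classify the caller in an ordinary query.
SELECT CASE
         WHEN VERIFY_GROUP_FOR_USER(SESSION_USER, 'MGR') = 1 THEN 'manager'
         ELSE 'non-manager'
       END AS caller_class
  FROM SYSIBM.SYSDUMMY1;
```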
RETURN
CASE
WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR', 'EMP' ) = 1 THEN EMPLOYEES . DATE_OF_BIRTH WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER = EMPLOYEES . USER_ID THEN EMPLOYEES . DATE_OF_BIRTH WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER <> EMPLOYEES . USER_ID THEN ( 9999 || '-' || MONTH ( EMPLOYEES . DATE_OF_BIRTH ) || '-' || DAY (EMPLOYEES.DATE_OF_BIRTH )) ELSE NULL END ENABLE ;
- 2. The other column to mask in this example is the TAX_ID information. In this example, the rules to enforce include the following ones:
@ -305,53 +315,65 @@ WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR', 'EMP' ) = 1 THEN EMPLOYEES . D
- To implement this column mask, run the SQL statement that is shown in Example 3-9.
CREATE MASK HR_SCHEMA.MASK_TAX_ID_ON_EMPLOYEES ON HR_SCHEMA.EMPLOYEES AS EMPLOYEES FOR COLUMN TAX_ID RETURN CASE WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR' ) = 1 THEN EMPLOYEES . TAX_ID WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER = EMPLOYEES . USER_ID THEN EMPLOYEES . TAX_ID WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER <> EMPLOYEES . USER_ID THEN ( 'XXX-XX-' CONCAT QSYS2 . SUBSTR ( EMPLOYEES . TAX_ID , 8 , 4 ) ) WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'EMP' ) = 1 THEN EMPLOYEES . TAX_ID ELSE 'XXX-XX-XXXX' END ENABLE ;
Example 3-9 Creating a mask on the TAX_ID column
- 3. Figure 3-10 shows the masks that are created in the HR_SCHEMA.
Figure 3-10 Column masks shown in System i Navigator
<!-- image -->
## 3.6.6 Activating RCAC
Now that you have created the row permission and the two column masks, RCAC must be activated. The row permission and the two column masks are enabled (last clause in the scripts), but now you must activate RCAC on the table. To do so, complete the following steps:
- 1. Run the SQL statements that are shown in Example 3-10.
## Example 3-10 Activating RCAC on the EMPLOYEES table
- /* Active Row Access Control (permissions) */
- /* Active Row Access Control (permissions)
- /* Active Column Access Control (masks) */
*/
ALTER TABLE HR_SCHEMA.EMPLOYEES
- /* Active Column Access Control (masks)
ACTIVATE ROW ACCESS CONTROL
ALTER
TABLE HR_SCHEMA.EMPLOYEES
ACTIVATE ROW
ACCESS
CONTROL
ACTIVATE COLUMN ACCESS CONTROL;
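Because the diff above reflows Example 3-10 across several fragments, the same statement is reassembled below into one unit, using only text that appears in the example itself.

```sql
/* Active Row Access Control (permissions) */
/* Active Column Access Control (masks)    */
ALTER TABLE HR_SCHEMA.EMPLOYEES
    ACTIVATE ROW ACCESS CONTROL
    ACTIVATE COLUMN ACCESS CONTROL;
```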
- 2. Look at the definition of the EMPLOYEE table, as shown in Figure 3-11. To do this, from the main navigation pane of System i Navigator, click Schemas  HR_SCHEMA  Tables , right-click the EMPLOYEES table, and click Definition .
Figure 3-11 Selecting the EMPLOYEES table from System i Navigator
<!-- image -->
*/
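Once RCAC is active on the table, a quick hypothetical spot check of the Example 3-9 mask is to select the masked column while connected as a member of the MGR group profile: rows belonging to other users should come back with only the last four digits of TAX_ID visible. Column names are taken from the example; the ORDER BY column is illustrative.

```sql
-- Hypothetical verification of the TAX_ID mask from Example 3-9:
-- run while connected as a user in the MGR group profile.
SELECT USER_ID, TAX_ID
  FROM HR_SCHEMA.EMPLOYEES
 ORDER BY USER_ID;
```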
- 2. Figure 4-68 shows the Visual Explain of the same SQL statement, but with RCAC enabled. It is clear that the implementation of the SQL statement is more complex because the row permission rule becomes part of the WHERE clause.
- 3. Compare the advised indexes that are provided by the Optimizer without RCAC and with RCAC enabled. Figure 4-69 shows the index advice for the SQL statement without RCAC enabled. The index being advised is for the ORDER BY clause.
Figure 4-68 Visual Explain with RCAC enabled
<!-- image -->
Figure 4-69 Index advice with no RCAC
<!-- image -->
THEN C.CUSTOMER_TAX_ID
WHEN QSYS2.VERIFY_GROUP_FOR_USER(SESSION_USER, 'TELLER') = 1
THEN ('XXX-XX-' CONCAT QSYS2.SUBSTR(C.CUSTOMER_TAX_ID, 8, 4))
WHEN QSYS2.VERIFY_GROUP_FOR_USER(SESSION_USER, 'CUSTOMER') = 1
THEN C.CUSTOMER_TAX_ID
ELSE 'XXX-XX-XXXX'
END
ENABLE;

CREATE MASK BANK_SCHEMA.MASK_DRIVERS_LICENSE_ON_CUSTOMERS
ON BANK_SCHEMA.CUSTOMERS AS C
FOR COLUMN CUSTOMER_DRIVERS_LICENSE_NUMBER
RETURN CASE
WHEN QSYS2.VERIFY_GROUP_FOR_USER(SESSION_USER, 'ADMIN') = 1
THEN C.CUSTOMER_DRIVERS_LICENSE_NUMBER
WHEN QSYS2.VERIFY_GROUP_FOR_USER(SESSION_USER, 'TELLER') = 1
THEN C.CUSTOMER_DRIVERS_LICENSE_NUMBER
WHEN QSYS2.VERIFY_GROUP_FOR_USER(SESSION_USER, 'CUSTOMER') = 1
THEN C.CUSTOMER_DRIVERS_LICENSE_NUMBER
ELSE '*************'
END
ENABLE;

CREATE MASK BANK_SCHEMA.MASK_LOGIN_ID_ON_CUSTOMERS
ON BANK_SCHEMA.CUSTOMERS AS C
FOR COLUMN CUSTOMER_LOGIN_ID
RETURN CASE
WHEN QSYS2.VERIFY_GROUP_FOR_USER(SESSION_USER, 'ADMIN') = 1
THEN C.CUSTOMER_LOGIN_ID
WHEN QSYS2.VERIFY_GROUP_FOR_USER(SESSION_USER, 'CUSTOMER') = 1
THEN C.CUSTOMER_LOGIN_ID
ELSE '*****'
END
ENABLE;

CREATE MASK BANK_SCHEMA.MASK_SECURITY_QUESTION_ON_CUSTOMERS
ON BANK_SCHEMA.CUSTOMERS AS C
FOR COLUMN CUSTOMER_SECURITY_QUESTION
RETURN CASE
WHEN QSYS2.VERIFY_GROUP_FOR_USER(SESSION_USER, 'ADMIN') = 1
THEN C.CUSTOMER_SECURITY_QUESTION
WHEN QSYS2.VERIFY_GROUP_FOR_USER(SESSION_USER, 'CUSTOMER') = 1
THEN C.CUSTOMER_SECURITY_QUESTION
ELSE '*****'
END
ENABLE;

CREATE MASK BANK_SCHEMA.MASK_SECURITY_QUESTION_ANSWER_ON_CUSTOMERS
ON BANK_SCHEMA.CUSTOMERS AS C
FOR COLUMN CUSTOMER_SECURITY_QUESTION_ANSWER
RETURN CASE
WHEN QSYS2.VERIFY_GROUP_FOR_USER(SESSION_USER, 'ADMIN') = 1
THEN C.CUSTOMER_SECURITY_QUESTION_ANSWER
WHEN QSYS2.VERIFY_GROUP_FOR_USER(SESSION_USER, 'CUSTOMER') = 1
THEN C.CUSTOMER_SECURITY_QUESTION_ANSWER
ELSE '*****'
END
ENABLE;

ALTER TABLE BANK_SCHEMA.CUSTOMERS
ACTIVATE ROW ACCESS CONTROL
ACTIVATE COLUMN ACCESS CONTROL;
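For illustration only, and not part of the Redbook script: once the masks are activated, an ordinary query returns different values depending on the group profile of the user who runs it. The sketch below simply reuses the CUSTOMERS columns named in the masks above.

SELECT CUSTOMER_LOGIN_ID,
       CUSTOMER_TAX_ID,
       CUSTOMER_DRIVERS_LICENSE_NUMBER
FROM BANK_SCHEMA.CUSTOMERS;

/* A user in the TELLER group sees '*****' for CUSTOMER_LOGIN_ID,
   'XXX-XX-' followed by the last four digits for CUSTOMER_TAX_ID,
   and the full driver's license number. A user in none of the listed
   groups sees '*****', 'XXX-XX-XXXX', and '*************' respectively. */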
Back cover
## Row and Column Access Control Support in IBM DB2 for i
Implement roles and separation of duties
File diff suppressed because one or more lines are too long
View File
@ -1,7 +1,7 @@
<document>
<subtitle-level-1><location><page_1><loc_37><loc_89><loc_85><loc_90></location>تحسين الإنتاجية وحل المشكلات من خلال البرمجة بلغة R و Python</subtitle-level-1>
<paragraph><location><page_1><loc_15><loc_80><loc_85><loc_87></location>تعتبر البرمجة بلغة R و Python من الأدوات القوية التي يمكن أن تعزز الإنتاجية وتساعد في إيجاد حلول فعالة للمشكلات. يمتلك كل من R و Python ميزات فريدة تجعلها مثالية لتحليل البيانات، مما يسهل على المحللين والعلماء إجراء تحليلات معقدة بطريقة سريعة وفعالة. إذا كان لديك عقلية تحليلية، فإن استخدام هذه اللغات يمكن أن يسهم بشكل كبير في تحسين نتائج العمل .</paragraph>
<paragraph><location><page_1><loc_16><loc_72><loc_85><loc_78></location>عندما يجتمع التفكير التحليلي مع مهارات البرمجة، يصبح من الممكن معالجة كميات هائلة من البيانات واستخراج الأنماط والتوجهات منها. يمكن للمبرمجين استخدام R و Python لتنفيذ عمليات تحليلية متقدمة، مثل النمذجة الإحصائية وتحليل البيانات الكبيرة. هذا ليس فقط يوفر الوقت، بل يمكن أن يؤدي أيضًا إلى اتخاذ قرارات أكثر دقة بناء ً على استنتاجات قائمة على البيانات .</paragraph>
<paragraph><location><page_1><loc_15><loc_63><loc_85><loc_69></location>علاوة على ذلك، توفر كل من R و Python مكتبات وأدوات غنية تدعم مجموعة واسعة م ن التطبيقات، من التحليل البياني إلى التعلم الآلي. يمكن للمستخدمين الاستفادة من هذه المكتبات لتطوير حلول مبتكرة للمشكلات المختلفة. على سبيل المثال، يمكن استخدام مكتبة pandas في Python لإدارة البيانات بكفاءة، بينما توفر R أدوات قوية للرسم البياني والتحليل الإ حصائي، مما يجعلها مثالية للباحثين والمحللين .</paragraph>
<paragraph><location><page_1><loc_17><loc_72><loc_85><loc_78></location>عندما يجتمع التفكير التحليلي مع مهارات البرمجة، يصبح من الممكن معالجة كميات هائلة من البيانات واستخراج الأنماط والتوجهات منها. يمكن للمبرمجين استخدام R و Python لتنفيذ عمليات تحليلية متقدمة، مثل النمذجة الإحصائية وتحليل البيانات الكبيرة. هذا ليس فقط يوفر الوقت، بل يمكن أن يؤدي أيضًا إلى اتخاذ قرارات أكثر دقة بناء ً على استنتاجات قائمة على البيانات .</paragraph>
<paragraph><location><page_1><loc_16><loc_63><loc_85><loc_69></location>علاوة على ذلك، توفر كل من R و Python مكتبات وأدوات غنية تدعم مجموعة واسعة م ن التطبيقات، من التحليل البياني إلى التعلم الآلي. يمكن للمستخدمين الاستفادة من هذه المكتبات لتطوير حلول مبتكرة للمشكلات المختلفة. على سبيل المثال، يمكن استخدام مكتبة pandas في Python لإدارة البيانات بكفاءة، بينما توفر R أدوات قوية للرسم الإ البياني والتحليل حصائي، مما يجعلها مثالية للباحثين والمحللين .</paragraph>
<paragraph><location><page_1><loc_16><loc_56><loc_85><loc_61></location>في النهاية، يمكن أن تؤدي البرمجة بلغة R و Python مع عقلية تحليلية إلى تحسين الإنتاجية وتوفير حلول مبتكرة للمشكلات المعقدة. إن القدرة على تحليل البيانات بشكل فعال وتطبيق الأساليب البرمجية المناسبة يمكن أن تكون له ا تأثيرات إيجابية بعيدة المدى على الأداء الشخصي والمهني .</paragraph>
</document>
File diff suppressed because one or more lines are too long
View File
@ -2,8 +2,8 @@
تعتبر البرمجة بلغة R و Python من الأدوات القوية التي يمكن أن تعزز الإنتاجية وتساعد في إيجاد حلول فعالة للمشكلات. يمتلك كل من R و Python ميزات فريدة تجعلها مثالية لتحليل البيانات، مما يسهل على المحللين والعلماء إجراء تحليلات معقدة بطريقة سريعة وفعالة. إذا كان لديك عقلية تحليلية، فإن استخدام هذه اللغات يمكن أن يسهم بشكل كبير في تحسين نتائج العمل .
عندما يجتمع التفكير التحليلي مع مهارات البرمجة، يصبح من الممكن معالجة كميات هائلة من البيانات واستخراج الأنماط والتوجهات منها. يمكن للمبرمجين استخدام R و Python لتنفيذ عمليات تحليلية متقدمة، مثل النمذجة الإحصائية وتحليل البيانات الكبيرة. هذا ليس فقط يوفر الوقت، بل يمكن أن يؤدي أيضًا إلى اتخاذ قرارات أكثر دقة بناء ً على استنتاجات قائمة على البيانات .
عندما يجتمع التفكير التحليلي مع مهارات البرمجة، يصبح من الممكن معالجة كميات هائلة من البيانات واستخراج الأنماط والتوجهات منها. يمكن للمبرمجين استخدام R و Python لتنفيذ عمليات تحليلية متقدمة، مثل النمذجة الإحصائية وتحليل البيانات الكبيرة. هذا ليس فقط يوفر الوقت، بل يمكن أن يؤدي أيضًا إلى اتخاذ قرارات أكثر دقة بناء ً على استنتاجات قائمة على البيانات .
علاوة على ذلك، توفر كل من R و Python مكتبات وأدوات غنية تدعم مجموعة واسعة م ن التطبيقات، من التحليل البياني إلى التعلم الآلي. يمكن للمستخدمين الاستفادة من هذه المكتبات لتطوير حلول مبتكرة للمشكلات المختلفة. على سبيل المثال، يمكن استخدام مكتبة pandas في Python لإدارة البيانات بكفاءة، بينما توفر R أدوات قوية للرسم البياني والتحليل الإ حصائي، مما يجعلها مثالية للباحثين والمحللين .
علاوة على ذلك، توفر كل من R و Python مكتبات وأدوات غنية تدعم مجموعة واسعة م ن التطبيقات، من التحليل البياني إلى التعلم الآلي. يمكن للمستخدمين الاستفادة من هذه المكتبات لتطوير حلول مبتكرة للمشكلات المختلفة. على سبيل المثال، يمكن استخدام مكتبة pandas في Python لإدارة البيانات بكفاءة، بينما توفر R أدوات قوية للرسم الإ البياني والتحليل حصائي، مما يجعلها مثالية للباحثين والمحللين .
في النهاية، يمكن أن تؤدي البرمجة بلغة R و Python مع عقلية تحليلية إلى تحسين الإنتاجية وتوفير حلول مبتكرة للمشكلات المعقدة. إن القدرة على تحليل البيانات بشكل فعال وتطبيق الأساليب البرمجية المناسبة يمكن أن تكون له ا تأثيرات إيجابية بعيدة المدى على الأداء الشخصي والمهني .
File diff suppressed because one or more lines are too long
View File
@ -1,9 +1,9 @@
<document>
<paragraph><location><page_1><loc_8><loc_3><loc_10><loc_4></location>11</paragraph>
<paragraph><location><page_1><loc_11><loc_50><loc_73><loc_75></location>وعليه، فإن الحكومة المصرية تضع صوو عييهاوخ لوال المر اوة الم باوة تكايف السيد رئيس الجماورية لاخ بخلعمل ياى تح يق يدد من الأ هودا ياى رعساخ : وضع ماف بهخء الإنسخن المصري ياى رعس قخئموة الأولويو خت، لخصووة فووا مجووخحت الصووحة واللعاوويل، العموول ياووى تح يووق معوودحت نمووو قويوووة ومسووولدامة و وووخماة فوووا عذاوووف ال لخيوووخت، و ووو ا الح وووخ ياوووى محوددات الأمون ال ووما المصوري فوا ضووء اللحوديخت الإقايميوة والدوليو ة، ومواصوواة واووود تلوووير الماووخر ة السيخسووية، واسوولمرار ملخبعووة ما و و خت الأمووون واحسووول رار ومكخفحوووة الإرهوووخ ، تلووووير ما وووخت ال خفوووة والوووويا الوووو،ها، والبلوووخ الوووديها المعلووودل ياوووى الهحوووو الووو ي يرسووو م وووخهيل الموا،هة والسام المجلمعا .</paragraph>
<paragraph><location><page_1><loc_13><loc_45><loc_74><loc_48></location>ووف ًخ لمخ سبق، يسولاد برنوخما الحكوموة المصورية لوال ال لور ( 2024 -2026 ) تح يق عربعة عهدا اسلراتيجية رئيسة، وها ياى الهحو الآتا :</paragraph>
<paragraph><location><page_1><loc_12><loc_37><loc_73><loc_40></location>مايــــــــة اممــــــــن القومي المصـر بنــــاء ا نســــا المصــــــــــــــــــــر بنـــــاء ا تصـــــاع تنابســــــــــــــــــــــي تحقيق اظستق رار السياســــــــــــــــــــــــي</paragraph>
<paragraph><location><page_1><loc_11><loc_23><loc_73><loc_31></location>تجدر الإ خر إلى عنه قد تل تحديد مسولادفخت البرنوخما بخحسولهخ د باوكل رئووويس ياوووى مسووولادفخت ر يوووة مصووور 2023 ، وتوصووويخت واسوووخت الحووووار الوو،ها، ومسولادفخت الوو ارات، والبرنوخما الوو،ها ليصوا خت الايك ايوة، ومبلاف احسلراتيجيخت الو،هية .</paragraph>
<paragraph><location><page_1><loc_12><loc_50><loc_73><loc_75></location>وعليه، اوة الم عييهاوخ لوال المر فإن الحكومة المصرية تضع صوو باوة يق يدد من الأ تكايف السيد رئيس الجماورية لاخ بخلعمل ياى تح هودا ياى رعساخ : وضع ماف بهخء الإنسخن المصري ياى رعس قخئموة الأولويو خت، يووق معوودحت نمووو لخصووة فووا مجووخحت الصووحة واللعاوويل، العموول ياووى تح ياوووى وووخ ا الح ووو لخيوووخت، و وووخماة فوووا عذاوووف ال قويوووة ومسووولدامة و ووما المصوري فوا ضووء اللحوديخت الإقايميوة والدوليو محوددات الأمون ال ة، و و ة السيخسووية، واسوولمرار ملخبعووة ما ومواصوواة واووود تلوووير الماووخر خت خفوووة والوووويا وووخت ال ، تلووووير ما رار ومكخفحوووة الإرهوووخ الأمووون واحسووول وووخهيل م ي يرسووو الووو الوووديها المعلووودل ياوووى الهحوووو الوووو،ها، والبلوووخ الموا،هة والسام المجلمعا .</paragraph>
<paragraph><location><page_1><loc_13><loc_45><loc_74><loc_48></location>لور برنوخما الحكوموة المصورية لوال ال ًخ لمخ سبق، يسولاد ووف ( 2024 -2026 ) اسلراتيجية رئيسة، وها ياى الهحو الآتا يق عربعة عهدا تح :</paragraph>
<paragraph><location><page_1><loc_12><loc_37><loc_73><loc_40></location>مايــــــــة اممــــــــن القومي المصـر نسـ ـــا بنــــاء ا المصـ ـــــــــــــــــــر تصـــــاع بنـــــاء ا تنابســــــــــــــــــــــي تحقيق اظستق رار السياســــــــــــــــــــــــي</paragraph>
<paragraph><location><page_1><loc_12><loc_23><loc_73><loc_31></location>إلى عنه قد تل تحديد مسولادفخت البرنوخما بخحسولهخ خر تجدر الإ د باوكل يوووة مصووور رئووويس ياوووى مسووولادفخت ر 2023 ، وتوصووويخت واسوووخت الحووووار خت الايك الوو،ها، ومسولادفخت الوو ارات، والبرنوخما الوو،ها ليصوا ايوة، ومبلاف احسلراتيجيخت الو،هية .</paragraph>
<figure>
<location><page_1><loc_75><loc_23><loc_100><loc_76></location>
</figure>
File diff suppressed because one or more lines are too long
View File
@ -1,11 +1,11 @@
11
وعليه، فإن الحكومة المصرية تضع صوو عييهاوخ لوال المر اوة الم باوة تكايف السيد رئيس الجماورية لاخ بخلعمل ياى تح يق يدد من الأ هودا ياى رعساخ : وضع ماف بهخء الإنسخن المصري ياى رعس قخئموة الأولويو خت، لخصووة فووا مجووخحت الصووحة واللعاوويل، العموول ياووى تح يووق معوودحت نمووو قويوووة ومسووولدامة و وووخماة فوووا عذاوووف ال لخيوووخت، و ووو ا الح وووخ ياوووى محوددات الأمون ال ووما المصوري فوا ضووء اللحوديخت الإقايميوة والدوليو ة، ومواصوواة واووود تلوووير الماووخر ة السيخسووية، واسوولمرار ملخبعووة ما و و خت الأمووون واحسووول رار ومكخفحوووة الإرهوووخ ، تلووووير ما وووخت ال خفوووة والوووويا الوووو،ها، والبلوووخ الوووديها المعلووودل ياوووى الهحوووو الووو ي يرسووو م وووخهيل الموا،هة والسام المجلمعا .
وعليه، اوة الم عييهاوخ لوال المر فإن الحكومة المصرية تضع صوو باوة يق يدد من الأ تكايف السيد رئيس الجماورية لاخ بخلعمل ياى تح هودا ياى رعساخ : وضع ماف بهخء الإنسخن المصري ياى رعس قخئموة الأولويو خت، يووق معوودحت نمووو لخصووة فووا مجووخحت الصووحة واللعاوويل، العموول ياووى تح ياوووى وووخ ا الح ووو لخيوووخت، و وووخماة فوووا عذاوووف ال قويوووة ومسووولدامة و ووما المصوري فوا ضووء اللحوديخت الإقايميوة والدوليو محوددات الأمون ال ة، و و ة السيخسووية، واسوولمرار ملخبعووة ما ومواصوواة واووود تلوووير الماووخر خت خفوووة والوووويا وووخت ال ، تلووووير ما رار ومكخفحوووة الإرهوووخ الأمووون واحسووول وووخهيل م ي يرسووو الووو الوووديها المعلووودل ياوووى الهحوووو الوووو،ها، والبلوووخ الموا،هة والسام المجلمعا .
ووف ًخ لمخ سبق، يسولاد برنوخما الحكوموة المصورية لوال ال لور ( 2024 -2026 ) تح يق عربعة عهدا اسلراتيجية رئيسة، وها ياى الهحو الآتا :
لور برنوخما الحكوموة المصورية لوال ال ًخ لمخ سبق، يسولاد ووف ( 2024 -2026 ) اسلراتيجية رئيسة، وها ياى الهحو الآتا يق عربعة عهدا تح :
مايــــــــة اممــــــــن القومي المصـر بنــــاء ا نســــا المصــــــــــــــــــــر بنـــــاء ا تصـــــاع تنابســــــــــــــــــــــي تحقيق اظستق رار السياســــــــــــــــــــــــي
مايــــــــة اممــــــــن القومي المصـر نسـ ـــا بنــــاء ا المصـ ـــــــــــــــــــر تصـــــاع بنـــــاء ا تنابســــــــــــــــــــــي تحقيق اظستق رار السياســــــــــــــــــــــــي
تجدر الإ خر إلى عنه قد تل تحديد مسولادفخت البرنوخما بخحسولهخ د باوكل رئووويس ياوووى مسووولادفخت ر يوووة مصووور 2023 ، وتوصووويخت واسوووخت الحووووار الوو،ها، ومسولادفخت الوو ارات، والبرنوخما الوو،ها ليصوا خت الايك ايوة، ومبلاف احسلراتيجيخت الو،هية .
إلى عنه قد تل تحديد مسولادفخت البرنوخما بخحسولهخ خر تجدر الإ د باوكل يوووة مصووور رئووويس ياوووى مسووولادفخت ر 2023 ، وتوصووويخت واسوووخت الحووووار خت الايك الوو،ها، ومسولادفخت الوو ارات، والبرنوخما الوو،ها ليصوا ايوة، ومبلاف احسلراتيجيخت الو،هية .
<!-- image -->
File diff suppressed because one or more lines are too long
View File
@ -5,28 +5,28 @@
</figure>
<subtitle-level-1><location><page_1><loc_63><loc_81><loc_81><loc_84></location>2-5 -استاندارد ک الا</subtitle-level-1>
<paragraph><location><page_1><loc_77><loc_79><loc_87><loc_81></location>نام استاندارد</paragraph>
<paragraph><location><page_1><loc_11><loc_75><loc_44><loc_81></location>شمشه و شمشال توليد شده به روش ريخته گری پيوسته مورد مصرف در فولادهای سازه ای - مطابق آناليز پيوست</paragraph>
<paragraph><location><page_1><loc_12><loc_75><loc_44><loc_81></location>ريخته گری به روش شده توليد و شمشال شمشه پيوسته مورد مصرف سازه ای فولادهای در - مطابق آناليز پيوست</paragraph>
<paragraph><location><page_1><loc_71><loc_72><loc_87><loc_74></location>شماره استاندارد ملی</paragraph>
<paragraph><location><page_1><loc_40><loc_73><loc_45><loc_74></location>20300</paragraph>
<paragraph><location><page_1><loc_68><loc_70><loc_87><loc_72></location>استاندارد اجباری است؟</paragraph>
<paragraph><location><page_1><loc_65><loc_67><loc_87><loc_69></location>مرجع صادرکننده استاندارد</paragraph>
<paragraph><location><page_1><loc_28><loc_67><loc_44><loc_69></location>سازمان ملی استاندارد ايران</paragraph>
<paragraph><location><page_1><loc_49><loc_62><loc_87><loc_66></location>آيا توليدکننده محصول، استاندارد مذکور را اخذ نموده است؟</paragraph>
<paragraph><location><page_1><loc_50><loc_62><loc_87><loc_66></location>آيا توليدکننده محصول، استاندارد مذکور را اخذ نموده است؟</paragraph>
<subtitle-level-1><location><page_1><loc_69><loc_56><loc_85><loc_58></location>3 -پذيرش در بورس</subtitle-level-1>
<paragraph><location><page_1><loc_68><loc_54><loc_83><loc_56></location>تاريخ ارائه مدارک</paragraph>
<paragraph><location><page_1><loc_69><loc_54><loc_83><loc_56></location>تاريخ ارائه مدارک</paragraph>
<paragraph><location><page_1><loc_23><loc_54><loc_32><loc_56></location>19 / 09 / 1403</paragraph>
<paragraph><location><page_1><loc_72><loc_51><loc_83><loc_53></location>تاريخ پذيرش</paragraph>
<paragraph><location><page_1><loc_23><loc_51><loc_32><loc_53></location>04 / 10 / 1403</paragraph>
<paragraph><location><page_1><loc_62><loc_48><loc_83><loc_50></location>شماره جلسه کميته عرضه</paragraph>
<paragraph><location><page_1><loc_26><loc_49><loc_29><loc_50></location>436</paragraph>
<paragraph><location><page_1><loc_67><loc_45><loc_83><loc_47></location>تاريخ درج اميدنامه</paragraph>
<paragraph><location><page_1><loc_68><loc_45><loc_83><loc_47></location>تاريخ درج اميدنامه</paragraph>
<paragraph><location><page_1><loc_23><loc_46><loc_32><loc_48></location>05 / 10 / 1403</paragraph>
<paragraph><location><page_1><loc_71><loc_43><loc_83><loc_45></location>مشاور پذيرش</paragraph>
<paragraph><location><page_1><loc_72><loc_43><loc_83><loc_45></location>مشاور پذيرش</paragraph>
<paragraph><location><page_1><loc_21><loc_43><loc_34><loc_45></location>کارگزاری آ رمون بورس</paragraph>
<paragraph><location><page_1><loc_47><loc_37><loc_83><loc_42></location>نحوة تعيين قيمت پايه پس از پذيرش کالا در بورس</paragraph>
<paragraph><location><page_1><loc_18><loc_40><loc_36><loc_42></location>بر اساس قيمت های جهانی</paragraph>
<paragraph><location><page_1><loc_48><loc_37><loc_83><loc_42></location>نحوة تعيين قيمت پايهپس از پذيرش کالا در بورس</paragraph>
<paragraph><location><page_1><loc_19><loc_40><loc_36><loc_42></location>بر اساس قيمت های جهانی</paragraph>
<paragraph><location><page_1><loc_45><loc_32><loc_83><loc_37></location>حداقل درصد عرضه از توليد / کل فروش / فروش داخلی</paragraph>
<paragraph><location><page_1><loc_14><loc_35><loc_40><loc_37></location>حداقل 50 % از توليد ساليانه يا 47.500 تن</paragraph>
<paragraph><location><page_1><loc_14><loc_35><loc_40><loc_37></location>حداقل 50 % يا از توليد ساليانه 47.500 تن</paragraph>
<paragraph><location><page_1><loc_68><loc_29><loc_83><loc_31></location>خطای مجاز تحويل</paragraph>
<paragraph><location><page_1><loc_18><loc_30><loc_37><loc_31></location>5% آخرين محموله قابل تحويل</paragraph>
</document>
File diff suppressed because one or more lines are too long
View File
@ -6,7 +6,7 @@
نام استاندارد
شمشه و شمشال توليد شده به روش ريخته گری پيوسته مورد مصرف در فولادهای سازه ای - مطابق آناليز پيوست
ريخته گری به روش شده توليد و شمشال شمشه پيوسته مورد مصرف سازه ای فولادهای در - مطابق آناليز پيوست
شماره استاندارد ملی
@ -42,13 +42,13 @@
کارگزاری آ رمون بورس
نحوة تعيين قيمت پايه پس از پذيرش کالا در بورس
نحوة تعيين قيمت پايهپس از پذيرش کالا در بورس
بر اساس قيمت های جهانی
حداقل درصد عرضه از توليد / کل فروش / فروش داخلی
حداقل 50 % از توليد ساليانه يا 47.500 تن
حداقل 50 % يا از توليد ساليانه 47.500 تن
خطای مجاز تحويل
File diff suppressed because one or more lines are too long
View File
@ -1,4 +1,5 @@
<doctag><page_header><loc_15><loc_101><loc_30><loc_354>arXiv:2203.01017v2 [cs.CV] 11 Mar 2022</page_header>
<doctag><page_header><loc_15><loc_133><loc_30><loc_354>arXiv:2203.01017v2 [cs.CV] 11 Mar</page_header>
<text><loc_15><loc_101><loc_30><loc_126>2022</text>
<section_header_level_1><loc_79><loc_68><loc_408><loc_76>TableFormer: Table Structure Understanding with Transformers.</section_header_level_1>
<section_header_level_1><loc_116><loc_93><loc_370><loc_108>Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar IBM Research</section_header_level_1>
<text><loc_170><loc_111><loc_309><loc_116>{ ahn,nli,mly,taa @zurich.ibm.com }</text>
@ -30,7 +31,7 @@
</unordered_list>
<text><loc_41><loc_411><loc_234><loc_439>The paper is structured as follows. In Sec. 2, we give a brief overview of the current state-of-the-art. In Sec. 3, we describe the datasets on which we train. In Sec. 4, we introduce the TableFormer model-architecture and describe</text>
<footnote><loc_50><loc_445><loc_150><loc_450>1 https://github.com/IBM/SynthTabNet</footnote>
<text><loc_252><loc_48><loc_445><loc_68>its results & performance in Sec. 5. As a conclusion, we describe how this new model-architecture can be re-purposed for other tasks in the computer-vision community.</text>
<text><loc_252><loc_48><loc_445><loc_68>its results &performance in Sec. 5. As a conclusion, we describe how this new model-architecture can be re-purposed for other tasks in the computer-vision community.</text>
<section_header_level_1><loc_252><loc_77><loc_407><loc_84>2. Previous work and State of the Art</section_header_level_1>
<text><loc_252><loc_90><loc_445><loc_209>Identifying the structure of a table has been an outstanding problem in the document-parsing community, that motivates many organised public challenges [6, 4, 14]. The difficulty of the problem can be attributed to a number of factors. First, there is a large variety in the shapes and sizes of tables. Such large variety requires a flexible method. This is especially true for complex column- and row headers, which can be extremely intricate and demanding. A second factor of complexity is the lack of data with regard to table-structure. Until the publication of PubTabNet [37], there were no large datasets (i.e. > 100 K tables) that provided structure information. This happens primarily due to the fact that tables are notoriously time-consuming to annotate by hand. However, this has definitely changed in recent years with the deliverance of PubTabNet [37], FinTabNet [36], TableBank [17] etc.</text>
<text><loc_252><loc_211><loc_445><loc_284>Before the rising popularity of deep neural networks, the community relied heavily on heuristic and/or statistical methods to do table structure identification [3, 7, 11, 5, 13, 28]. Although such methods work well on constrained tables [12], a more data-driven approach can be applied due to the advent of convolutional neural networks (CNNs) and the availability of large datasets. To the best-of-our knowledge, there are currently two different types of network architecture that are being pursued for state-of-the-art tablestructure identification.</text>
@ -45,7 +46,7 @@
<text><loc_41><loc_415><loc_234><loc_450>We rely on large-scale datasets such as PubTabNet [37], FinTabNet [36], and TableBank [17] datasets to train and evaluate our models. These datasets span over various appearance styles and content. We also introduce our own synthetically generated SynthTabNet dataset to fix an im-</text>
<picture><loc_255><loc_50><loc_450><loc_158><caption><loc_252><loc_169><loc_445><loc_182>Figure 2: Distribution of the tables across different table dimensions in PubTabNet + FinTabNet datasets</caption></picture>
<text><loc_252><loc_201><loc_357><loc_206>balance in the previous datasets.</text>
<text><loc_252><loc_209><loc_445><loc_396>The PubTabNet dataset contains 509k tables delivered as annotated PNG images. The annotations consist of the table structure represented in HTML format, the tokenized text and its bounding boxes per table cell. Fig. 1 shows the appearance style of PubTabNet. Depending on its complexity, a table is characterized as 'simple' when it does not contain row spans or column spans, otherwise it is 'complex'. The dataset is divided into Train and Val splits (roughly 98% and 2%). The Train split consists of 54% simple and 46% complex tables and the Val split of 51% and 49% respectively. The FinTabNet dataset contains 112k tables delivered as single-page PDF documents with mixed table structures and text content. Similarly to the PubTabNet, the annotations of FinTabNet include the table structure in HTML, the tokenized text and the bounding boxes on a table cell basis. The dataset is divided into Train, Test and Val splits (81%, 9.5%, 9.5%), and each one is almost equally divided into simple and complex tables (Train: 48% simple, 52% complex, Test: 48% simple, 52% complex, Test: 53% simple, 47% complex). Finally the TableBank dataset consists of 145k tables provided as JPEG images. The latter has annotations for the table structure, but only few with bounding boxes of the table cells. The entire dataset consists of simple tables and it is divided into 90% Train, 3% Test and 7% Val splits.</text>
<text><loc_252><loc_209><loc_445><loc_396>The PubTabNet dataset contains 509k tables delivered as annotated PNGimages. The annotations consist of the table structure represented in HTML format, the tokenized text and its bounding boxes per table cell. Fig. 1 shows the appearance style of PubTabNet. Depending on its complexity, a table is characterized as 'simple' when it does not contain row spans or column spans, otherwise it is 'complex'. The dataset is divided into Train and Val splits (roughly 98%and 2%). The Train split consists of 54% simple and 46% complex tables and the Val split of 51% and 49% respectively. The FinTabNet dataset contains 112k tables delivered as single-page PDFdocuments with mixed table structures and text content. Similarly to the PubTabNet, the annotations of FinTabNet include the table structure in HTML, the tokenized text and the bounding boxes on a table cell basis. The dataset is divided into Train, Test and Val splits (81%, 9.5%, 9.5%), and each one is almost equally divided into simple and complex tables (Train: 48% simple, 52% complex, Test: 48% simple, 52% complex, Test: 53% simple, 47% complex). Finally the TableBank dataset consists of 145k tables provided as JPEG images. The latter has annotations for the table structure, but only few with bounding boxes of the table cells. The entire dataset consists of simple tables and it is divided into 90% Train, 3% Test and 7% Val splits.</text>
<text><loc_252><loc_399><loc_445><loc_450>Due to the heterogeneity across the dataset formats, it was necessary to combine all available data into one homogenized dataset before we could train our models for practical purposes. Given the size of PubTabNet, we adopted its annotation format and we extracted and converted all tables as PNG images with a resolution of 72 dpi. Additionally, we have filtered out tables with extreme sizes due to small</text>
<page_footer><loc_241><loc_464><loc_245><loc_469>3</page_footer>
<page_break>
@ -69,7 +70,7 @@
<text><loc_252><loc_158><loc_445><loc_186>forming classification, and adding an adaptive pooling layer of size 28*28. ResNet by default downsamples the image resolution by 32 and then the encoded image is provided to both the Structure Decoder , and Cell BBox Decoder .</text>
<text><loc_252><loc_188><loc_445><loc_261>Structure Decoder. The transformer architecture of this component is based on the work proposed in [31]. After extensive experimentation, the Structure Decoder is modeled as a transformer encoder with two encoder layers and a transformer decoder made from a stack of 4 decoder layers that comprise mainly of multi-head attention and feed forward layers. This configuration uses fewer layers and heads in comparison to networks applied to other problems (e.g. 'Scene Understanding', 'Image Captioning'), something which we relate to the simplicity of table images.</text>
<text><loc_252><loc_263><loc_445><loc_344>The transformer encoder receives an encoded image from the CNN Backbone Network and refines it through a multi-head dot-product attention layer, followed by a Feed Forward Network. During training, the transformer decoder receives as input the output feature produced by the transformer encoder, and the tokenized input of the HTML ground-truth tags. Using a stack of multi-head attention layers, different aspects of the tag sequence could be inferred. This is achieved by each attention head on a layer operating in a different subspace, and then combining altogether their attention score.</text>
<text><loc_252><loc_346><loc_445><loc_412>Cell BBox Decoder. Our architecture allows to simultaneously predict HTML tags and bounding boxes for each table cell without the need of a separate object detector end to end. This approach is inspired by DETR [1] which employs a Transformer Encoder, and Decoder that looks for a specific number of object queries (potential object detections). As our model utilizes a transformer architecture, the hidden state of the < td > ' and ' < ' HTML structure tags become the object query.</text>
<text><loc_252><loc_346><loc_445><loc_412>Cell BBox Decoder. Our architecture allows to simultaneously predict HTML tags and bounding boxes for each table cell without the need of a separate object detector end to end. This approach is inspired by DETR [1] which employs a Transformer Encoder, and Decoder that looks for a specific number of object queries (potential object detections). As our model utilizes a transformer architecture, the hidden state of the < td > ' and ' < ' HTMLstructure tags become the object query.</text>
<text><loc_252><loc_414><loc_445><loc_450>The encoding generated by the CNN Backbone Network along with the features acquired for every data cell from the Transformer Decoder are then passed to the attention network. The attention network takes both inputs and learns to provide an attention weighted encoding. This weighted at-</text>
<page_footer><loc_241><loc_464><loc_245><loc_469>5</page_footer>
<page_break>
@ -119,7 +120,7 @@
<picture><loc_177><loc_240><loc_307><loc_280><caption><loc_41><loc_203><loc_445><loc_231>Figure 5: One of the benefits of TableFormer is that it is language agnostic, as an example, the left part of the illustration demonstrates TableFormer predictions on previously unseen language (Japanese). Additionally, we see that TableFormer is robust to variability in style and content, right side of the illustration shows the example of the TableFormer prediction from the FinTabNet dataset.</caption></picture>
<picture><loc_313><loc_241><loc_443><loc_280></picture>
<section_header_level_1><loc_41><loc_310><loc_134><loc_316>5.5. Qualitative Analysis</section_header_level_1>
<section_header_level_1><loc_252><loc_310><loc_377><loc_317>6. Future Work & Conclusion</section_header_level_1>
<section_header_level_1><loc_252><loc_310><loc_377><loc_317>6. Future Work &Conclusion</section_header_level_1>
<text><loc_41><loc_339><loc_234><loc_450>We showcase several visualizations for the different components of our network on various 'complex' tables within datasets presented in this work in Fig. 5 and Fig. 6 As it is shown, our model is able to predict bounding boxes for all table cells, even for the empty ones. Additionally, our post-processing techniques can extract the cell content by matching the predicted bounding boxes to the PDF cells based on their overlap and spatial proximity. The left part of Fig. 5 demonstrates also the adaptability of our method to any language, as it can successfully extract Japanese text, although the training set contains only English content. We provide more visualizations including the intermediate steps in the supplementary material. Overall these illustrations justify the versatility of our method across a diverse range of table appearances and content type.</text>
<text><loc_252><loc_324><loc_445><loc_412>In this paper, we presented TableFormer an end-to-end transformer based approach to predict table structures and bounding boxes of cells from an image. This approach enables us to recreate the table structure, and extract the cell content from PDF or OCR by using bounding boxes. Additionally, it provides the versatility required in real-world scenarios when dealing with various types of PDF documents, and languages. Furthermore, our method outperforms all state-of-the-arts with a wide margin. Finally, we introduce 'SynthTabNet' a challenging synthetically generated dataset that reinforces missing characteristics from other datasets.</text>
<section_header_level_1><loc_252><loc_424><loc_298><loc_431>References</section_header_level_1>
@ -130,25 +131,25 @@
<list_item><loc_57><loc_48><loc_234><loc_74>end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 , pages 213-229, Cham, 2020. Springer International Publishing. 5</list_item>
<list_item><loc_45><loc_76><loc_234><loc_95>[2] Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling Mao. Complicated table structure recognition. arXiv preprint arXiv:1908.04729 , 2019. 3</list_item>
<list_item><loc_45><loc_97><loc_234><loc_116>[3] Bertrand Couasnon and Aurelie Lemaitre. Recognition of Tables and Forms , pages 647-677. Springer London, London, 2014. 2</list_item>
<list_item><loc_45><loc_118><loc_234><loc_143>[4] Herv´ e D´ ejean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2</list_item>
<list_item><loc_45><loc_118><loc_234><loc_143>[4] Herv´ e D´jean, e Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2</list_item>
<list_item><loc_45><loc_146><loc_234><loc_171>[5] Basilios Gatos, Dimitrios Danatsas, Ioannis Pratikakis, and Stavros J Perantonis. Automatic table detection in document images. In International Conference on Pattern Recognition and Image Analysis , pages 609-618. Springer, 2005. 2</list_item>
<list_item><loc_45><loc_174><loc_234><loc_199>[6] MaxG¨ obel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2</list_item>
<list_item><loc_45><loc_174><loc_234><loc_199>[6] MaxG¨bel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. o Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2</list_item>
<list_item><loc_45><loc_201><loc_234><loc_220>[7] EA Green and M Krishnamoorthy. Recognition of tables using table grammars. procs. In Symposium on Document Analysis and Recognition (SDAIR'95) , pages 261-277. 2</list_item>
<list_item><loc_45><loc_222><loc_234><loc_255>[8] Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal. Castabdetectors: Cascade network for table detection in document images with recursive feature pyramid and switchable atrous convolution. Journal of Imaging , 7(10), 2021. 1</list_item>
<list_item><loc_45><loc_257><loc_234><loc_276>[9] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) , Oct 2017. 1</list_item>
<list_item><loc_41><loc_278><loc_234><loc_304>[10] Yelin He, X. Qi, Jiaquan Ye, Peng Gao, Yihao Chen, Bingcong Li, Xin Tang, and Rong Xiao. Pingan-vcgroup's solution for icdar 2021 competition on scientific table image recognition to latex. ArXiv , abs/2105.01846, 2021. 2</list_item>
<list_item><loc_41><loc_306><loc_234><loc_339>[11] Jianying Hu, Ramanujan S Kashi, Daniel P Lopresti, and Gordon Wilfong. Medium-independent table detection. In Document Recognition and Retrieval VII , volume 3967, pages 291-302. International Society for Optics and Photonics, 1999. 2</list_item>
<list_item><loc_41><loc_341><loc_234><loc_373>[12] Matthew Hurst. A constraint-based approach to table structure derivation. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2 , ICDAR '03, page 911, USA, 2003. IEEE Computer Society. 2</list_item>
<list_item><loc_41><loc_376><loc_234><loc_408>[13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ ement Chatelain, and Thierry Paquet. Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2</list_item>
<list_item><loc_41><loc_376><loc_234><loc_408>[13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ment Chatelain, and Thierry Paquet. e Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2</list_item>
<list_item><loc_41><loc_410><loc_234><loc_429>[14] Pratik Kayal, Mrinal Anand, Harsh Desai, and Mayank Singh. Icdar 2021 competition on scientific table image recognition to latex, 2021. 2</list_item>
<list_item><loc_41><loc_431><loc_234><loc_450>[15] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6</list_item>
<list_item><loc_41><loc_431><loc_234><loc_450>[15] Harold WKuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6</list_item>
<list_item><loc_252><loc_48><loc_445><loc_88>[16] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence , 35(12):2891-2903, 2013. 4</list_item>
<list_item><loc_252><loc_90><loc_445><loc_109>[17] Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. Tablebank: A benchmark dataset for table detection and recognition, 2019. 2, 3</list_item>
<list_item><loc_252><loc_111><loc_445><loc_164>[18] Yiren Li, Zheng Huang, Junchi Yan, Yi Zhou, Fan Ye, and Xianhui Liu. Gfte: Graph-based financial table extraction. In Alberto Del Bimbo, Rita Cucchiara, Stan Sclaroff, Giovanni Maria Farinella, Tao Mei, Marco Bertini, Hugo Jair Escalante, and Roberto Vezzani, editors, Pattern Recognition. ICPR International Workshops and Challenges , pages 644-658, Cham, 2021. Springer International Publishing. 2, 3</list_item>
<list_item><loc_252><loc_167><loc_445><loc_206>[19] Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, and Peter Staar. Robust pdf document conversion using recurrent neural networks. Proceedings of the AAAI Conference on Artificial Intelligence , 35(17):15137-15145, May 2021. 1</list_item>
<list_item><loc_252><loc_208><loc_445><loc_234>[20] Rujiao Long, Wen Wang, Nan Xue, Feiyu Gao, Zhibo Yang, Yongpan Wang, and Gui-Song Xia. Parsing table structures in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 944-952, 2021. 2</list_item>
<list_item><loc_252><loc_236><loc_445><loc_276>[21] Shubham Singh Paliwal, D Vishwanath, Rohit Rahul, Monika Sharma, and Lovekesh Vig. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 128-133. IEEE, 2019. 1</list_item>
<list_item><loc_252><loc_278><loc_445><loc_352>[22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6</list_item>
<list_item><loc_252><loc_278><loc_445><loc_352>[22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´-Buc, E. e Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6</list_item>
<list_item><loc_252><loc_355><loc_445><loc_394>[23] Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. Cascadetabnet: An approach for end to end table detection and structure recognition from image-based documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages 572-573, 2020. 1</list_item>
<list_item><loc_252><loc_396><loc_445><loc_422>[24] Shah Rukh Qasim, Hassan Mahmood, and Faisal Shafait. Rethinking table recognition using graph neural networks. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 142-147. IEEE, 2019. 3</list_item>
<list_item><loc_252><loc_424><loc_445><loc_450>[25] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on</list_item>
@ -160,7 +161,7 @@
<list_item><loc_41><loc_104><loc_234><loc_143>[27] Sebastian Schreiber, Stefan Agne, Ivo Wolf, Andreas Dengel, and Sheraz Ahmed. Deepdesrt: Deep learning for detection and structure recognition of tables in document images. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR) , volume 1, pages 1162-1167. IEEE, 2017. 3</list_item>
<list_item><loc_41><loc_146><loc_234><loc_171>[28] Faisal Shafait and Ray Smith. Table detection in heterogeneous documents. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems , pages 6572, 2010. 2</list_item>
<list_item><loc_41><loc_174><loc_234><loc_206>[29] Shoaib Ahmed Siddiqui, Imran Ali Fateh, Syed Tahseen Raza Rizvi, Andreas Dengel, and Sheraz Ahmed. Deeptabstr: Deep learning based table structure recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 1403-1409. IEEE, 2019. 3</list_item>
<list_item><loc_41><loc_208><loc_234><loc_241>[30] Peter W J Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD , KDD '18, pages 774-782, New York, NY, USA, 2018. ACM. 1</list_item>
<list_item><loc_41><loc_208><loc_234><loc_241>[30] Peter WJ Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD , KDD '18, pages 774-782, New York, NY, USA, 2018. ACM. 1</list_item>
<list_item><loc_41><loc_243><loc_234><loc_290>[31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30 , pages 5998-6008. Curran Associates, Inc., 2017. 5</list_item>
<list_item><loc_41><loc_292><loc_234><loc_317>[32] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , June 2015. 2</list_item>
<list_item><loc_41><loc_320><loc_234><loc_345>[33] Wenyuan Xue, Qingyong Li, and Dacheng Tao. Res2tim: reconstruct syntactic structures from table images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 749-755. IEEE, 2019. 3</list_item>
@ -176,7 +177,7 @@
<section_header_level_1><loc_109><loc_70><loc_380><loc_86>TableFormer: Table Structure Understanding with Transformers Supplementary Material</section_header_level_1>
<section_header_level_1><loc_41><loc_102><loc_144><loc_109>1. Details on the datasets</section_header_level_1>
<section_header_level_1><loc_41><loc_114><loc_123><loc_120>1.1. Data preparation</section_header_level_1>
<text><loc_41><loc_126><loc_234><loc_245>As a first step of our data preparation process, we have calculated statistics over the datasets across the following dimensions: (1) table size measured in the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes). A table is considered to be simple if it does not contain row spans or column spans. Additionally, a table has a strict HTML structure if every row has the same number of columns after taking into account any row or column spans. Therefore a strict HTML structure looks always rectangular. However, HTML is a lenient encoding format, i.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. These implicit rules leave room for ambiguity, which we want to avoid. As such, we prefer to have 'strict' tables, i.e. tables where every row has exactly the same length.</text>
<text><loc_41><loc_126><loc_234><loc_245>As a first step of our data preparation process, we have calculated statistics over the datasets across the following dimensions: (1) table size measured in the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes). Atable is considered to be simple if it does not contain row spans or column spans. Additionally, a table has a strict HTMLstructure if every row has the same number of columns after taking into account any row or column spans. Therefore a strict HTML structure looks always rectangular. However, HTML is a lenient encoding format, i.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. These implicit rules leave room for ambiguity, which we want to avoid. As such, we prefer to have 'strict' tables, i.e. tables where every row has exactly the same length.</text>
<text><loc_41><loc_247><loc_234><loc_396>We have developed a technique that tries to derive a missing bounding box out of its neighbors. As a first step, we use the annotation data to generate the most fine-grained grid that covers the table structure. In case of strict HTML tables, all grid squares are associated with some table cell and in the presence of table spans a cell extends across multiple grid squares. When enough bounding boxes are known for a rectangular table, it is possible to compute the geometrical border lines between the grid rows and columns. Eventually this information is used to generate the missing bounding boxes. Additionally, the existence of unused grid squares indicates that the table rows have unequal number of columns and the overall structure is non-strict. The generation of missing bounding boxes for non-strict HTML tables is ambiguous and therefore quite challenging. Thus, we have decided to simply discard those tables. In case of PubTabNet we have computed missing bounding boxes for 48% of the simple and 69% of the complex tables. Regarding FinTabNet, 68% of the simple and 98% of the complex tables require the generation of bounding boxes.</text>
<text><loc_41><loc_398><loc_234><loc_411>Figure 7 illustrates the distribution of the tables across different dimensions per dataset.</text>
<section_header_level_1><loc_41><loc_418><loc_125><loc_424>1.2. Synthetic datasets</section_header_level_1>
File diff suppressed because one or more lines are too long
View File
@ -1,3 +1,5 @@
2022
## TableFormer: Table Structure Understanding with Transformers.
## Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar IBM Research
@ -51,7 +53,7 @@ To meet the design criteria listed above, we developed a new model called TableF
The paper is structured as follows. In Sec. 2, we give a brief overview of the current state-of-the-art. In Sec. 3, we describe the datasets on which we train. In Sec. 4, we introduce the TableFormer model-architecture and describe
its results &amp; performance in Sec. 5. As a conclusion, we describe how this new model-architecture can be re-purposed for other tasks in the computer-vision community.
its results &amp;performance in Sec. 5. As a conclusion, we describe how this new model-architecture can be re-purposed for other tasks in the computer-vision community.
## 2. Previous work and State of the Art
@ -79,7 +81,7 @@ Figure 2: Distribution of the tables across different table dimensions in PubTab
balance in the previous datasets.
The PubTabNet dataset contains 509k tables delivered as annotated PNG images. The annotations consist of the table structure represented in HTML format, the tokenized text and its bounding boxes per table cell. Fig. 1 shows the appearance style of PubTabNet. Depending on its complexity, a table is characterized as 'simple' when it does not contain row spans or column spans, otherwise it is 'complex'. The dataset is divided into Train and Val splits (roughly 98% and 2%). The Train split consists of 54% simple and 46% complex tables and the Val split of 51% and 49% respectively. The FinTabNet dataset contains 112k tables delivered as single-page PDF documents with mixed table structures and text content. Similarly to the PubTabNet, the annotations of FinTabNet include the table structure in HTML, the tokenized text and the bounding boxes on a table cell basis. The dataset is divided into Train, Test and Val splits (81%, 9.5%, 9.5%), and each one is almost equally divided into simple and complex tables (Train: 48% simple, 52% complex, Test: 48% simple, 52% complex, Test: 53% simple, 47% complex). Finally the TableBank dataset consists of 145k tables provided as JPEG images. The latter has annotations for the table structure, but only few with bounding boxes of the table cells. The entire dataset consists of simple tables and it is divided into 90% Train, 3% Test and 7% Val splits.
The PubTabNet dataset contains 509k tables delivered as annotated PNGimages. The annotations consist of the table structure represented in HTML format, the tokenized text and its bounding boxes per table cell. Fig. 1 shows the appearance style of PubTabNet. Depending on its complexity, a table is characterized as 'simple' when it does not contain row spans or column spans, otherwise it is 'complex'. The dataset is divided into Train and Val splits (roughly 98%and 2%). The Train split consists of 54% simple and 46% complex tables and the Val split of 51% and 49% respectively. The FinTabNet dataset contains 112k tables delivered as single-page PDFdocuments with mixed table structures and text content. Similarly to the PubTabNet, the annotations of FinTabNet include the table structure in HTML, the tokenized text and the bounding boxes on a table cell basis. The dataset is divided into Train, Test and Val splits (81%, 9.5%, 9.5%), and each one is almost equally divided into simple and complex tables (Train: 48% simple, 52% complex, Test: 48% simple, 52% complex, Test: 53% simple, 47% complex). Finally the TableBank dataset consists of 145k tables provided as JPEG images. The latter has annotations for the table structure, but only few with bounding boxes of the table cells. The entire dataset consists of simple tables and it is divided into 90% Train, 3% Test and 7% Val splits.
Due to the heterogeneity across the dataset formats, it was necessary to combine all available data into one homogenized dataset before we could train our models for practical purposes. Given the size of PubTabNet, we adopted its annotation format and we extracted and converted all tables as PNG images with a resolution of 72 dpi. Additionally, we have filtered out tables with extreme sizes due to small
@ -132,7 +134,7 @@ Structure Decoder. The transformer architecture of this component is based on th
The transformer encoder receives an encoded image from the CNN Backbone Network and refines it through a multi-head dot-product attention layer, followed by a Feed Forward Network. During training, the transformer decoder receives as input the output feature produced by the transformer encoder, and the tokenized input of the HTML ground-truth tags. Using a stack of multi-head attention layers, different aspects of the tag sequence could be inferred. This is achieved by each attention head on a layer operating in a different subspace, and then combining altogether their attention score.
Cell BBox Decoder. Our architecture allows to simultaneously predict HTML tags and bounding boxes for each table cell without the need of a separate object detector end to end. This approach is inspired by DETR [1] which employs a Transformer Encoder, and Decoder that looks for a specific number of object queries (potential object detections). As our model utilizes a transformer architecture, the hidden state of the &lt; td &gt; ' and ' &lt; ' HTML structure tags become the object query.
Cell BBox Decoder. Our architecture allows to simultaneously predict HTML tags and bounding boxes for each table cell without the need of a separate object detector end to end. This approach is inspired by DETR [1] which employs a Transformer Encoder, and Decoder that looks for a specific number of object queries (potential object detections). As our model utilizes a transformer architecture, the hidden state of the &lt; td &gt; ' and ' &lt; ' HTMLstructure tags become the object query.
The encoding generated by the CNN Backbone Network along with the features acquired for every data cell from the Transformer Decoder are then passed to the attention network. The attention network takes both inputs and learns to provide an attention weighted encoding. This weighted at-
@ -269,7 +271,7 @@ Figure 5: One of the benefits of TableFormer is that it is language agnostic, as
## 5.5. Qualitative Analysis
## 6. Future Work &amp; Conclusion
## 6. Future Work &amp;Conclusion
We showcase several visualizations for the different components of our network on various 'complex' tables within datasets presented in this work in Fig. 5 and Fig. 6 As it is shown, our model is able to predict bounding boxes for all table cells, even for the empty ones. Additionally, our post-processing techniques can extract the cell content by matching the predicted bounding boxes to the PDF cells based on their overlap and spatial proximity. The left part of Fig. 5 demonstrates also the adaptability of our method to any language, as it can successfully extract Japanese text, although the training set contains only English content. We provide more visualizations including the intermediate steps in the supplementary material. Overall these illustrations justify the versatility of our method across a diverse range of table appearances and content type.
@ -282,25 +284,25 @@ In this paper, we presented TableFormer an end-to-end transformer based approach
- end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 , pages 213-229, Cham, 2020. Springer International Publishing. 5
- [2] Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling Mao. Complicated table structure recognition. arXiv preprint arXiv:1908.04729 , 2019. 3
- [3] Bertrand Couasnon and Aurelie Lemaitre. Recognition of Tables and Forms , pages 647-677. Springer London, London, 2014. 2
- [4] Herv´ e D´ ejean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2
- [4] Herv´ e D´jean, e Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2
- [5] Basilios Gatos, Dimitrios Danatsas, Ioannis Pratikakis, and Stavros J Perantonis. Automatic table detection in document images. In International Conference on Pattern Recognition and Image Analysis , pages 609-618. Springer, 2005. 2
- [6] MaxG¨ obel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2
- [6] MaxG¨bel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. o Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2
- [7] EA Green and M Krishnamoorthy. Recognition of tables using table grammars. procs. In Symposium on Document Analysis and Recognition (SDAIR'95) , pages 261-277. 2
- [8] Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal. Castabdetectors: Cascade network for table detection in document images with recursive feature pyramid and switchable atrous convolution. Journal of Imaging , 7(10), 2021. 1
- [9] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) , Oct 2017. 1
- [10] Yelin He, X. Qi, Jiaquan Ye, Peng Gao, Yihao Chen, Bingcong Li, Xin Tang, and Rong Xiao. Pingan-vcgroup's solution for icdar 2021 competition on scientific table image recognition to latex. ArXiv , abs/2105.01846, 2021. 2
- [11] Jianying Hu, Ramanujan S Kashi, Daniel P Lopresti, and Gordon Wilfong. Medium-independent table detection. In Document Recognition and Retrieval VII , volume 3967, pages 291-302. International Society for Optics and Photonics, 1999. 2
- [12] Matthew Hurst. A constraint-based approach to table structure derivation. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2 , ICDAR '03, page 911, USA, 2003. IEEE Computer Society. 2
- [13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Clément Chatelain, and Thierry Paquet. Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2
- [14] Pratik Kayal, Mrinal Anand, Harsh Desai, and Mayank Singh. Icdar 2021 competition on scientific table image recognition to latex, 2021. 2
- [15] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6
- [16] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence , 35(12):2891-2903, 2013. 4
- [17] Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. Tablebank: A benchmark dataset for table detection and recognition, 2019. 2, 3
- [18] Yiren Li, Zheng Huang, Junchi Yan, Yi Zhou, Fan Ye, and Xianhui Liu. Gfte: Graph-based financial table extraction. In Alberto Del Bimbo, Rita Cucchiara, Stan Sclaroff, Giovanni Maria Farinella, Tao Mei, Marco Bertini, Hugo Jair Escalante, and Roberto Vezzani, editors, Pattern Recognition. ICPR International Workshops and Challenges , pages 644-658, Cham, 2021. Springer International Publishing. 2, 3
- [19] Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, and Peter Staar. Robust pdf document conversion using recurrent neural networks. Proceedings of the AAAI Conference on Artificial Intelligence , 35(17):15137-15145, May 2021. 1
- [20] Rujiao Long, Wen Wang, Nan Xue, Feiyu Gao, Zhibo Yang, Yongpan Wang, and Gui-Song Xia. Parsing table structures in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 944-952, 2021. 2
- [21] Shubham Singh Paliwal, D Vishwanath, Rohit Rahul, Monika Sharma, and Lovekesh Vig. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 128-133. IEEE, 2019. 1
- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6
- [23] Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. Cascadetabnet: An approach for end to end table detection and structure recognition from image-based documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages 572-573, 2020. 1
- [24] Shah Rukh Qasim, Hassan Mahmood, and Faisal Shafait. Rethinking table recognition using graph neural networks. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 142-147. IEEE, 2019. 3
- [25] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on
@ -311,7 +313,7 @@ Computer Vision and Pattern Recognition , pages 658-666, 2019. 6
- [27] Sebastian Schreiber, Stefan Agne, Ivo Wolf, Andreas Dengel, and Sheraz Ahmed. Deepdesrt: Deep learning for detection and structure recognition of tables in document images. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR) , volume 1, pages 1162-1167. IEEE, 2017. 3
- [28] Faisal Shafait and Ray Smith. Table detection in heterogeneous documents. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems , pages 65-72, 2010. 2
- [29] Shoaib Ahmed Siddiqui, Imran Ali Fateh, Syed Tahseen Raza Rizvi, Andreas Dengel, and Sheraz Ahmed. Deeptabstr: Deep learning based table structure recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 1403-1409. IEEE, 2019. 3
- [30] Peter W J Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD , KDD '18, pages 774-782, New York, NY, USA, 2018. ACM. 1
- [31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30 , pages 5998-6008. Curran Associates, Inc., 2017. 5
- [32] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , June 2015. 2
- [33] Wenyuan Xue, Qingyong Li, and Dacheng Tao. Res2tim: reconstruct syntactic structures from table images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 749-755. IEEE, 2019. 3
@ -328,7 +330,7 @@ Computer Vision and Pattern Recognition , pages 658-666, 2019. 6
## 1.1. Data preparation
As a first step of our data preparation process, we have calculated statistics over the datasets across the following dimensions: (1) table size measured in the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes). A table is considered to be simple if it does not contain row spans or column spans. Additionally, a table has a strict HTML structure if every row has the same number of columns after taking into account any row or column spans. Therefore a strict HTML structure always looks rectangular. However, HTML is a lenient encoding format, i.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. These implicit rules leave room for ambiguity, which we want to avoid. As such, we prefer to have 'strict' tables, i.e. tables where every row has exactly the same length.
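The strictness criterion can be checked mechanically by expanding all row and column spans into an explicit grid and comparing the row widths. The sketch below assumes cells are given as dicts with optional 'colspan'/'rowspan' keys; this is an illustrative schema, not the exact annotation format of the datasets.

```python
from collections import defaultdict

def build_grid(rows):
    """Expand row/column spans into an explicit grid: grid[r][c] -> owning cell."""
    grid = defaultdict(dict)
    for r, row in enumerate(rows):
        c = 0
        for i, cell in enumerate(row):
            while c in grid[r]:               # skip columns covered by a rowspan from above
                c += 1
            cs, rs = cell.get("colspan", 1), cell.get("rowspan", 1)
            for dr in range(rs):
                for dc in range(cs):
                    grid[r + dr][c + dc] = (r, i)
            c += cs
    return grid

def is_strict(rows):
    """A table is 'strict' if every grid row covers the same number of columns."""
    widths = {len(cols) for cols in build_grid(rows).values()}
    return len(widths) <= 1

# A 2x2 table whose first row is one cell spanning two columns is strict:
assert is_strict([[{"colspan": 2}], [{}, {}]])
# Rows of different effective width make the table non-strict:
assert not is_strict([[{}], [{}, {}]])
```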
We have developed a technique that tries to derive a missing bounding box out of its neighbors. As a first step, we use the annotation data to generate the most fine-grained grid that covers the table structure. In case of strict HTML tables, all grid squares are associated with some table cell and in the presence of table spans a cell extends across multiple grid squares. When enough bounding boxes are known for a rectangular table, it is possible to compute the geometrical border lines between the grid rows and columns. Eventually this information is used to generate the missing bounding boxes. Additionally, the existence of unused grid squares indicates that the table rows have unequal number of columns and the overall structure is non-strict. The generation of missing bounding boxes for non-strict HTML tables is ambiguous and therefore quite challenging. Thus, we have decided to simply discard those tables. In case of PubTabNet we have computed missing bounding boxes for 48% of the simple and 69% of the complex tables. Regarding FinTabNet, 68% of the simple and 98% of the complex tables require the generation of bounding boxes.
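A rough sketch of the border interpolation, simplified to a span-free rectangular grid: known cell boxes (a dict mapping (row, col) to (x0, y0, x1, y1)) vote for the border coordinates of their grid row and column, and missing boxes are rebuilt from those medians. The input format and function name are assumptions for illustration, not the actual preparation code.

```python
from statistics import median

def derive_missing_boxes(known, n_rows, n_cols):
    x0s, x1s, y0s, y1s = {}, {}, {}, {}
    for (r, c), (x0, y0, x1, y1) in known.items():
        x0s.setdefault(c, []).append(x0)
        x1s.setdefault(c, []).append(x1)
        y0s.setdefault(r, []).append(y0)
        y1s.setdefault(r, []).append(y1)
    col_lo = {c: median(v) for c, v in x0s.items()}   # left border per column
    col_hi = {c: median(v) for c, v in x1s.items()}   # right border per column
    row_lo = {r: median(v) for r, v in y0s.items()}   # top border per row
    row_hi = {r: median(v) for r, v in y1s.items()}   # bottom border per row
    filled = dict(known)
    for r in range(n_rows):
        for c in range(n_cols):
            if (r, c) not in filled and c in col_lo and r in row_lo:
                filled[(r, c)] = (col_lo[c], row_lo[r], col_hi[c], row_hi[r])
            # if a whole row or column has no known box, it stays missing and the
            # table would be discarded, mirroring the non-strict case described above
    return filled
```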

@ -1,4 +1,5 @@
<doctag><page_header><loc_15><loc_104><loc_30><loc_350>arXiv:2206.01062v1 [cs.CV] 2 Jun 2022</page_header>
<doctag><page_header><loc_15><loc_136><loc_30><loc_350>arXiv:2206.01062v1 [cs.CV] 2 Jun</page_header>
<text><loc_15><loc_104><loc_30><loc_129>2022</text>
<section_header_level_1><loc_88><loc_53><loc_413><loc_75>DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis</section_header_level_1>
<text><loc_74><loc_85><loc_158><loc_114>Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com</text>
<text><loc_208><loc_85><loc_292><loc_114>Christoph Auer IBM Research Rueschlikon, Switzerland cau@zurich.ibm.com</text>
@ -23,7 +24,7 @@
<page_header><loc_44><loc_38><loc_456><loc_43>KDD '22, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar</page_header>
<section_header_level_1><loc_44><loc_55><loc_128><loc_61>1 INTRODUCTION</section_header_level_1>
<text><loc_44><loc_70><loc_248><loc_144>Despite the substantial improvements achieved with machine-learning (ML) approaches and deep neural networks in recent years, document conversion remains a challenging problem, as demonstrated by the numerous public competitions held on this topic [1-4]. The challenge originates from the huge variability in PDF documents regarding layout, language and formats (scanned, programmatic or a combination of both). Engineering a single ML model that can be applied on all types of documents and provides high-quality layout segmentation remains to this day extremely challenging [5]. To highlight the variability in document layouts, we show a few example documents from the DocLayNet dataset in Figure 1.</text>
<text><loc_44><loc_146><loc_241><loc_317>A key problem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or LaTeX sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank are very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.</text>
<text><loc_44><loc_319><loc_241><loc_366>In this paper, we present the DocLayNet dataset. It provides page-by-page layout annotation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique document pages, of which a fraction carry double- or triple-annotations. DocLayNet is similar in spirit to PubLayNet and DocBank and will likewise be made available to the public 1 in order to stimulate the document-layout analysis community. It distinguishes itself in the following aspects:</text>
<unordered_list><list_item><loc_53><loc_369><loc_241><loc_388>(1) Human Annotation : In contrast to PubLayNet and DocBank, we relied on human annotation instead of automation approaches to generate the data set.</list_item>
<list_item><loc_53><loc_390><loc_240><loc_402>(2) Large Layout Variability : We include diverse and complex layouts from a large variety of public sources.</list_item>
@ -37,10 +38,10 @@
<text><loc_259><loc_106><loc_457><loc_139>All aspects outlined above are detailed in Section 3. In Section 4, we will elaborate on how we designed and executed this large-scale human annotation campaign. We will also share key insights and lessons learned that might prove helpful for other parties planning to set up annotation campaigns.</text>
<text><loc_260><loc_141><loc_457><loc_194>In Section 5, we will present baseline accuracy numbers for a variety of object detection methods (Faster R-CNN, Mask R-CNN and YOLOv5) trained on DocLayNet. We further show how the model performance is impacted by varying the DocLayNet dataset size, reducing the label set and modifying the train/test-split. Last but not least, we compare the performance of models trained on PubLayNet, DocBank and DocLayNet and demonstrate that a model trained on DocLayNet provides overall more robust layout recovery.</text>
<section_header_level_1><loc_260><loc_203><loc_345><loc_209>2 RELATED WORK</section_header_level_1>
<text><loc_259><loc_219><loc_457><loc_293>While early approaches in document-layout analysis used rule-based algorithms and heuristics [8], the problem is lately addressed with deep learning methods. The most common approach is to leverage object detection models [9-15]. In the last decade, the accuracy and speed of these models has increased dramatically. Furthermore, most state-of-the-art object detection methods can be trained and applied with very little work, thanks to a standardisation effort of the ground-truth data format [16] and common deep-learning frameworks [17]. Reference data sets such as PubLayNet [6] and DocBank provide their data in the commonly accepted COCO format [16].</text>
<text><loc_260><loc_295><loc_457><loc_348>Lately, new types of ML models for document-layout analysis have emerged in the community [18-21]. These models do not approach the problem of layout analysis purely based on an image representation of the page, as computer vision methods do. Instead, they combine the text tokens and image representation of a page in order to obtain a segmentation. While the reported accuracies appear to be promising, a broadly accepted data format which links geometric and textual features has yet to be established.</text>
<section_header_level_1><loc_260><loc_357><loc_390><loc_363>3 THE DOCLAYNET DATASET</section_header_level_1>
<text><loc_260><loc_373><loc_457><loc_426>DocLayNet contains 80863 PDF pages. Among these, 7059 carry two instances of human annotations, and 1591 carry three. This amounts to 91104 total annotation instances. The annotations provide layout information in the shape of labeled, rectangular bounding-boxes. We define 11 distinct labels for layout features, namely Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title. Our reasoning for picking this particular label set is detailed in Section 4.</text>
<text><loc_260><loc_428><loc_456><loc_447>In addition to open intellectual property constraints for the source documents, we required that the documents in DocLayNet adhere to a few conditions. Firstly, we kept scanned documents</text>
<page_break>
<page_header><loc_44><loc_38><loc_284><loc_43>DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis</page_header>
@ -58,12 +59,12 @@
<text><loc_260><loc_399><loc_457><loc_446>The annotation campaign was carried out in four phases. In phase one, we identified and prepared the data sources for annotation. In phase two, we determined the class labels and how annotations should be done on the documents in order to obtain maximum consistency. The latter was guided by a detailed requirement analysis and exhaustive experiments. In phase three, we trained the annotation staff and performed exams for quality assurance. In phase four,</text>
<page_break>
<page_header><loc_44><loc_38><loc_456><loc_43>KDD '22, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar</page_header>
<otsl><loc_81><loc_87><loc_419><loc_186><ecel><ecel><ched>% of Total<lcel><lcel><ched>triple inter-annotator mAP @ 0.5-0.95 (%)<lcel><lcel><lcel><lcel><lcel><lcel><nl><ched>class label<ched>Count<ched>Train<ched>Test<ched>Val<ched>All<ched>Fin<ched>Man<ched>Sci<ched>Law<ched>Pat<ched>Ten<nl><rhed>Caption<fcel>22524<fcel>2.04<fcel>1.77<fcel>2.32<fcel>84-89<fcel>40-61<fcel>86-92<fcel>94-99<fcel>95-99<fcel>69-78<fcel>n/a<nl><rhed>Footnote<fcel>6318<fcel>0.60<fcel>0.31<fcel>0.58<fcel>83-91<fcel>n/a<fcel>100<fcel>62-88<fcel>85-94<fcel>n/a<fcel>82-97<nl><rhed>Formula<fcel>25027<fcel>2.25<fcel>1.90<fcel>2.96<fcel>83-85<fcel>n/a<fcel>n/a<fcel>84-87<fcel>86-96<fcel>n/a<fcel>n/a<nl><rhed>List-item<fcel>185660<fcel>17.19<fcel>13.34<fcel>15.82<fcel>87-88<fcel>74-83<fcel>90-92<fcel>97-97<fcel>81-85<fcel>75-88<fcel>93-95<nl><rhed>Page-footer<fcel>70878<fcel>6.51<fcel>5.58<fcel>6.00<fcel>93-94<fcel>88-90<fcel>95-96<fcel>100<fcel>92-97<fcel>100<fcel>96-98<nl><rhed>Page-header<fcel>58022<fcel>5.10<fcel>6.70<fcel>5.06<fcel>85-89<fcel>66-76<fcel>90-94<fcel>98-100<fcel>91-92<fcel>97-99<fcel>81-86<nl><rhed>Picture<fcel>45976<fcel>4.21<fcel>2.78<fcel>5.31<fcel>69-71<fcel>56-59<fcel>82-86<fcel>69-82<fcel>80-95<fcel>66-71<fcel>59-76<nl><rhed>Section-header<fcel>142884<fcel>12.60<fcel>15.77<fcel>12.85<fcel>83-84<fcel>76-81<fcel>90-92<fcel>94-95<fcel>87-94<fcel>69-73<fcel>78-86<nl><rhed>Table<fcel>34733<fcel>3.20<fcel>2.27<fcel>3.60<fcel>77-81<fcel>75-80<fcel>83-86<fcel>98-99<fcel>58-80<fcel>79-84<fcel>70-85<nl><rhed>Text<fcel>510377<fcel>45.82<fcel>49.28<fcel>45.00<fcel>84-86<fcel>81-86<fcel>88-93<fcel>89-93<fcel>87-92<fcel>71-79<fcel>87-95<nl><rhed>Title<fcel>5071<fcel>0.47<fcel>0.30<fcel>0.50<fcel>60-72<fcel>24-63<fcel>50-63<fcel>94-100<fcel>82-96<fcel>68-79<fcel>24-56<nl><rhed>Total<fcel>1107470<fcel>941123<fcel>99816<fcel>66531<fcel>82-83<fcel>71-74<fcel>79-81<fcel>89-94<fcel>86-91<fcel>71-76<fcel>68-85<nl><caption><loc_44><loc_54><loc_456><loc_73>Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row 'Total') in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.</caption></otsl>
<picture><loc_43><loc_196><loc_242><loc_341><caption><loc_44><loc_350><loc_242><loc_383>Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.</caption></picture>
<text><loc_44><loc_401><loc_240><loc_426>we distributed the annotation workload and performed continuous quality controls. Phase one and two required a small team of experts only. For phases three and four, a group of 40 dedicated annotators were assembled and supervised.</text>
<text><loc_44><loc_428><loc_241><loc_447><loc_44><loc_428><loc_241><loc_447>Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. A large effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv 3 , government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.</text>
<text><loc_260><loc_239><loc_457><loc_320>Preparation work included uploading and parsing the sourced PDF documents in the Corpus Conversion Service (CCS) [22], a cloud-native platform which provides a visual annotation interface and allows for dataset inspection and analysis. The annotation interface of CCS is shown in Figure 3. The desired balance of pages between the different document categories was achieved by selective subsampling of pages with certain desired properties. For example, we made sure to include the title page of each document and bias the remaining page selection to those with figures or tables. The latter was achieved by leveraging pre-trained object detection models from PubLayNet, which helped us estimate how many figures and tables a given page contains.</text>
<text><loc_259><loc_322><loc_457><loc_437>Phase 2: Label selection and guideline. We reviewed the collected documents and identified the most common structural features they exhibit. This was achieved by identifying recurrent layout elements and led us to the definition of 11 distinct class labels. These 11 class labels are Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title. Critical factors that were considered for the choice of these class labels were (1) the overall occurrence of the label, (2) the specificity of the label, (3) recognisability on a single page (i.e. no need for context from previous or next page) and (4) overall coverage of the page. Specificity ensures that the choice of label is not ambiguous, while coverage ensures that all meaningful items on a page can be annotated. We refrained from class labels that are very specific to a document category, such as Abstract in the Scientific Articles category. We also avoided class labels that are tightly linked to the semantics of the text. Labels such as Author and Affiliation , as seen in DocBank, are often only distinguishable by discriminating on</text>
<footnote><loc_260><loc_443><loc_302><loc_447>3 https://arxiv.org/</footnote>
<page_break>
<page_header><loc_44><loc_38><loc_284><loc_43>DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis</page_header>
@ -84,15 +85,15 @@
<text><loc_327><loc_290><loc_389><loc_291>05237a14f2524e3f53c8454b074409d05078038a6a36b770fcc8ec7e540deae0</text>
<caption><loc_260><loc_299><loc_457><loc_318>Figure 4: Examples of plausible annotation alternatives for the same page. Criteria in our annotation guideline can resolve cases A to C, while the case D remains ambiguous.</caption>
<text><loc_259><loc_332><loc_456><loc_344>were carried out over a timeframe of 12 weeks, after which 8 of the 40 initially allocated annotators did not pass the bar.</text>
<text><loc_259><loc_346><loc_457><loc_448>Phase 4: Production annotation. The previously selected 80K pages were annotated with the defined 11 class labels by 32 annotators. This production phase took around three months to complete. All annotations were created online through CCS, which visualises the programmatic PDF text-cells as an overlay on the page. The page annotations are obtained by drawing rectangular bounding-boxes, as shown in Figure 3. With regard to the annotation practices, we implemented a few constraints and capabilities on the tooling level. First, we only allow non-overlapping, vertically oriented, rectangular boxes. For the large majority of documents, this constraint was sufficient and it speeds up the annotation considerably in comparison with arbitrary segmentation shapes. Second, annotator staff were not able to see each other's annotations. This was enforced by design to avoid any bias in the annotation, which could skew the numbers of the inter-annotator agreement (see Table 1). We wanted</text>
<page_break>
<page_header><loc_44><loc_38><loc_456><loc_43>KDD '22, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar</page_header>
<text><loc_44><loc_55><loc_242><loc_115>Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on DocLayNet test set. The MRCNN (Mask R-CNN) and FRCNN (Faster R-CNN) models with ResNet-50 or ResNet-101 backbone were trained based on the network architectures from the detectron2 model zoo (Mask R-CNN R50, R101-FPN 3x, Faster R-CNN R101-FPN 3x), with default configurations. The YOLO implementation utilized was YOLOv5x6 [13]. All models were initialised using pre-trained weights from the COCO 2017 dataset.</text>
<otsl><loc_51><loc_124><loc_233><loc_222><ecel><ched>human<ched>MRCNN<lcel><ched>FRCNN<ched>YOLO<nl><ecel><ecel><ched>R50<ched>R101<ched>R101<ched>v5x6<nl><rhed>Caption<fcel>84-89<fcel>68.4<fcel>71.5<fcel>70.1<fcel>77.7<nl><rhed>Footnote<fcel>83-91<fcel>70.9<fcel>71.8<fcel>73.7<fcel>77.2<nl><rhed>Formula<fcel>83-85<fcel>60.1<fcel>63.4<fcel>63.5<fcel>66.2<nl><rhed>List-item<fcel>87-88<fcel>81.2<fcel>80.8<fcel>81.0<fcel>86.2<nl><rhed>Page-footer<fcel>93-94<fcel>61.6<fcel>59.3<fcel>58.9<fcel>61.1<nl><rhed>Page-header<fcel>85-89<fcel>71.9<fcel>70.0<fcel>72.0<fcel>67.9<nl><rhed>Picture<fcel>69-71<fcel>71.7<fcel>72.7<fcel>72.0<fcel>77.1<nl><rhed>Section-header<fcel>83-84<fcel>67.6<fcel>69.3<fcel>68.4<fcel>74.6<nl><rhed>Table<fcel>77-81<fcel>82.2<fcel>82.9<fcel>82.2<fcel>86.3<nl><rhed>Text<fcel>84-86<fcel>84.6<fcel>85.8<fcel>85.4<fcel>88.1<nl><rhed>Title<fcel>60-72<fcel>76.7<fcel>80.4<fcel>79.9<fcel>82.7<nl><rhed>All<fcel>82-83<fcel>72.4<fcel>73.5<fcel>73.4<fcel>76.8<nl></otsl>
<text><loc_44><loc_234><loc_241><loc_364>to avoid this at any cost in order to have clear, unbiased baseline numbers for human document-layout annotation. Third, we introduced the feature of snapping boxes around text segments to obtain a pixel-accurate annotation and again reduce time and effort. The CCS annotation tool automatically shrinks every user-drawn box to the minimum bounding-box around the enclosed text-cells for all purely text-based segments, which excludes only Table and Picture . For the latter, we instructed annotation staff to minimise inclusion of surrounding whitespace while including all graphical lines. A downside of snapping boxes to enclosed text cells is that some wrongly parsed PDF pages cannot be annotated correctly and need to be skipped. Fourth, we established a way to flag pages as rejected for cases where no valid annotation according to the label guidelines could be achieved. Example cases for this would be PDF pages that render incorrectly or contain layouts that are impossible to capture with non-overlapping rectangles. Such rejected pages are not contained in the final dataset. With all these measures in place, experienced annotation staff managed to annotate a single page in a typical timeframe of 20s to 60s, depending on its complexity.</text>
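The box-snapping step described above can be approximated as follows: a user-drawn box is shrunk to the tightest rectangle around the programmatic text cells it (mostly) encloses. The helper name, the coverage threshold and the (x0, y0, x1, y1) box format are assumptions for illustration; this is not the actual CCS implementation.

```python
def snap_to_text_cells(drawn_box, text_cells, min_coverage=0.5):
    """Shrink a drawn annotation box to the minimum bounding box of enclosed text cells."""
    x0, y0, x1, y1 = drawn_box
    enclosed = []
    for cx0, cy0, cx1, cy1 in text_cells:
        ox = max(0.0, min(x1, cx1) - max(x0, cx0))      # horizontal overlap
        oy = max(0.0, min(y1, cy1) - max(y0, cy0))      # vertical overlap
        area = (cx1 - cx0) * (cy1 - cy0)
        if area > 0 and (ox * oy) / area >= min_coverage:
            enclosed.append((cx0, cy0, cx1, cy1))
    if not enclosed:
        return drawn_box                                # nothing to snap to
    return (min(c[0] for c in enclosed), min(c[1] for c in enclosed),
            max(c[2] for c in enclosed), max(c[3] for c in enclosed))
```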
<section_header_level_1><loc_44><loc_372><loc_120><loc_378>5 EXPERIMENTS</section_header_level_1>
<text><loc_44><loc_387><loc_241><loc_448>The primary goal of DocLayNet is to obtain high-quality ML models capable of accurate document-layout analysis on a wide variety of challenging layouts. As discussed in Section 2, object detection models are currently the easiest to use, due to the standardisation of ground-truth data in COCO format [16] and the availability of general frameworks such as detectron2 [17]. Furthermore, baseline numbers in PubLayNet and DocBank were obtained using standard object detection models such as Mask R-CNN and Faster R-CNN. As such, we will relate to these object detection methods in this</text>
<picture><loc_264><loc_57><loc_452><loc_164><caption><loc_260><loc_177><loc_457><loc_216>Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask R-CNN network with ResNet50 backbone trained on increasing fractions of the DocLayNet dataset. The learning curve flattens around the 80% mark, indicating that increasing the size of the DocLayNet dataset with similar data will not yield significantly better predictions.</caption></picture>
<text><loc_260><loc_243><loc_456><loc_255>paper and leave the detailed evaluation of more recent methods mentioned in Section 2 for future work.</text>
<text><loc_260><loc_257><loc_456><loc_303>In this section, we will present several aspects related to the performance of object detection models on DocLayNet. Similarly as in PubLayNet, we will evaluate the quality of their predictions using mean average precision (mAP) with 10 overlaps that range from 0.5 to 0.95 in steps of 0.05 (mAP@0.5-0.95). These scores are computed by leveraging the evaluation code provided by the COCO API [16].</text>
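For reference, such mAP@0.5-0.95 scores are typically computed with pycocotools, whose COCOeval averages AP over the IoU thresholds 0.50:0.05:0.95 by default. A minimal sketch is shown below; the file names are placeholders, not files shipped with the dataset.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("doclaynet_test_ground_truth.json")     # COCO-format annotations (placeholder path)
coco_dt = coco_gt.loadRes("model_predictions.json")    # detections as a COCO result file (placeholder path)

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP @[ IoU=0.50:0.95 ], i.e. mAP@0.5-0.95
```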
<section_header_level_1><loc_260><loc_314><loc_381><loc_320>Baselines for Object Detection</section_header_level_1>
@ -110,15 +111,15 @@
<otsl><loc_288><loc_95><loc_427><loc_193><fcel>Class-count<ched>11<lcel><ched>5<lcel><nl><fcel>Split<ched>Doc<ched>Page<ched>Doc<ched>Page<nl><rhed>Caption<fcel>68<fcel>83<ecel><ecel><nl><rhed>Footnote<fcel>71<fcel>84<ecel><ecel><nl><rhed>Formula<fcel>60<fcel>66<ecel><ecel><nl><rhed>List-item<fcel>81<fcel>88<fcel>82<fcel>88<nl><rhed>Page-footer<fcel>62<fcel>89<ecel><ecel><nl><rhed>Page-header<fcel>72<fcel>90<ecel><ecel><nl><rhed>Picture<fcel>72<fcel>82<fcel>72<fcel>82<nl><rhed>Section-header<fcel>68<fcel>83<fcel>69<fcel>83<nl><rhed>Table<fcel>82<fcel>89<fcel>82<fcel>90<nl><rhed>Text<fcel>85<fcel>91<fcel>84<fcel>90<nl><rhed>Title<fcel>77<fcel>81<ecel><ecel><nl><rhed>All<fcel>72<fcel>84<fcel>78<fcel>87<nl></otsl>
<text><loc_260><loc_209><loc_457><loc_263>lists in PubLayNet (grouped list-items) versus DocLayNet (separate list-items), the label set of size 4 is the closest to PubLayNet, in the assumption that the List is down-mapped to Text in PubLayNet. The results in Table 3 show that the prediction accuracy on the remaining class labels does not change significantly when other classes are merged into them. The overall macro-average improves by around 5%, in particular when Page-footer and Page-header are excluded.</text>
<section_header_level_1><loc_260><loc_272><loc_449><loc_277>Impact of Document Split in Train and Test Set</section_header_level_1>
<text><loc_259><loc_281><loc_457><loc_376>Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: page-wise splitting gains ~10% in mAP over the document-wise splitting. Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.</text>
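A compact sketch of the document-wise split: pages are grouped by their source document and whole documents are assigned to a single subset, so no document contributes pages to more than one set. The 'doc_id' field and the function name are assumptions for illustration.

```python
import random

def split_by_document(pages, test_fraction=0.1, seed=0):
    """pages: list of dicts, each with at least a 'doc_id' key."""
    doc_ids = sorted({p["doc_id"] for p in pages})
    random.Random(seed).shuffle(doc_ids)                 # deterministic shuffle of documents
    n_test = max(1, int(len(doc_ids) * test_fraction))
    test_docs = set(doc_ids[:n_test])
    train = [p for p in pages if p["doc_id"] not in test_docs]
    test = [p for p in pages if p["doc_id"] in test_docs]
    return train, test
```

A page-wise split would instead shuffle the pages directly, which is exactly the setting that inflates the measured mAP in Table 4.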
<section_header_level_1><loc_260><loc_385><loc_342><loc_390>Dataset Comparison</section_header_level_1>
<text><loc_260><loc_394><loc_457><loc_447>Throughout this paper, we claim that DocLayNet's wider variety of document layouts leads to more robust layout detection models. In Table 5, we provide evidence for that. We trained models on each of the available datasets (PubLayNet, DocBank and DocLayNet) and evaluated them on the test sets of the other datasets. Due to the different label sets and annotation styles, a direct comparison is not possible. Hence, we focussed on the common labels among the datasets. Between PubLayNet and DocLayNet, these are Picture ,</text>
<page_break>
<page_header><loc_44><loc_38><loc_456><loc_43>KDD '22, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar</page_header>
<text><loc_44><loc_55><loc_242><loc_95>Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network across the PubLayNet, DocBank & DocLayNet data-sets. By evaluating on common label classes of each dataset, we observe that the DocLayNet-trained model has much less pronounced variations in performance across all datasets.</text>
<otsl><loc_59><loc_109><loc_225><loc_215><ecel><ecel><ched>Testing on<lcel><lcel><nl><ched>Training on<ched>labels<ched>PLN<ched>DB<ched>DLN<nl><rhed>PubLayNet (PLN)<rhed>Figure<fcel>96<fcel>43<fcel>23<nl><ucel><rhed>Sec-header<fcel>87<fcel>-<fcel>32<nl><ecel><rhed>Table<fcel>95<fcel>24<fcel>49<nl><ecel><rhed>Text<fcel>96<fcel>-<fcel>42<nl><ecel><rhed>total<fcel>93<fcel>34<fcel>30<nl><rhed>DocBank (DB)<rhed>Figure<fcel>77<fcel>71<fcel>31<nl><ucel><rhed>Table<fcel>19<fcel>65<fcel>22<nl><ucel><rhed>total<fcel>48<fcel>68<fcel>27<nl><rhed>DocLayNet (DLN)<rhed>Figure<fcel>67<fcel>51<fcel>72<nl><ucel><rhed>Sec-header<fcel>53<fcel>-<fcel>68<nl><ecel><rhed>Table<fcel>87<fcel>43<fcel>82<nl><ecel><rhed>Text<fcel>77<fcel>-<fcel>84<nl><ecel><rhed>total<fcel>59<fcel>47<fcel>78<nl></otsl>
<text><loc_44><loc_247><loc_240><loc_280>Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .</text>
<text><loc_44><loc_282><loc_241><loc_370>For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. We had to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.</text>
<section_header_level_1><loc_44><loc_383><loc_127><loc_388>Example Predictions</section_header_level_1>
<text><loc_44><loc_392><loc_241><loc_445>To conclude this section, we illustrate the quality of layout predictions one can expect from DocLayNet-trained models by providing a selection of examples without any further post-processing applied. Figure 6 shows selected layout predictions on pages from the test-set of DocLayNet. Results look decent in general across document categories, however one can also observe mistakes such as overlapping clusters of different classes, or entirely missing boxes due to low confidence.</text>
<section_header_level_1><loc_260><loc_55><loc_331><loc_61>6 CONCLUSION</section_header_level_1>
@ -129,7 +130,7 @@
<unordered_list><list_item><loc_262><loc_220><loc_456><loc_234>[1] Max Göbel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013.</list_item>
<list_item><loc_262><loc_235><loc_457><loc_254>[2] Christian Clausner, Apostolos Antonacopoulos, and Stefan Pletschacher. Icdar2017 competition on recognition of documents with complex layouts rdcl2017. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) , volume 01, pages 1404-1410, 2017.</list_item>
<list_item><loc_262><loc_256><loc_456><loc_269>[3] Hervé Déjean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), April 2019. http://sac.founderit.com/.</list_item>
<list_item><loc_262><loc_271><loc_457><loc_290>[4] Antonio Jimeno Yepes, Peter Zhong, and Douglas Burdick. Competition on scientific literature parsing. In Proceedings of the International Conference on Document Analysis and Recognition , ICDAR, pages 605-617. LNCS 12824, Springer-Verlag, sep 2021.</list_item>
<list_item><loc_262><loc_291><loc_457><loc_310>[5] Logan Markewich, Hao Zhang, Yubin Xing, Navid Lambert-Shirzad, Jiang Zhexin, Roy Lee, Zhi Li, and Seok-Bum Ko. Segmentation for document layout analysis: not dead yet. International Journal on Document Analysis and Recognition (IJDAR) , pages 1-11, 01 2022.</list_item>
<list_item><loc_262><loc_311><loc_456><loc_325>[6] Xu Zhong, Jianbin Tang, and Antonio Jimeno-Yepes. Publaynet: Largest dataset ever for document layout analysis. In Proceedings of the International Conference on Document Analysis and Recognition , ICDAR, pages 1015-1022, sep 2019.</list_item>
<list_item><loc_262><loc_326><loc_457><loc_350>[7] Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. Docbank: A benchmark dataset for document layout analysis. In Proceedings of the 28th International Conference on Computational Linguistics , COLING, pages 949-960. International Committee on Computational Linguistics, dec 2020.</list_item>

@ -1,3 +1,5 @@
2022
## DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com
@ -44,7 +46,7 @@ Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staa
Despite the substantial improvements achieved with machine-learning (ML) approaches and deep neural networks in recent years, document conversion remains a challenging problem, as demonstrated by the numerous public competitions held on this topic [1-4]. The challenge originates from the huge variability in PDF documents regarding layout, language and formats (scanned, programmatic or a combination of both). Engineering a single ML model that can be applied on all types of documents and provides high-quality layout segmentation remains to this day extremely challenging [5]. To highlight the variability in document layouts, we show a few example documents from the DocLayNet dataset in Figure 1.
A key problem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or LaTeX sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank are very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.
In this paper, we present the DocLayNet dataset. It provides page-by-page layout annotation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique document pages, of which a fraction carry double- or triple-annotations. DocLayNet is similar in spirit to PubLayNet and DocBank and will likewise be made available to the public 1 in order to stimulate the document-layout analysis community. It distinguishes itself in the following aspects:
@ -63,13 +65,13 @@ In Section 5, we will present baseline accuracy numbers for a variety of object
## 2 RELATED WORK
While early approaches in document-layout analysis used rule-based algorithms and heuristics [8], the problem is lately addressed with deep learning methods. The most common approach is to leverage object detection models [9-15]. In the last decade, the accuracy and speed of these models has increased dramatically. Furthermore, most state-of-the-art object detection methods can be trained and applied with very little work, thanks to a standardisation effort of the ground-truth data format [16] and common deep-learning frameworks [17]. Reference data sets such as PubLayNet [6] and DocBank provide their data in the commonly accepted COCO format [16].
Lately, new types of ML models for document-layout analysis have emerged in the community [18-21]. These models do not approach the problem of layout analysis purely based on an image representation of the page, as computer vision methods do. Instead, they combine the text tokens and image representation of a page in order to obtain a segmentation. While the reported accuracies appear to be promising, a broadly accepted data format which links geometric and textual features has yet to be established.
## 3 THE DOCLAYNET DATASET
DocLayNet contains 80863 PDF pages. Among these, 7059 carry two instances of human annotations, and 1591 carry three. This amounts to 91104 total annotation instances. The annotations provide layout information in the shape of labeled, rectangular bounding-boxes. We define 11 distinct labels for layout features, namely Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title. Our reasoning for picking this particular label set is detailed in Section 4.
In addition to open intellectual property constraints for the source documents, we required that the documents in DocLayNet adhere to a few conditions. Firstly, we kept scanned documents
@ -97,21 +99,21 @@ The annotation campaign was carried out in four phases. In phase one, we identif
Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row 'Total') in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges; the last seven columns report this agreement overall (All) and per document category (Fin, Man, Sci, Law, Pat, Ten).

| class label    | Count   | Train  | Test  | Val   | All   | Fin   | Man   | Sci    | Law   | Pat   | Ten   |
|----------------|---------|--------|-------|-------|-------|-------|-------|--------|-------|-------|-------|
| Caption        | 22524   | 2.04   | 1.77  | 2.32  | 84-89 | 40-61 | 86-92 | 94-99  | 95-99 | 69-78 | n/a   |
| Footnote       | 6318    | 0.60   | 0.31  | 0.58  | 83-91 | n/a   | 100   | 62-88  | 85-94 | n/a   | 82-97 |
| Formula        | 25027   | 2.25   | 1.90  | 2.96  | 83-85 | n/a   | n/a   | 84-87  | 86-96 | n/a   | n/a   |
| List-item      | 185660  | 17.19  | 13.34 | 15.82 | 87-88 | 74-83 | 90-92 | 97-97  | 81-85 | 75-88 | 93-95 |
| Page-footer    | 70878   | 6.51   | 5.58  | 6.00  | 93-94 | 88-90 | 95-96 | 100    | 92-97 | 100   | 96-98 |
| Page-header    | 58022   | 5.10   | 6.70  | 5.06  | 85-89 | 66-76 | 90-94 | 98-100 | 91-92 | 97-99 | 81-86 |
| Picture        | 45976   | 4.21   | 2.78  | 5.31  | 69-71 | 56-59 | 82-86 | 69-82  | 80-95 | 66-71 | 59-76 |
| Section-header | 142884  | 12.60  | 15.77 | 12.85 | 83-84 | 76-81 | 90-92 | 94-95  | 87-94 | 69-73 | 78-86 |
| Table          | 34733   | 3.20   | 2.27  | 3.60  | 77-81 | 75-80 | 83-86 | 98-99  | 58-80 | 79-84 | 70-85 |
| Text           | 510377  | 45.82  | 49.28 | 45.00 | 84-86 | 81-86 | 88-93 | 89-93  | 87-92 | 71-79 | 87-95 |
| Title          | 5071    | 0.47   | 0.30  | 0.50  | 60-72 | 24-63 | 50-63 | 94-100 | 82-96 | 68-79 | 24-56 |
| Total          | 1107470 | 941123 | 99816 | 66531 | 82-83 | 71-74 | 79-81 | 89-94  | 86-91 | 71-76 | 68-85 |
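To make the agreement metric concrete, the pairwise mAP between two annotators can be estimated by treating one annotator's boxes as detections with confidence 1.0 and the other's as ground truth. The sketch below assumes torchmetrics is installed and uses hypothetical tensors; it is not the evaluation code behind Table 1.

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

def pairwise_agreement(boxes_a, labels_a, boxes_b, labels_b):
    # Treat annotator A as "predictions" (score 1.0) and annotator B as ground truth.
    metric = MeanAveragePrecision(box_format="xyxy")  # defaults to IoU thresholds 0.5:0.95
    preds = [{"boxes": boxes_a, "scores": torch.ones(len(boxes_a)), "labels": labels_a}]
    target = [{"boxes": boxes_b, "labels": labels_b}]
    metric.update(preds, target)
    return metric.compute()["map"].item()

# Two annotators marking one region of the same (hypothetical) page, boxes in xyxy pixels:
a_boxes = torch.tensor([[10.0, 10.0, 200.0, 50.0]]); a_labels = torch.tensor([9])
b_boxes = torch.tensor([[12.0, 11.0, 198.0, 52.0]]); b_labels = torch.tensor([9])
print(pairwise_agreement(a_boxes, a_labels, b_boxes, b_labels))
```

Because the metric is not symmetric in its two arguments, evaluating both directions over all annotator pairs yields a range rather than a single number, which matches the accuracy ranges reported in the table.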
Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.
@ -119,11 +121,11 @@ Figure 3: Corpus Conversion Service annotation user interface. The PDF page is s
we distributed the annotation workload and performed continuous quality controls. Phases one and two required only a small team of experts. For phases three and four, a group of 40 dedicated annotators was assembled and supervised.
Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. A large effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv 3 , government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.
Preparation work included uploading and parsing the sourced PDF documents in the Corpus Conversion Service (CCS) [22], a cloud-native platform which provides a visual annotation interface and allows for dataset inspection and analysis. The annotation interface of CCS is shown in Figure 3. The desired balance of pages between the different document categories was achieved by selective subsampling of pages with certain desired properties. For example, we made sure to include the title page of each document and bias the remaining page selection to those with figures or tables. The latter was achieved by leveraging pre-trained object detection models from PubLayNet, which helped us estimate how many figures and tables a given page contains.
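The subsampling strategy described above can be pictured with a short sketch. Here `detect_objects` stands in for a hypothetical wrapper around a PubLayNet-pretrained detector that returns predicted labels for a page, and the weighting scheme is illustrative only.

```python
import random

def sample_pages(doc_pages, budget, detect_objects):
    """Pick `budget` pages from one document, biased toward figure/table-rich pages."""
    selected = [doc_pages[0]]            # always include the title page
    candidates = list(doc_pages[1:])
    # Weight each remaining page by how many figures/tables a pre-trained detector
    # predicts on it; detect_objects(page) is assumed to return a list of labels.
    weights = [1 + sum(lbl in {"Figure", "Table"} for lbl in detect_objects(p))
               for p in candidates]
    while len(selected) < budget and candidates:
        page = random.choices(candidates, weights=weights, k=1)[0]
        i = candidates.index(page)
        candidates.pop(i)
        weights.pop(i)
        selected.append(page)
    return selected

# Example: sample_pages(pages, budget=5, detect_objects=lambda p: ["Text", "Table"])
```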
Phase 2: Label selection and guideline. We reviewed the collected documents and identified the most common structural features they exhibit. This was achieved by identifying recurrent layout elements, which led us to the definition of 11 distinct class labels. These 11 class labels are Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title. Critical factors that were considered for the choice of these class labels were (1) the overall occurrence of the label, (2) the specificity of the label, (3) recognisability on a single page (i.e. no need for context from previous or next page) and (4) overall coverage of the page. Specificity ensures that the choice of label is not ambiguous, while coverage ensures that all meaningful items on a page can be annotated. We refrained from class labels that are very specific to a document category, such as Abstract in the Scientific Articles category. We also avoided class labels that are tightly linked to the semantics of the text. Labels such as Author and Affiliation, as seen in DocBank, are often only distinguishable by discriminating on
the textual content of an element, which goes beyond visual layout recognition, in particular outside the Scientific Articles category.
@ -148,9 +150,9 @@ Phase 3: Training. After a first trial with a small group of people, we realised
were carried out over a timeframe of 12 weeks, after which 8 of the 40 initially allocated annotators did not pass the bar.
Phase 4: Production annotation. The previously selected 80K pages were annotated with the defined 11 class labels by 32 annotators. This production phase took around three months to complete. All annotations were created online through CCS, which visualises the programmatic PDF text-cells as an overlay on the page. The page annotations are obtained by drawing rectangular bounding-boxes, as shown in Figure 3. With regard to the annotation practices, we implemented a few constraints and capabilities on the tooling level. First, we only allow non-overlapping, vertically oriented, rectangular boxes. For the large majority of documents, this constraint was sufficient and it speeds up the annotation considerably in comparison with arbitrary segmentation shapes. Second, annotator staff were not able to see each other's annotations. This was enforced by design to avoid any bias in the annotation, which could skew the numbers of the inter-annotator agreement (see Table 1). We wanted
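The first tooling constraint (non-overlapping, axis-aligned rectangles) amounts to a simple geometric check. A minimal sketch, assuming boxes are given in (x0, y0, x1, y1) form:

```python
# Sketch of the tooling-level constraint: annotation boxes must be axis-aligned
# rectangles and must not overlap. Box format assumed as (x0, y0, x1, y1).
def boxes_overlap(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

def validate_annotations(boxes):
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if boxes_overlap(boxes[i], boxes[j]):
                raise ValueError(f"boxes {i} and {j} overlap")

validate_annotations([(0, 0, 100, 40), (0, 50, 100, 90)])  # passes: disjoint boxes
```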
Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on DocLayNet test set. The MRCNN (Mask R-CNN) and FRCNN (Faster R-CNN) models with ResNet-50 or ResNet-101 backbone were trained based on the network architectures from the detectron2 model zoo (Mask R-CNN R50, R101-FPN 3x, Faster R-CNN R101-FPN 3x), with default configurations. The YOLO implementation utilized was YOLOv5x6 [13]. All models were initialised using pre-trained weights from the COCO 2017 dataset.
|                | human   | MRCNN R50 | MRCNN R101 | FRCNN R101 | YOLO v5x6 |
|----------------|---------|-----------|------------|------------|-----------|
@ -172,9 +174,9 @@ to avoid this at any cost in order to have clear, unbiased baseline numbers for
## 5 EXPERIMENTS
The primary goal of DocLayNet is to obtain high-quality ML models capable of accurate document-layout analysis on a wide variety of challenging layouts. As discussed in Section 2, object detection models are currently the easiest to use, due to the standardisation of ground-truth data in COCO format [16] and the availability of general frameworks such as detectron2 [17]. Furthermore, baseline numbers in PubLayNet and DocBank were obtained using standard object detection models such as Mask R-CNN and Faster R-CNN. As such, we will relate to these object detection methods in this
Figure 5: Prediction performance (mAP@0.5-0.95) of a Mask R-CNN network with ResNet-50 backbone trained on increasing fractions of the DocLayNet dataset. The learning curve flattens around the 80% mark, indicating that increasing the size of the DocLayNet dataset with similar data will not yield significantly better predictions.
<!-- image -->
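For orientation, training such a Mask R-CNN baseline on COCO-format layout data with detectron2 follows the standard training recipe. The sketch below uses placeholder dataset paths and the model-zoo default hyperparameters; it is not the exact configuration behind the reported numbers.

```python
# Sketch: training a Mask R-CNN R50-FPN layout baseline with detectron2 on
# COCO-format annotations. Paths are placeholders; hyperparameters are defaults.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

register_coco_instances("layout_train", {}, "annotations/train.json", "images/")
register_coco_instances("layout_val", {}, "annotations/val.json", "images/")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")  # COCO 2017 pre-trained init
cfg.DATASETS.TRAIN = ("layout_train",)
cfg.DATASETS.TEST = ("layout_val",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 11  # the 11 layout class labels

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```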
@ -233,13 +235,13 @@ lists in PubLayNet (grouped list-items) versus DocLayNet (separate list-items),
## Impact of Document Split in Train and Test Set
Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: page-wise splitting gains ~10% in mAP over the document-wise splitting. Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.
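A document-wise split is easy to implement by assigning whole documents, rather than pages, to a partition. The sketch below uses an illustrative hash-based assignment and split ratios, not the procedure used to build the released splits.

```python
import hashlib

def assign_split(doc_id, ratios=(0.85, 0.075, 0.075)):
    # Stable pseudo-random assignment per document, not per page, so that all
    # pages of one document land in the same partition.
    h = int(hashlib.sha256(doc_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    if h < ratios[0]:
        return "train"
    if h < ratios[0] + ratios[1]:
        return "test"
    return "val"

pages = [("doc_A", 1), ("doc_A", 2), ("doc_B", 1)]
splits = {(d, p): assign_split(d) for d, p in pages}
# All pages of doc_A end up in the same split by construction.
```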
## Dataset Comparison
Throughout this paper, we claim that DocLayNet's wider variety of document layouts leads to more robust layout detection models. In Table 5, we provide evidence for that. We trained models on each of the available datasets (PubLayNet, DocBank and DocLayNet) and evaluated them on the test sets of the other datasets. Due to the different label sets and annotation styles, a direct comparison is not possible. Hence, we focussed on the common labels among the datasets. Between PubLayNet and DocLayNet, these are Picture ,
Table 5: Prediction performance (mAP@0.5-0.95) of a Mask R-CNN R50 network across the PubLayNet, DocBank & DocLayNet datasets. By evaluating on common label classes of each dataset, we observe that the DocLayNet-trained model has much less pronounced variations in performance across all datasets.
| | | Testing on | Testing on | Testing on |
|-----------------|------------|--------------|--------------|--------------|
@ -260,7 +262,7 @@ Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network acros
Section-header, Table and Text. Before training, we either mapped or excluded DocLayNet's other labels as specified in Table 3, and also mapped PubLayNet's List to Text. Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text.
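Harmonising the label sets before such a cross-dataset evaluation can be done with a small remapping step. The mapping below is illustrative (it does not reproduce Table 3 exactly), and the annotation records are assumed to be simple dicts with a "label" key.

```python
# Sketch: restrict evaluation to labels common to two datasets by remapping what
# can be mapped and dropping the rest. Mapping and data structure are illustrative.
COMMON = {"Picture", "Section-header", "Table", "Text"}
REMAP = {"List": "Text"}  # e.g. PubLayNet's List is folded into Text

def harmonise(annotations):
    kept = []
    for ann in annotations:  # each ann is assumed to be a dict with a "label" key
        label = REMAP.get(ann["label"], ann["label"])
        if label in COMMON:
            kept.append({**ann, "label": label})
    return kept

print(harmonise([{"label": "List", "bbox": [0, 0, 10, 10]},
                 {"label": "Footnote", "bbox": [0, 20, 10, 30]}]))
```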
For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. We had to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.
## Example Predictions
@ -279,7 +281,7 @@ To date, there is still a significant gap between human and ML accuracy on the l
- [1] Max Göbel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. ICDAR 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition, pages 1449-1453, 2013.
- [2] Christian Clausner, Apostolos Antonacopoulos, and Stefan Pletschacher. ICDAR2017 competition on recognition of documents with complex layouts - RDCL2017. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 01, pages 1404-1410, 2017.
- [3] Hervé Déjean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), April 2019. http://sac.founderit.com/.
- [4] Antonio Jimeno Yepes, Peter Zhong, and Douglas Burdick. Competition on scientific literature parsing. In Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pages 605-617. LNCS 12824, Springer-Verlag, sep 2021.
- [5] Logan Markewich, Hao Zhang, Yubin Xing, Navid Lambert-Shirzad, Jiang Zhexin, Roy Lee, Zhi Li, and Seok-Bum Ko. Segmentation for document layout analysis: not dead yet. International Journal on Document Analysis and Recognition (IJDAR) , pages 1-11, 01 2022.
- [6] Xu Zhong, Jianbin Tang, and Antonio Jimeno-Yepes. PubLayNet: Largest dataset ever for document layout analysis. In Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pages 1015-1022, sep 2019.
- [7] Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. DocBank: A benchmark dataset for document layout analysis. In Proceedings of the 28th International Conference on Computational Linguistics, COLING, pages 949-960. International Committee on Computational Linguistics, dec 2020.

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@ -1,4 +1,5 @@
<doctag><page_header><loc_15><loc_104><loc_30><loc_350>arXiv:2305.03393v1 [cs.CV] 5 May 2023</page_header>
<doctag><page_header><loc_15><loc_136><loc_30><loc_350>arXiv:2305.03393v1 [cs.CV] 5 May</page_header>
<text><loc_15><loc_104><loc_30><loc_129>2023</text>
<section_header_level_1><loc_110><loc_73><loc_393><loc_92>Optimized Table Tokenization for Table Structure Recognition</section_header_level_1>
<text><loc_114><loc_107><loc_389><loc_126>Maksym Lysak [0000 - 0002 - 3723 - 6960] , Ahmed Nassar [0000 - 0002 - 9468 - 0822] , Nikolaos Livathinos [0000 - 0001 - 8513 - 3491] , Christoph Auer [0000 - 0001 - 5761 - 0422] , [0000 - 0002 - 8088 - 0823]</text>
<text><loc_188><loc_123><loc_244><loc_129>and Peter Staar</text>
@ -27,7 +28,7 @@
<page_header><loc_110><loc_58><loc_114><loc_65>4</page_header>
<page_header><loc_137><loc_58><loc_189><loc_65>M. Lysak, et al.</page_header>
<text><loc_110><loc_75><loc_393><loc_164>Other work [20] aims at predicting a grid for each table and deciding which cells must be merged using an attention network. Im2Seq methods cast the problem as a sequence generation task [4,5,9,22], and therefore need an internal tablestructure representation language, which is often implemented with standard markup languages (e.g. HTML, LaTeX, Markdown). In theory, Im2Seq methods have a natural advantage over the OD and GNN methods by virtue of directly predicting the table-structure. As such, no post-processing or rules are needed in order to obtain the table-structure, which is necessary with OD and GNN approaches. In practice, this is not entirely true, because a predicted sequence of table-structure markup does not necessarily have to be syntactically correct. Hence, depending on the quality of the predicted sequence, some post-processing needs to be performed to ensure a syntactically valid (let alone correct) sequence.</text>
<text><loc_110><loc_166><loc_393><loc_307>Within the Im2Seq method, we find several popular models, namely the encoder-dual-decoder model (EDD) [22], TableFormer [9], Tabsplitter[2] and Ye et. al. [19]. EDD uses two consecutive long short-term memory (LSTM) decoders to predict a table in HTML representation. The tag decoder predicts a sequence of HTML tags. For each decoded table cell ( <td> ), the attention is passed to the cell decoder to predict the content with an embedded OCR approach. The latter makes it susceptible to transcription errors in the cell content of the table. TableFormer address this reliance on OCR and uses two transformer decoders for HTML structure and cell bounding box prediction in an end-to-end architecture. The predicted cell bounding box is then used to extract text tokens from an originating (digital) PDF page, circumventing any need for OCR. TabSplitter [2] proposes a compact double-matrix representation of table rows and columns to do error detection and error correction of HTML structure sequences based on predictions from [19]. This compact double-matrix representation can not be used directly by the Img2seq model training, so the model uses HTML as an intermediate form. Chi et. al. [4] introduce a data set and a baseline method using bidirectional LSTMs to predict LaTeX code. Kayal [5] introduces Gated ResNet transformers to predict LaTeX code, and a separate OCR module to extract content.</text>
<text><loc_110><loc_166><loc_393><loc_307>Within the Im2Seq method, we find several popular models, namely the encoder-dual-decoder model (EDD) [22], TableFormer [9], Tabsplitter[2] and Ye et. al. [19]. EDD uses two consecutive long short-term memory (LSTM) decoders to predict a table in HTML representation. The tag decoder predicts a sequence of HTML tags. For each decoded table cell ( <td> ), the attention is passed to the cell decoder to predict the content with an embedded OCR approach. The latter makes it susceptible to transcription errors in the cell content of the table. TableFormer address this reliance on OCRand uses two transformer decoders for HTML structure and cell bounding box prediction in an end-to-end architecture. The predicted cell bounding box is then used to extract text tokens from an originating (digital) PDF page, circumventing any need for OCR. TabSplitter [2] proposes a compact double-matrix representation of table rows and columns to do error detection and error correction of HTML structure sequences based on predictions from [19]. This compact double-matrix representation can not be used directly by the Img2seq model training, so the model uses HTML as an intermediate form. Chi et. al. [4] introduce a data set and a baseline method using bidirectional LSTMs to predict LaTeX code. Kayal [5] introduces Gated ResNet transformers to predict LaTeX code, and a separate OCR module to extract content.</text>
<text><loc_110><loc_309><loc_393><loc_368>Im2Seq approaches have shown to be well-suited for the TSR task and allow a full end-to-end network design that can output the final table structure without pre- or post-processing logic. Furthermore, Im2Seq models have demonstrated to deliver state-of-the-art prediction accuracy [9]. This motivated the authors to investigate if the performance (both in accuracy and inference time) can be further improved by optimising the table structure representation language. We believe this is a necessary step before further improving neural network architectures for this task.</text>
<section_header_level_1><loc_110><loc_382><loc_220><loc_389>3 Problem Statement</section_header_level_1>
<text><loc_110><loc_399><loc_393><loc_420>All known Im2Seq based models for TSR fundamentally work in similar ways. Given an image of a table, the Im2Seq model predicts the structure of the table by generating a sequence of tokens. These tokens originate from a finite vocab-</text>

File diff suppressed because one or more lines are too long

View File

@ -1,3 +1,5 @@
2023
## Optimized Table Tokenization for Table Structure Recognition
Maksym Lysak [0000 - 0002 - 3723 - 6960] , Ahmed Nassar [0000 - 0002 - 9468 - 0822] , Nikolaos Livathinos [0000 - 0001 - 8513 - 3491] , Christoph Auer [0000 - 0001 - 5761 - 0422] , [0000 - 0002 - 8088 - 0823]
@ -38,7 +40,7 @@ Approaches to formalize the logical structure and layout of tables in electronic
Other work [20] aims at predicting a grid for each table and deciding which cells must be merged using an attention network. Im2Seq methods cast the problem as a sequence generation task [4,5,9,22], and therefore need an internal table-structure representation language, which is often implemented with standard markup languages (e.g. HTML, LaTeX, Markdown). In theory, Im2Seq methods have a natural advantage over the OD and GNN methods by virtue of directly predicting the table-structure. As such, no post-processing or rules are needed in order to obtain the table-structure, which is necessary with OD and GNN approaches. In practice, this is not entirely true, because a predicted sequence of table-structure markup does not necessarily have to be syntactically correct. Hence, depending on the quality of the predicted sequence, some post-processing needs to be performed to ensure a syntactically valid (let alone correct) sequence.
Within the Im2Seq method, we find several popular models, namely the encoder-dual-decoder model (EDD) [22], TableFormer [9], TabSplitter [2] and Ye et al. [19]. EDD uses two consecutive long short-term memory (LSTM) decoders to predict a table in HTML representation. The tag decoder predicts a sequence of HTML tags. For each decoded table cell ( <td> ), the attention is passed to the cell decoder to predict the content with an embedded OCR approach. The latter makes it susceptible to transcription errors in the cell content of the table. TableFormer addresses this reliance on OCR and uses two transformer decoders for HTML structure and cell bounding box prediction in an end-to-end architecture. The predicted cell bounding box is then used to extract text tokens from an originating (digital) PDF page, circumventing any need for OCR. TabSplitter [2] proposes a compact double-matrix representation of table rows and columns to do error detection and error correction of HTML structure sequences based on predictions from [19]. This compact double-matrix representation cannot be used directly for Img2seq model training, so the model uses HTML as an intermediate form. Chi et al. [4] introduce a data set and a baseline method using bidirectional LSTMs to predict LaTeX code. Kayal [5] introduces Gated ResNet transformers to predict LaTeX code, and a separate OCR module to extract content.
Im2Seq approaches have shown to be well-suited for the TSR task and allow a full end-to-end network design that can output the final table structure without pre- or post-processing logic. Furthermore, Im2Seq models have demonstrated to deliver state-of-the-art prediction accuracy [9]. This motivated the authors to investigate if the performance (both in accuracy and inference time) can be further improved by optimising the table structure representation language. We believe this is a necessary step before further improving neural network architectures for this task.
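To illustrate what such a table-structure representation language amounts to, the sketch below serialises a 2x2 table into a flat sequence over a small, finite vocabulary of HTML structure tokens. The vocabulary and sequence are illustrative only and do not correspond to the tokenization of any specific model.

```python
# Illustrative only: an Im2Seq-style TSR output is a flat token sequence drawn
# from a finite vocabulary of structure tokens (here, plain HTML table tags).
VOCAB = ["<table>", "</table>", "<tr>", "</tr>", "<td>", "</td>"]

def tokens_for_2x2():
    return ["<table>",
            "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>",
            "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>",
            "</table>"]

sequence = tokens_for_2x2()
assert all(tok in VOCAB for tok in sequence)   # every token comes from the vocabulary
print(" ".join(sequence))
```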

File diff suppressed because one or more lines are too long

View File

@ -1,17 +1,17 @@
<doctag><text><loc_61><loc_30><loc_262><loc_59>pulleys, provided the inner race of the bearing is clamped to the supporting structure by the nut and bolt. Plates must be attached to the structure in a positive manner to eliminate rotation or misalignment when tightening the bolts or screws.</text>
<text><loc_61><loc_70><loc_262><loc_116>The two general types of self-locking nuts currently in use are the all-metal type and the fiber lock type. For the sake of simplicity, only three typical kinds of self-locking nuts are considered in this handbook: the Boots self-locking and the stainless steel self-locking nuts, representing the all-metal types; and the elastic stop nut, representing the fiber insert type.</text>
<section_header_level_1><loc_61><loc_127><loc_141><loc_132>Boots Self-Locking Nut</section_header_level_1>
<text><loc_61><loc_136><loc_262><loc_182>nut is of one piece, all-metal The Boots self-locking construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.</text>
<text><loc_61><loc_193><loc_262><loc_238>The spring keeps the locking and load-carrying sections such a distance apart that the two sets of threads are out of phase or spaced so that a bolt, which has been screwed through the load-carrying section, must push the locking section outward against the force of the spring to engage the threads of the locking section properly.</text>
<text><loc_61><loc_249><loc_262><loc_311>The spring, through the medium of the locking section, exerts a constant locking force on the bolt in the same direction as a force that would tighten the nut. In this nut, the load-carrying section has the thread strength of a standard nut of comparable size, while the locking section presses against the threads of the bolt and locks the nut firmly in position. Only a wrench applied to the nut loosens it. The nut can be removed and reused without impairing its efficiency.</text>
<text><loc_61><loc_322><loc_262><loc_335>Boots self-locking nuts are made with three different spring styles and in various shapes and sizes. The wing type that is</text>
<picture><loc_59><loc_343><loc_261><loc_449><caption><loc_61><loc_455><loc_155><loc_460>Figure 7-26. Self-locking nuts.</caption></picture>
<text><loc_270><loc_30><loc_472><loc_76>the most common ranges in size for No. 6 up to 1 4 inch, the / Rol-top ranges from 1 4 inch to / 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.</text>
<text><loc_270><loc_78><loc_274><loc_84>.</text>
<section_header_level_1><loc_270><loc_86><loc_380><loc_92>Stainless Steel Self-Locking Nut</section_header_level_1>
<text><loc_270><loc_96><loc_472><loc_230>The stainless steel self-locking nut may be spun on and off by hand as its locking action takes places only when the nut is seated against a solid surface and tightened. The nut consists of two parts: a case with a beveled locking shoulder and key and a thread insert with a locking shoulder and slotted keyway. Until the nut is tightened, it spins on the bolt easily, because the threaded insert is the proper size for the bolt. However, when the nut is seated against a solid surface and tightened, the locking shoulder of the insert is pulled downward and wedged against the locking shoulder of the case. This action compresses the threaded insert and causes it to clench the bolt tightly. The cross-sectional view in Figure 7-27 shows how the key of the case fits into the slotted keyway of the insert so that when the case is turned, the threaded insert is turned with it. Note that the slot is wider than the key. This permits the slot to be narrowed and the insert to be compressed when the nut is tightened.</text>
<section_header_level_1><loc_270><loc_241><loc_327><loc_246>Elastic Stop Nut</section_header_level_1>
<text><loc_270><loc_250><loc_470><loc_264>The elastic stop nut is a standard nut with the height increased to accommodate a fiber locking collar. This</text>
<picture><loc_270><loc_272><loc_470><loc_447><caption><loc_270><loc_454><loc_405><loc_459>Figure 7-27. Stainless steel self-locking nut.</caption></picture>
<page_footer><loc_453><loc_472><loc_472><loc_478>7-45</page_footer>
<doctag><text><loc_61><loc_30><loc_260><loc_59>pulleys, provided the inner race of the bearing is clamped to the supporting structure by the nut and bolt. Plates must be attached to the structure in a positive manner to eliminate rotation or misalignment when tightening the bolts or screws.</text>
<text><loc_61><loc_70><loc_260><loc_116>The two general types of self-locking nuts currently in use are the all-metal type and the fiber lock type. For the sake of simplicity, only three typical kinds of self-locking nuts are considered in this handbook: the Boots self-locking and the stainless steel self-locking nuts, representing the all-metal types; andthe elastic stop nut, representing the fiber insert type.</text>
<section_header_level_1><loc_61><loc_127><loc_139><loc_132>Boots Self-Locking Nut</section_header_level_1>
<text><loc_61><loc_136><loc_260><loc_182>The Boots self-locking nut is of one piece, all-metal construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.</text>
<text><loc_61><loc_193><loc_260><loc_238>The spring keeps the locking and load-carrying sections such a distance apart that the two sets of threads are out of phase or spaced so that a bolt, which has been screwed through the load-carrying section, must push the locking section outward against the force of the spring to engage the threads of the locking section properly.</text>
<text><loc_61><loc_249><loc_260><loc_311>The spring, through the mediumofthe locking section, exerts a constant locking force on the bolt in the same direction as a force that would tighten the nut. In this nut, the load-carrying section has the thread strength of a standard nut of comparable size, while the locking section presses against the threads of the bolt and locks the nut firmly in position. Only a wrench applied to the nut loosens it. The nut can be removed and reused without impairing its efficiency.</text>
<text><loc_61><loc_322><loc_260><loc_335>Boots self-locking nuts are made with three different spring styles and in various shapes and sizes. The wing type that is</text>
<picture><loc_59><loc_343><loc_261><loc_449><caption><loc_61><loc_455><loc_153><loc_460>Figure 7-26. Self-locking nuts.</caption></picture>
<text><loc_270><loc_30><loc_470><loc_76>the most common ranges in size for No. 6 up to 1 / 4 inch, the Rol-top ranges from 1 / 4 inch to 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.</text>
<text><loc_270><loc_78><loc_272><loc_84>.</text>
<section_header_level_1><loc_270><loc_86><loc_378><loc_92>Stainless Steel Self-Locking Nut</section_header_level_1>
<text><loc_270><loc_96><loc_470><loc_230>The stainless steel self-locking nut may be spun on and off by hand as its locking action takes places only when the nut is seated against a solid surface and tightened. The nut consists of two parts: a case with a beveled locking shoulder and key and a thread insert with a locking shoulder and slotted keyway. Until the nut is tightened, it spins on the bolt easily, because the threaded insert is the proper size for the bolt. However, when the nut is seated against a solid surface and tightened, the locking shoulder of the insert is pulled downward and wedged against the locking shoulder of the case. This action compresses the threaded insert and causes it to clench the bolt tightly. The cross-sectional view in Figure 7-27 shows how the key of the case fits into the slotted keyway of the insert so that when the case is turned, the threaded insert is turned with it. Note that the slot is wider than the key. This permits the slot to be narrowed and the insert to be compressed when the nut is tightened.</text>
<section_header_level_1><loc_270><loc_241><loc_325><loc_246>Elastic Stop Nut</section_header_level_1>
<text><loc_270><loc_250><loc_470><loc_264>The elastic stop nut is a standard nut with the height increased to accommodate a fiber locking collar. This</text>
<picture><loc_270><loc_272><loc_470><loc_447><caption><loc_270><loc_454><loc_404><loc_459>Figure 7-27. Stainless steel self-locking nut.</caption></picture>
<page_footer><loc_453><loc_472><loc_470><loc_478>7-45</page_footer>
</doctag>

File diff suppressed because one or more lines are too long

Some files were not shown because too many files have changed in this diff.