Update test-cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Christoph Auer 2025-03-25 13:49:08 +01:00
commit f1f7df49e3
131 changed files with 3536 additions and 1629 deletions

11
.actor/.dockerignore Normal file

@ -0,0 +1,11 @@
**/__pycache__
**/*.pyc
**/*.pyo
**/*.pyd
.git
.gitignore
.env
.venv
*.log
.pytest_cache
.coverage

69
.actor/CHANGELOG.md Normal file

@ -0,0 +1,69 @@
# Changelog
All notable changes to the Docling Actor will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [1.1.0] - 2025-03-09
### Changed
- Switched from full Docling CLI to docling-serve API
- Using the official quay.io/ds4sd/docling-serve-cpu Docker image
- Reduced Docker image size (from ~6GB to ~4GB)
- Implemented multi-stage Docker build to handle dependencies
- Improved Docker build process to ensure compatibility with docling-serve-cpu image
- Added new Python processor script for reliable API communication and content extraction
- Enhanced response handling with better content extraction logic
- Fixed ES modules compatibility issue with Apify CLI
- Added explicit tmpfs volume for temporary files
- Fixed environment variables format in actor.json
- Created optimized dependency installation approach
- Improved API compatibility with docling-serve
- Updated endpoint from custom `/convert` to standard `/v1alpha/convert/source`
- Revised JSON payload structure to match docling-serve API format
- Added proper output field parsing based on format
- Enhanced startup process with health checks
- Added configurable API host and port through environment variables
- Better content type handling for different output formats
- Updated error handling to align with API responses
### Fixed
- Fixed actor input file conflict in get_actor_input(): now checks for and removes an existing /tmp/actor-input/INPUT directory if found, ensuring valid JSON input parsing.
### Technical Details
- Actor Specification v1
- Using quay.io/ds4sd/docling-serve-cpu:latest base image
- Node.js 20.x for Apify CLI
- Eliminated Python dependencies
- Simplified Docker build process
## [1.0.0] - 2025-02-07
### Added
- Initial release of Docling Actor
- Support for multiple document formats (PDF, DOCX, images)
- OCR capabilities for scanned documents
- Multiple output formats (md, json, html, text, doctags)
- Comprehensive error handling and logging
- Dataset records with processing status
- Memory monitoring and resource optimization
- Security features including non-root user execution
### Technical Details
- Actor Specification v1
- Docling v2.17.0
- Python 3.11
- Node.js 20.x
- Comprehensive error codes:
- 10: Invalid input
- 11: URL inaccessible
- 12: Docling processing failed
- 13: Output file missing
- 14: Storage operation failed
- 15: OCR processing failed

87
.actor/Dockerfile Normal file

@ -0,0 +1,87 @@
# Build stage for installing dependencies
FROM node:20-slim AS builder
# Install necessary tools and prepare dependencies environment in one layer
RUN apt-get update && apt-get install -y --no-install-recommends \
ca-certificates \
&& rm -rf /var/lib/apt/lists/* \
&& mkdir -p /build/bin /build/lib/node_modules \
&& cp /usr/local/bin/node /build/bin/
# Set working directory
WORKDIR /build
# Create package.json and install Apify CLI in one layer
RUN echo '{"name":"docling-actor-dependencies","version":"1.0.0","description":"Dependencies for Docling Actor","private":true,"type":"module","engines":{"node":">=18"}}' > package.json \
&& npm install apify-cli@latest \
&& cp -r node_modules/* lib/node_modules/ \
&& echo '#!/bin/sh\n/tmp/docling-tools/bin/node /tmp/docling-tools/lib/node_modules/apify-cli/bin/run "$@"' > bin/actor \
&& chmod +x bin/actor \
# Clean up npm cache to reduce image size
&& npm cache clean --force
# Final stage with docling-serve-cpu
FROM quay.io/ds4sd/docling-serve-cpu:latest
LABEL maintainer="Vaclav Vancura <@vancura>" \
description="Apify Actor for document processing using Docling" \
version="1.1.0"
# Set only essential environment variables
ENV PYTHONUNBUFFERED=1 \
PYTHONDONTWRITEBYTECODE=1 \
DOCLING_SERVE_HOST=0.0.0.0 \
DOCLING_SERVE_PORT=5001
# Switch to root temporarily to set up directories and permissions
USER root
WORKDIR /app
# Install required tools and create directories in a single layer
RUN dnf install -y \
jq \
&& dnf clean all \
&& mkdir -p /build-files \
/tmp \
/tmp/actor-input \
/tmp/actor-output \
/tmp/actor-storage \
/tmp/apify_input \
/apify_input \
/opt/app-root/src/.EasyOCR/user_network \
/tmp/easyocr-models \
&& chown 1000:1000 /build-files \
&& chown -R 1000:1000 /opt/app-root/src/.EasyOCR \
&& chmod 1777 /tmp \
&& chmod 1777 /tmp/easyocr-models \
&& chmod 777 /tmp/actor-input /tmp/actor-output /tmp/actor-storage /tmp/apify_input /apify_input \
# Fix for uv_os_get_passwd error in Node.js
&& echo "docling:x:1000:1000:Docling User:/app:/bin/sh" >> /etc/passwd
# Set environment variable to tell EasyOCR to use a writable location for models
ENV EASYOCR_MODULE_PATH=/tmp/easyocr-models
# Copy only required files
COPY --chown=1000:1000 .actor/actor.sh .actor/actor.sh
COPY --chown=1000:1000 .actor/actor.json .actor/actor.json
COPY --chown=1000:1000 .actor/input_schema.json .actor/input_schema.json
COPY --chown=1000:1000 .actor/docling_processor.py .actor/docling_processor.py
RUN chmod +x .actor/actor.sh
# Copy the build files from builder
COPY --from=builder --chown=1000:1000 /build /build-files
# Switch to non-root user
USER 1000
# Set up TMPFS for temporary files
VOLUME ["/tmp"]
# Create additional volumes for OCR models persistence
VOLUME ["/tmp/easyocr-models"]
# Expose the docling-serve API port
EXPOSE 5001
# Run the actor script
ENTRYPOINT [".actor/actor.sh"]

314
.actor/README.md Normal file

@ -0,0 +1,314 @@
# Docling Actor on Apify
[![Docling Actor](https://apify.com/actor-badge?actor=vancura/docling?fpr=docling)](https://apify.com/vancura/docling)
This Actor (specification v1) wraps the [Docling project](https://ds4sd.github.io/docling/) to provide serverless document processing in the cloud. It can process complex documents (PDF, DOCX, images) and convert them into structured formats (Markdown, JSON, HTML, Text, or DocTags) with optional OCR support.
## What are Actors?
[Actors](https://docs.apify.com/platform/actors?fpr=docling) are serverless microservices running on the [Apify Platform](https://apify.com/?fpr=docling). They are based on the [Actor SDK](https://docs.apify.com/sdk/js?fpr=docling) and can be found in the [Apify Store](https://apify.com/store?fpr=docling). Learn more about Actors in the [Apify Whitepaper](https://whitepaper.actor?fpr=docling).
## Table of Contents
1. [Features](#features)
2. [Usage](#usage)
3. [Input Parameters](#input-parameters)
4. [Output](#output)
5. [Performance & Resources](#performance--resources)
6. [Troubleshooting](#troubleshooting)
7. [Local Development](#local-development)
8. [Architecture](#architecture)
9. [License](#license)
10. [Acknowledgments](#acknowledgments)
11. [Security Considerations](#security-considerations)
## Features
- Leverages the official docling-serve-cpu Docker image for efficient document processing
- Processes multiple document formats:
- PDF documents (scanned or digital)
- Microsoft Office files (DOCX, XLSX, PPTX)
- Images (PNG, JPG, TIFF)
- Other text-based formats
- Provides OCR capabilities for scanned documents
- Exports to multiple formats:
- Markdown
- JSON
- HTML
- Plain Text
- DocTags (structured format)
- No local setup needed—just provide input via a simple JSON config
## Usage
### Using Apify Console
1. Go to the Apify Actor page.
2. Click "Run".
3. In the input form, fill in:
- The URL of the document.
- Output format (`md`, `json`, `html`, `text`, or `doctags`).
- OCR boolean toggle.
4. The Actor will run and produce its outputs in the default key-value store under the key `OUTPUT`.
### Using Apify API
```bash
curl --request POST \
--url "https://api.apify.com/v2/acts/vancura~docling/run" \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer YOUR_API_TOKEN' \
--data '{
"options": {
"to_formats": ["md", "json", "html", "text", "doctags"]
},
"http_sources": [
{"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
{"url": "https://arxiv.org/pdf/2408.09869"}
]
}'
```
### Using Apify CLI
```bash
apify call vancura/docling --input='{
"options": {
"to_formats": ["md", "json", "html", "text", "doctags"]
},
"http_sources": [
{"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
{"url": "https://arxiv.org/pdf/2408.09869"}
]
}'
```
## Input Parameters
The Actor accepts a JSON schema matching the file `.actor/input_schema.json`. Below is a summary of the fields:
| Field | Type | Required | Default | Description |
|----------------|---------|----------|----------|-------------------------------------------------------------------------------|
| `http_sources` | array  | Yes      | None     | See the [docling-serve URL endpoint docs](https://github.com/DS4SD/docling-serve?tab=readme-ov-file#url-endpoint) |
| `options`      | object  | No       | None     | See the [docling-serve common parameters docs](https://github.com/DS4SD/docling-serve?tab=readme-ov-file#common-parameters) |
### Example Input
```json
{
"options": {
"to_formats": ["md", "json", "html", "text", "doctags"]
},
"http_sources": [
{"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
{"url": "https://arxiv.org/pdf/2408.09869"}
]
}
```
## Output
The Actor provides three types of outputs:
1. **Processed Documents in a ZIP** - The Actor will provide the direct URL to your result in the run log, looking like:
```text
You can find your results at: 'https://api.apify.com/v2/key-value-stores/[YOUR_STORE_ID]/records/OUTPUT'
```
2. **Processing Log** - Available in the key-value store as `DOCLING_LOG`
3. **Dataset Record** - Contains processing metadata with:
- Direct link to the processed output zip file
- Processing status
You can access the results in several ways:
1. **Direct URL** (shown in Actor run logs):
```text
https://api.apify.com/v2/key-value-stores/[STORE_ID]/records/OUTPUT
```
2. **Programmatically** via Apify CLI:
```bash
apify key-value-stores get-value OUTPUT
```
3. **Dataset** - Check the "Dataset" tab in the Actor run details to see processing metadata
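For example, once the run has finished, the ZIP can be fetched straight from that URL (a minimal sketch; replace `[STORE_ID]` with the store ID shown in your run log):

```bash
# Download the processed documents ZIP from the run's default key-value store
curl -o output.zip "https://api.apify.com/v2/key-value-stores/[STORE_ID]/records/OUTPUT"
```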
### Example Outputs
#### Markdown (md)
```markdown
# Document Title
## Section 1
Content of section 1...
## Section 2
Content of section 2...
```
#### JSON
```json
{
"title": "Document Title",
"sections": [
{
"level": 1,
"title": "Section 1",
"content": "Content of section 1..."
}
]
}
```
#### HTML
```html
<h1>Document Title</h1>
<h2>Section 1</h2>
<p>Content of section 1...</p>
```
### Processing Logs (`DOCLING_LOG`)
The Actor maintains detailed processing logs including:
- API request and response details
- Processing steps and timing
- Error messages and stack traces
- Input validation results
Access logs via:
```bash
apify key-value-stores get-record DOCLING_LOG
```
## Performance & Resources
- **Docker Image Size**: ~4GB
- **Memory Requirements**:
- Minimum: 2 GB RAM
- Recommended: 4 GB RAM for large or complex documents
- **Processing Time**:
- Simple documents: 15-30 seconds
- Complex PDFs with OCR: 1-3 minutes
- Large documents (100+ pages): 3-10 minutes
## Troubleshooting
Common issues and solutions:
1. **Document URL Not Accessible**
- Ensure the URL is publicly accessible
- Check if the document requires authentication
- Verify the URL leads directly to the document
2. **OCR Processing Fails**
- Verify the document is not password-protected
- Check if the image quality is sufficient
- Try processing with OCR disabled
3. **API Response Issues**
- Check the logs for detailed error messages
- Ensure the document format is supported
- Verify the URL is correctly formatted
4. **Output Format Issues**
- Verify the output format is supported
- Check if the document structure is compatible
- Review the `DOCLING_LOG` for specific errors
### Error Handling
The Actor implements comprehensive error handling:
- Detailed error messages in `DOCLING_LOG`
- Proper exit codes for different failure scenarios
- Automatic cleanup on failure
- Dataset records with processing status
## Local Development
If you wish to develop or modify this Actor locally:
1. Clone the repository.
2. Ensure Docker is installed.
3. The Actor files are located in the `.actor` directory:
- `Dockerfile` - Defines the container environment
- `actor.json` - Actor configuration and metadata
- `actor.sh` - Main execution script that starts the docling-serve API and orchestrates document processing
- `input_schema.json` - Input parameter definitions
- `dataset_schema.json` - Dataset output format definition
- `CHANGELOG.md` - Change log documenting all notable changes
- `README.md` - This documentation
4. Run the Actor locally using:
```bash
apify run
```
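If you want to inspect or build the container image directly, here is a minimal sketch (run from the repository root so the `.actor/*` COPY paths resolve; the image tag is an arbitrary choice):

```bash
# Build the Actor image using the Dockerfile in .actor/
docker build -f .actor/Dockerfile -t docling-actor .
```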
### Actor Structure
```text
.actor/
├── Dockerfile # Container definition
├── actor.json # Actor metadata
├── actor.sh # Execution script (also starts docling-serve API)
├── input_schema.json # Input parameters
├── dataset_schema.json # Dataset output format definition
├── docling_processor.py # Python script for API communication
├── CHANGELOG.md # Version history and changes
└── README.md # This documentation
```
## Architecture
This Actor uses a lightweight architecture based on the official `quay.io/ds4sd/docling-serve-cpu` Docker image:
- **Base Image**: `quay.io/ds4sd/docling-serve-cpu:latest` (~4GB)
- **Multi-Stage Build**: Uses a multi-stage Docker build to include only necessary tools
- **API Communication**: Uses the RESTful API provided by docling-serve
- **Request Flow**:
1. The actor script starts the docling-serve API on port 5001
2. Performs health checks to ensure the API is running
3. Processes the input parameters
4. Creates a JSON payload for the docling-serve API with proper format:
```json
{
"options": {
"to_formats": ["md"],
"do_ocr": true
},
"http_sources": [{"url": "https://example.com/document.pdf"}]
}
```
   5. Makes a POST request to the `/v1alpha/convert/source` endpoint (see the sketch after this list)
6. Processes the response and stores it in the key-value store
- **Dependencies**:
- Node.js for Apify CLI
- Essential tools (curl, jq, etc.) copied from build stage
- **Security**: Runs as a non-root user for enhanced security
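As a rough illustration of step 5 above, the request sent from inside the container looks roughly like this (a sketch only; the document URL is a placeholder, and `actor.sh` builds the actual payload from the Actor input):

```bash
# Sketch of the conversion request actor.sh sends to the local docling-serve API
curl -s -X POST "http://localhost:5001/v1alpha/convert/source" \
  -H "Content-Type: application/json" \
  --data '{
    "options": { "to_formats": ["md"], "return_as_file": true },
    "http_sources": [{ "url": "https://example.com/document.pdf" }]
  }' \
  -o output.zip
```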
## License
This wrapper project is under the MIT License, matching the original Docling license. See [LICENSE](../LICENSE) for details.
## Acknowledgments
- [Docling](https://ds4sd.github.io/docling/) and [docling-serve-cpu](https://quay.io/repository/ds4sd/docling-serve-cpu) by IBM
- [Apify](https://apify.com/?fpr=docling) for the serverless actor environment
## Security Considerations
- Actor runs under a non-root user for enhanced security
- Input URLs are validated before processing
- Temporary files are securely managed and cleaned up
- Process isolation through Docker containerization
- Secure handling of processing artifacts

11
.actor/actor.json Normal file

@ -0,0 +1,11 @@
{
"actorSpecification": 1,
"name": "docling",
"version": "0.0",
"environmentVariables": {},
"dockerFile": "./Dockerfile",
"input": "./input_schema.json",
"scripts": {
"run": "./actor.sh"
}
}

419
.actor/actor.sh Executable file

@ -0,0 +1,419 @@
#!/bin/bash
export PATH=$PATH:/build-files/node_modules/.bin
# Function to upload content to the key-value store
upload_to_kvs() {
local content_file="$1"
local key_name="$2"
local content_type="$3"
local description="$4"
# Find the Apify CLI command
find_apify_cmd
local apify_cmd="$FOUND_APIFY_CMD"
if [ -n "$apify_cmd" ]; then
echo "Uploading $description to key-value store (key: $key_name)..."
# Create a temporary home directory with write permissions
setup_temp_environment
# Use the --no-update-notifier flag if available
if $apify_cmd --help | grep -q "\--no-update-notifier"; then
if $apify_cmd --no-update-notifier actor:set-value "$key_name" --contentType "$content_type" < "$content_file"; then
echo "Successfully uploaded $description to key-value store"
local url="https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/$key_name"
echo "$description available at: $url"
cleanup_temp_environment
return 0
fi
else
# Fall back to regular command if flag isn't available
if $apify_cmd actor:set-value "$key_name" --contentType "$content_type" < "$content_file"; then
echo "Successfully uploaded $description to key-value store"
local url="https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/$key_name"
echo "$description available at: $url"
cleanup_temp_environment
return 0
fi
fi
echo "ERROR: Failed to upload $description to key-value store"
cleanup_temp_environment
return 1
else
echo "ERROR: Apify CLI not found for $description upload"
return 1
fi
}
# Function to find Apify CLI command
find_apify_cmd() {
FOUND_APIFY_CMD=""
for cmd in "apify" "actor" "/usr/local/bin/apify" "/usr/bin/apify" "/opt/apify/cli/bin/apify"; do
if command -v "$cmd" &> /dev/null; then
FOUND_APIFY_CMD="$cmd"
break
fi
done
}
# Function to set up temporary environment for Apify CLI
setup_temp_environment() {
export TMPDIR="/tmp/apify-home-${RANDOM}"
mkdir -p "$TMPDIR"
export APIFY_DISABLE_VERSION_CHECK=1
export NODE_OPTIONS="--no-warnings"
export HOME="$TMPDIR" # Override home directory to writable location
}
# Function to clean up temporary environment
cleanup_temp_environment() {
rm -rf "$TMPDIR" 2>/dev/null || true
}
# Function to push data to Apify dataset
push_to_dataset() {
# Example usage: push_to_dataset "$RESULT_URL" "$OUTPUT_SIZE" "zip"
local result_url="$1"
local size="$2"
local format="$3"
# Find Apify CLI command
find_apify_cmd
local apify_cmd="$FOUND_APIFY_CMD"
if [ -n "$apify_cmd" ]; then
echo "Adding record to dataset..."
setup_temp_environment
# Use the --no-update-notifier flag if available
if $apify_cmd --help | grep -q "\--no-update-notifier"; then
if $apify_cmd --no-update-notifier actor:push-data "{\"output_file\": \"${result_url}\", \"format\": \"${format}\", \"size\": \"${size}\", \"status\": \"success\"}"; then
echo "Successfully added record to dataset"
else
echo "Warning: Failed to add record to dataset"
fi
else
# Fall back to regular command
if $apify_cmd actor:push-data "{\"output_file\": \"${result_url}\", \"format\": \"${format}\", \"size\": \"${size}\", \"status\": \"success\"}"; then
echo "Successfully added record to dataset"
else
echo "Warning: Failed to add record to dataset"
fi
fi
cleanup_temp_environment
fi
}
# --- Setup logging and error handling ---
LOG_FILE="/tmp/docling.log"
touch "$LOG_FILE" || {
echo "Fatal: Cannot create log file at $LOG_FILE"
exit 1
}
# Log to both console and file
exec 1> >(tee -a "$LOG_FILE")
exec 2> >(tee -a "$LOG_FILE" >&2)
# Exit codes
readonly ERR_API_UNAVAILABLE=15
readonly ERR_INVALID_INPUT=16
# --- Debug environment ---
echo "Date: $(date)"
echo "Python version: $(python --version 2>&1)"
echo "Docling-serve path: $(which docling-serve 2>/dev/null || echo 'Not found')"
echo "Working directory: $(pwd)"
# --- Get input ---
echo "Getting Apify Actor Input"
INPUT=$(apify actor get-input 2>/dev/null)
# --- Setup tools ---
echo "Setting up tools..."
TOOLS_DIR="/tmp/docling-tools"
mkdir -p "$TOOLS_DIR"
# Copy tools if available
if [ -d "/build-files" ]; then
echo "Copying tools from /build-files..."
cp -r /build-files/* "$TOOLS_DIR/"
export PATH="$TOOLS_DIR/bin:$PATH"
else
echo "Warning: No build files directory found. Some tools may be unavailable."
fi
# Copy Python processor script to tools directory
PYTHON_SCRIPT_PATH="$(dirname "$0")/docling_processor.py"
if [ -f "$PYTHON_SCRIPT_PATH" ]; then
echo "Copying Python processor script to tools directory..."
cp "$PYTHON_SCRIPT_PATH" "$TOOLS_DIR/"
chmod +x "$TOOLS_DIR/docling_processor.py"
else
echo "ERROR: Python processor script not found at $PYTHON_SCRIPT_PATH"
exit 1
fi
# Check OCR directories and ensure they're writable
echo "Checking OCR directory permissions..."
OCR_DIR="/opt/app-root/src/.EasyOCR"
if [ -d "$OCR_DIR" ]; then
# Test if we can write to the directory
if touch "$OCR_DIR/test_write" 2>/dev/null; then
echo "[✓] OCR directory is writable"
rm "$OCR_DIR/test_write"
else
echo "[✗] OCR directory is not writable, setting up alternative in /tmp"
# Create alternative in /tmp (which is writable)
mkdir -p "/tmp/.EasyOCR/user_network"
export EASYOCR_MODULE_PATH="/tmp/.EasyOCR"
fi
else
echo "OCR directory not found, creating in /tmp"
mkdir -p "/tmp/.EasyOCR/user_network"
export EASYOCR_MODULE_PATH="/tmp/.EasyOCR"
fi
# --- Starting the API ---
echo "Starting docling-serve API..."
# Create a dedicated working directory in /tmp (writable)
API_DIR="/tmp/docling-api"
mkdir -p "$API_DIR"
cd "$API_DIR"
echo "API working directory: $(pwd)"
# Find docling-serve executable
DOCLING_SERVE_PATH=$(which docling-serve)
echo "Docling-serve executable: $DOCLING_SERVE_PATH"
# Start the API with minimal parameters to avoid any issues
echo "Starting docling-serve API..."
"$DOCLING_SERVE_PATH" run --host 0.0.0.0 --port 5001 > "$API_DIR/docling-serve.log" 2>&1 &
API_PID=$!
echo "Started docling-serve API with PID: $API_PID"
# A more reliable wait for API startup
echo "Waiting for API to initialize..."
MAX_TRIES=30
tries=0
started=false
while [ $tries -lt $MAX_TRIES ]; do
tries=$((tries + 1))
# Check if process is still running
if ! ps -p $API_PID > /dev/null; then
echo "ERROR: docling-serve API process terminated unexpectedly after $tries seconds"
break
fi
# Check log for startup completion or errors
if grep -q "Application startup complete" "$API_DIR/docling-serve.log" 2>/dev/null; then
echo "[✓] API startup completed successfully after $tries seconds"
started=true
break
fi
if grep -q "Permission denied\|PermissionError" "$API_DIR/docling-serve.log" 2>/dev/null; then
echo "ERROR: Permission errors detected in API startup"
break
fi
# Sleep and check again
sleep 1
# Output a progress indicator every 5 seconds
if [ $((tries % 5)) -eq 0 ]; then
echo "Still waiting for API startup... ($tries/$MAX_TRIES seconds)"
fi
done
# Show log content regardless of outcome
echo "docling-serve log output so far:"
tail -n 20 "$API_DIR/docling-serve.log"
# Verify the API is running
if ! ps -p $API_PID > /dev/null; then
echo "ERROR: docling-serve API failed to start"
if [ -f "$API_DIR/docling-serve.log" ]; then
echo "Full log output:"
cat "$API_DIR/docling-serve.log"
fi
exit $ERR_API_UNAVAILABLE
fi
if [ "$started" != "true" ]; then
echo "WARNING: API process is running but startup completion was not detected"
echo "Will attempt to continue anyway..."
fi
# Try to verify API is responding at this point
echo "Verifying API responsiveness..."
(python -c "
import sys, time, socket
for i in range(5):
try:
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(1)
result = s.connect_ex(('localhost', 5001))
if result == 0:
s.close()
print('Port 5001 is open and accepting connections')
sys.exit(0)
s.close()
except Exception as e:
pass
time.sleep(1)
print('Could not connect to API port after 5 attempts')
sys.exit(1)
" && echo "API verification succeeded") || echo "API verification failed, but continuing anyway"
# Define API endpoint
DOCLING_API_ENDPOINT="http://localhost:5001/v1alpha/convert/source"
# --- Processing document ---
echo "Starting document processing..."
echo "Reading input from Apify..."
echo "Input content:" >&2
echo "$INPUT" >&2 # Send the raw input to stderr for debugging
echo "$INPUT" # Send the clean JSON to stdout for processing
# Create the request JSON
REQUEST_JSON=$(echo "$INPUT" | jq '.options += {"return_as_file": true}')
echo "Creating request JSON:" >&2
echo "$REQUEST_JSON" >&2
echo "$REQUEST_JSON" > "$API_DIR/request.json"
# Send the conversion request using our Python script
#echo "Sending conversion request to docling-serve API..."
#python "$TOOLS_DIR/docling_processor.py" \
# --api-endpoint "$DOCLING_API_ENDPOINT" \
# --request-json "$API_DIR/request.json" \
# --output-dir "$API_DIR" \
# --output-format "$OUTPUT_FORMAT"
echo "Curl the Docling API"
curl -s -H "content-type: application/json" -X POST --data-binary @"$API_DIR/request.json" -o "$API_DIR/output.zip" "$DOCLING_API_ENDPOINT"
CURL_EXIT_CODE=$?
# --- Check for various potential output files ---
echo "Checking for output files..."
if [ -f "$API_DIR/output.zip" ]; then
echo "Conversion completed successfully! Output file found."
# Get content from the converted file
OUTPUT_SIZE=$(wc -c < "$API_DIR/output.zip")
echo "Output file found with size: $OUTPUT_SIZE bytes"
# Calculate the access URL for result display
RESULT_URL="https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/OUTPUT"
echo "=============================="
echo "PROCESSING COMPLETE!"
echo "Output size: ${OUTPUT_SIZE} bytes"
echo "=============================="
# Set the output content type based on format
CONTENT_TYPE="application/zip"
# Upload the document content using our function
upload_to_kvs "$API_DIR/output.zip" "OUTPUT" "$CONTENT_TYPE" "Document content"
# Only proceed with dataset record if document upload succeeded
if [ $? -eq 0 ]; then
echo "Your document is available at: ${RESULT_URL}"
echo "=============================="
# Push data to dataset
push_to_dataset "$RESULT_URL" "$OUTPUT_SIZE" "zip"
fi
else
echo "ERROR: No converted output file found at $API_DIR/output.zip"
# Create error metadata
ERROR_METADATA="{\"status\":\"error\",\"error\":\"No converted output file found\",\"documentUrl\":\"$DOCUMENT_URL\"}"
echo "$ERROR_METADATA" > "/tmp/actor-output/OUTPUT"
chmod 644 "/tmp/actor-output/OUTPUT"
echo "Error information has been saved to /tmp/actor-output/OUTPUT"
fi
# --- Verify output files for debugging ---
echo "=== Final Output Verification ==="
echo "Files in /tmp/actor-output:"
ls -la /tmp/actor-output/ 2>/dev/null || echo "Cannot list /tmp/actor-output/"
echo "All operations completed. The output should be available in the default key-value store."
echo "Content URL: ${RESULT_URL:-No URL available}"
# --- Cleanup function ---
cleanup() {
echo "Running cleanup..."
# Stop the API process
if [ -n "$API_PID" ]; then
echo "Stopping docling-serve API (PID: $API_PID)..."
kill $API_PID 2>/dev/null || true
fi
# Export log file to KVS if it exists
# DO THIS BEFORE REMOVING TOOLS DIRECTORY
if [ -f "$LOG_FILE" ]; then
if [ -s "$LOG_FILE" ]; then
echo "Log file is not empty, pushing to key-value store (key: LOG)..."
# Upload log using our function
upload_to_kvs "$LOG_FILE" "LOG" "text/plain" "Log file"
else
echo "Warning: log file exists but is empty"
fi
else
echo "Warning: No log file found"
fi
# Clean up temporary files AFTER log is uploaded
echo "Cleaning up temporary files..."
if [ -d "$API_DIR" ]; then
echo "Removing API working directory: $API_DIR"
rm -rf "$API_DIR" 2>/dev/null || echo "Warning: Failed to remove $API_DIR"
fi
if [ -d "$TOOLS_DIR" ]; then
echo "Removing tools directory: $TOOLS_DIR"
rm -rf "$TOOLS_DIR" 2>/dev/null || echo "Warning: Failed to remove $TOOLS_DIR"
fi
# Keep log file until the very end
echo "Script execution completed at $(date)"
echo "Actor execution completed"
}
# Register cleanup
trap cleanup EXIT

31
.actor/dataset_schema.json Normal file

@ -0,0 +1,31 @@
{
"title": "Docling Actor Dataset",
"description": "Records of document processing results from the Docling Actor",
"type": "object",
"schemaVersion": 1,
"properties": {
"url": {
"title": "Document URL",
"type": "string",
"description": "URL of the processed document"
},
"output_file": {
"title": "Result URL",
"type": "string",
"description": "Direct URL to the processed result in key-value store"
},
"status": {
"title": "Processing Status",
"type": "string",
"description": "Status of the document processing",
"enum": ["success", "error"]
},
"error": {
"title": "Error Details",
"type": "string",
"description": "Error message if processing failed",
"optional": true
}
},
"required": ["url", "output_file", "status"]
}

27
.actor/input_schema.json Normal file

@ -0,0 +1,27 @@
{
"title": "Docling Actor Input",
"description": "Options for processing documents with Docling via the docling-serve API.",
"type": "object",
"schemaVersion": 1,
"properties": {
"http_sources": {
"title": "Document URLs",
"type": "array",
"description": "URLs of documents to process. Supported formats: PDF, DOCX, PPTX, XLSX, HTML, MD, XML, images, and more.",
"editor": "json",
"prefill": [
{ "url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf" }
]
},
"options": {
"title": "Processing Options",
"type": "object",
"description": "Document processing configuration options",
"editor": "json",
"prefill": {
"to_formats": ["md"]
}
}
},
"required": ["options", "http_sources"]
}

CHANGELOG.md

@ -1,3 +1,40 @@
## [v2.28.0](https://github.com/docling-project/docling/releases/tag/v2.28.0) - 2025-03-19
### Feature
* **SmolDocling:** Support MLX acceleration in VLM pipeline ([#1199](https://github.com/docling-project/docling/issues/1199)) ([`1c26769`](https://github.com/docling-project/docling/commit/1c26769785bcd17c0b8b621c5182ad81134d3915))
* Add PPTX notes slides ([#474](https://github.com/docling-project/docling/issues/474)) ([`b454aa1`](https://github.com/docling-project/docling/commit/b454aa1551b891644ce4028ed2d7ec8f82c167ab))
* Updated vlm pipeline (with latest changes from docling-core) ([#1158](https://github.com/docling-project/docling/issues/1158)) ([`2f72167`](https://github.com/docling-project/docling/commit/2f72167ff6421424dea4d93018b0d43af16ec153))
### Fix
* Determine correct page size in DoclingParseV4Backend ([#1196](https://github.com/docling-project/docling/issues/1196)) ([`f5adfb9`](https://github.com/docling-project/docling/commit/f5adfb9724aae1207f23e21d74033f331e6e1ffb))
* **msword:** Fixing function return in equations handling ([#1194](https://github.com/docling-project/docling/issues/1194)) ([`0b707d0`](https://github.com/docling-project/docling/commit/0b707d0882f5be42505871799387d0b1882bffbf))
### Documentation
* Linux Foundation AI & Data ([#1183](https://github.com/docling-project/docling/issues/1183)) ([`1d680b0`](https://github.com/docling-project/docling/commit/1d680b0a321d95fc6bd65b7bb4d5e15005a0250a))
* Move apify to docs ([#1182](https://github.com/docling-project/docling/issues/1182)) ([`54a78c3`](https://github.com/docling-project/docling/commit/54a78c307de833b93f9b84cf1f8ed6dace8573cb))
## [v2.27.0](https://github.com/docling-project/docling/releases/tag/v2.27.0) - 2025-03-18
### Feature
* Add factory for ocr engines via plugins ([#1010](https://github.com/docling-project/docling/issues/1010)) ([`6eaae3c`](https://github.com/docling-project/docling/commit/6eaae3cba034599020dc06ebdad3bc3ff0b5a8eb))
* Add DoclingParseV4 backend, using high-level docling-parse API ([#905](https://github.com/docling-project/docling/issues/905)) ([`3960b19`](https://github.com/docling-project/docling/commit/3960b199d63d0e9d660aeb0cbced02b38bb0b593))
* **actor:** Docling Actor on Apify infrastructure ([#875](https://github.com/docling-project/docling/issues/875)) ([`772487f`](https://github.com/docling-project/docling/commit/772487f9c91ad2ee53c591c314c72443f9cbfd23))
* Equations to latex in MSWord backend (with inline groups) ([#1114](https://github.com/docling-project/docling/issues/1114)) ([`6eb718f`](https://github.com/docling-project/docling/commit/6eb718f8493038d1b4b6ae836df5a24aa13cd17e))
### Fix
* **html:** Handle nested empty lists ([#1154](https://github.com/docling-project/docling/issues/1154)) ([`f94da44`](https://github.com/docling-project/docling/commit/f94da44ec5c7a8c92b9dd60e4df5dc945ed6d1ea))
* Use first table row as col headers ([#1156](https://github.com/docling-project/docling/issues/1156)) ([`0945973`](https://github.com/docling-project/docling/commit/0945973b79d67b74281aba5102ee985ac1de74ea))
* Pass tests, update docling-core to 2.22.0 ([#1150](https://github.com/docling-project/docling/issues/1150)) ([`aa92a57`](https://github.com/docling-project/docling/commit/aa92a57fa9e7228e894efb9050a0cdb9f287ebfd))
### Documentation
* Fix spelling of picture in usage ([#1165](https://github.com/docling-project/docling/issues/1165)) ([`7e01798`](https://github.com/docling-project/docling/commit/7e01798417c424c05685e0ff5f6f89f70dc3bfcd))
## [v2.26.0](https://github.com/docling-project/docling/releases/tag/v2.26.0) - 2025-03-11
### Feature

CODE_OF_CONDUCT.md

@ -1,129 +1,3 @@
# Contributor Covenant Code of Conduct
## Our Pledge
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.
## Our Standards
Examples of behavior that contributes to a positive environment for our
community include:
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
overall community
Examples of unacceptable behavior include:
* The use of sexualized language or imagery, and sexual attention or
advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Enforcement Responsibilities
Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.
Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.
## Scope
This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement using
[deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).
All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the
reporter of any incident.
## Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:
### 1. Correction
**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.
**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.
### 2. Warning
**Community Impact**: A violation through a single incident or series
of actions.
**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.
### 3. Temporary Ban
**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.
**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.
### 4. Permanent Ban
**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.
**Consequence**: A permanent ban from any sort of public interaction within
the community.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.0, available at
[https://www.contributor-covenant.org/version/2/0/code_of_conduct.html](https://www.contributor-covenant.org/version/2/0/code_of_conduct.html).
Community Impact Guidelines were inspired by [Mozilla's code of conduct
enforcement ladder](https://github.com/mozilla/diversity).
Homepage: [https://www.contributor-covenant.org](https://www.contributor-covenant.org)
For answers to common questions about this code of conduct, see the FAQ at
[https://www.contributor-covenant.org/faq](https://www.contributor-covenant.org/faq). Translations are available at
[https://www.contributor-covenant.org/translations](https://www.contributor-covenant.org/translations).
This project adheres to the [Docling - Code of Conduct and Covenant](https://github.com/docling-project/community/blob/main/CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code.

CONTRIBUTING.md

@ -2,85 +2,7 @@
Our project welcomes external contributions. If you have an itch, please feel
free to scratch it.
To contribute code or documentation, please submit a [pull request](https://github.com/docling-project/docling/pulls).
A good way to familiarize yourself with the codebase and contribution process is
to look for and tackle low-hanging fruit in the [issue tracker](https://github.com/docling-project/docling/issues).
Before embarking on a more ambitious contribution, please quickly [get in touch](#communication) with us.
For general questions or support requests, please refer to the [discussion section](https://github.com/docling-project/docling/discussions).
**Note: We appreciate your effort and want to avoid situations where a contribution
requires extensive rework (by you or by us), sits in the backlog for a long time, or
cannot be accepted at all!**
### Proposing New Features
If you would like to implement a new feature, please [raise an issue](https://github.com/docling-project/docling/issues)
before sending a pull request so the feature can be discussed. This is to avoid
you spending valuable time working on a feature that the project developers
are not interested in accepting into the codebase.
### Fixing Bugs
If you would like to fix a bug, please [raise an issue](https://github.com/docling-project/docling/issues) before sending a
pull request so it can be tracked.
### Merge Approval
The project maintainers use LGTM (Looks Good To Me) in comments on the code
review to indicate acceptance. A change requires LGTMs from two of the
maintainers of each component affected.
For a list of the maintainers, see the [MAINTAINERS.md](MAINTAINERS.md) page.
## Legal
Each source file must include a license header for the MIT
Software. Using the SPDX format is the simplest approach,
e.g.
```
/*
Copyright IBM Inc. All rights reserved.
SPDX-License-Identifier: MIT
*/
```
We have tried to make it as easy as possible to make contributions. This
applies to how we handle the legal aspects of contribution. We use the
same approach - the [Developer's Certificate of Origin 1.1 (DCO)](https://github.com/hyperledger/fabric/blob/master/docs/source/DCO1.1.txt) - that the Linux® Kernel [community](https://elinux.org/Developer_Certificate_Of_Origin)
uses to manage code contributions.
We simply ask that when submitting a patch for review, the developer
must include a sign-off statement in the commit message.
Here is an example Signed-off-by line, which indicates that the
submitter accepts the DCO:
```
Signed-off-by: John Doe <john.doe@example.com>
```
You can include this automatically when you commit a change to your
local git repository using the following command:
```
git commit -s
```
### New dependencies
This project strictly adheres to using dependencies that are compatible with the MIT license to ensure maximum flexibility and permissiveness in its usage and distribution. As a result, dependencies licensed under restrictive terms such as GPL, LGPL, AGPL, or similar are explicitly excluded. These licenses impose additional requirements and limitations that are incompatible with the MIT license's minimal restrictions, potentially affecting derivative works and redistribution. By maintaining this policy, the project ensures simplicity and freedom for both developers and users, avoiding conflicts with stricter copyleft provisions.
## Communication
Please feel free to connect with us using the [discussion section](https://github.com/docling-project/docling/discussions).
For more details on the contributing guidelines head to the Docling Project [community repository](https://github.com/docling-project/community).
## Developing

Dockerfile

@ -4,7 +4,7 @@ ENV GIT_SSH_COMMAND="ssh -o StrictHostKeyChecking=no"
RUN apt-get update \
&& apt-get install -y libgl1 libglib2.0-0 curl wget git procps \
&& apt-get clean
&& rm -rf /var/lib/apt/lists/*
# This will install torch with *only* cpu support
# Remove the --extra-index-url part if you want to install all the gpu requirements

MAINTAINERS.md

@ -2,9 +2,6 @@
- Christoph Auer - [@cau-git](https://github.com/cau-git)
- Michele Dolfi - [@dolfim-ibm](https://github.com/dolfim-ibm)
- Maxim Lysak - [@maxmnemonic](https://github.com/maxmnemonic)
- Nikos Livathinos - [@nikos-livathinos](https://github.com/nikos-livathinos)
- Ahmed Nassar - [@nassarofficial](https://github.com/nassarofficial)
- Panos Vagenas - [@vagenas](https://github.com/vagenas)
- Peter Staar - [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)

README.md

@ -21,6 +21,8 @@
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![License MIT](https://img.shields.io/github/license/docling-project/docling)](https://opensource.org/licenses/MIT)
[![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
[![Docling Actor](https://apify.com/actor-badge?actor=vancura/docling?fpr=docling)](https://apify.com/vancura/docling)
[![LF AI & Data](https://img.shields.io/badge/LF%20AI%20%26%20Data-003778?logo=linuxfoundation&logoColor=fff&color=0094ff&labelColor=003778)](https://lfaidata.foundation/projects/)
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
@ -33,12 +35,12 @@ Docling simplifies document processing, parsing diverse formats — including ad
* 🔒 Local execution capabilities for sensitive data and air-gapped environments
* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 Extensive OCR support for scanned PDFs and images
* 🥚 Support of Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview)) 🆕
* 💻 Simple and convenient CLI
### Coming soon
* 📝 Metadata extraction, including title, authors, references & language
* 📝 Inclusion of Visual Language Models ([SmolDocling](https://huggingface.co/blog/smolervlm#smoldocling))
* 📝 Chart understanding (Barchart, Piechart, LinePlot, etc)
* 📝 Complex chemistry understanding (Molecular structures)
@ -55,7 +57,7 @@ More [detailed installation instructions](https://docling-project.github.io/docl
## Getting started
To convert individual documents, use `convert()`, for example:
To convert individual documents with python, use `convert()`, for example:
```python
from docling.document_converter import DocumentConverter
@ -69,6 +71,22 @@ print(result.document.export_to_markdown()) # output: "## Docling Technical Rep
More [advanced usage options](https://docling-project.github.io/docling/usage/) are available in
the docs.
## CLI
Docling has a built-in CLI to run conversions.
```bash
docling https://arxiv.org/pdf/2206.01062
```
You can also use 🥚[SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview) and other VLMs via Docling CLI:
```bash
docling --pipeline vlm --vlm-model smoldocling https://arxiv.org/pdf/2206.01062
```
This will use MLX acceleration on supported Apple Silicon hardware.
Read more [here](https://docling-project.github.io/docling/usage/)
## Documentation
Check out Docling's [documentation](https://docling-project.github.io/docling/), for details on
@ -119,9 +137,13 @@ If you use Docling in your projects, please consider citing the following:
The Docling codebase is under MIT license.
For individual model usage, please refer to the model licenses found in the original packages.
## IBM ❤️ Open Source AI
## LF AI & Data
Docling has been brought to you by IBM.
Docling is hosted as a project in the [LF AI & Data Foundation](https://lfaidata.foundation/projects/).
### IBM ❤️ Open Source AI
The project was started by the AI for knowledge team at IBM Research Zurich.
[supported_formats]: https://docling-project.github.io/docling/usage/supported_formats/
[docling_document]: https://docling-project.github.io/docling/concepts/docling_document/

docling/backend/docling_parse_v4_backend.py

@ -112,6 +112,7 @@ class DoclingParseV4PageBackend(PdfPageBackend):
padbox.r = page_size.width - padbox.r
padbox.t = page_size.height - padbox.t
with pypdfium2_lock:
image = (
self._ppage.render(
scale=scale * 1.5,
@ -119,16 +120,22 @@ class DoclingParseV4PageBackend(PdfPageBackend):
crop=padbox.as_tuple(),
)
.to_pil()
.resize(size=(round(cropbox.width * scale), round(cropbox.height * scale)))
.resize(
size=(round(cropbox.width * scale), round(cropbox.height * scale))
)
) # We resize the image from 1.5x the given scale to make it sharper.
return image
def get_size(self) -> Size:
return Size(
width=self._dpage.dimension.width,
height=self._dpage.dimension.height,
)
with pypdfium2_lock:
return Size(width=self._ppage.get_width(), height=self._ppage.get_height())
# TODO: Take width and height from docling-parse.
# return Size(
# width=self._dpage.dimension.width,
# height=self._dpage.dimension.height,
# )
def unload(self):
self._ppage = None

docling/backend/mspowerpoint_backend.py

@ -16,6 +16,7 @@ from docling_core.types.doc import (
TableCell,
TableData,
)
from docling_core.types.doc.document import ContentLayer
from PIL import Image, UnidentifiedImageError
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE, PP_PLACEHOLDER
@ -421,4 +422,21 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
for shape in slide.shapes:
handle_shapes(shape, parent_slide, slide_ind, doc, slide_size)
# Handle notes slide
if slide.has_notes_slide:
notes_slide = slide.notes_slide
notes_text = notes_slide.notes_text_frame.text.strip()
if notes_text:
bbox = BoundingBox(l=0, t=0, r=0, b=0)
prov = ProvenanceItem(
page_no=slide_ind + 1, charspan=[0, len(notes_text)], bbox=bbox
)
doc.add_text(
label=DocItemLabel.TEXT,
parent=parent_slide,
text=notes_text,
prov=prov,
content_layer=ContentLayer.FURNITURE,
)
return doc

docling/backend/msword_backend.py

@ -53,6 +53,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
self.max_levels: int = 10
self.level_at_new_list: Optional[int] = None
self.parents: dict[int, Optional[NodeItem]] = {}
self.numbered_headers: dict[int, int] = {}
for i in range(-1, self.max_levels):
self.parents[i] = None
@ -275,8 +276,10 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
only_equations.append(latex_equation)
texts_and_equations.append(latex_equation)
if "".join(only_texts) != text:
return text
if "".join(only_texts).strip() != text.strip():
# If we are not able to reconstruct the initial raw text
# do not try to parse equations and return the original
return text, []
return "".join(texts_and_equations), only_equations
@ -344,7 +347,14 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
parent=None, label=DocItemLabel.TITLE, text=text
)
elif "Heading" in p_style_id:
self.add_header(doc, p_level, text)
style_element = getattr(paragraph.style, "element", None)
if style_element:
is_numbered_style = (
"<w:numPr>" in style_element.xml or "<w:numPr>" in element.xml
)
else:
is_numbered_style = False
self.add_header(doc, p_level, text, is_numbered_style)
elif len(equations) > 0:
if (raw_text is None or len(raw_text) == 0) and len(text) > 0:
@ -365,6 +375,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
for eq in equations:
if len(text_tmp) == 0:
break
pre_eq_text = text_tmp.split(eq, maxsplit=1)[0]
text_tmp = text_tmp.split(eq, maxsplit=1)[1]
if len(pre_eq_text) > 0:
@ -412,7 +423,11 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
return
def add_header(
self, doc: DoclingDocument, curr_level: Optional[int], text: str
self,
doc: DoclingDocument,
curr_level: Optional[int],
text: str,
is_numbered_style: bool = False,
) -> None:
level = self.get_level()
if isinstance(curr_level, int):
@ -430,16 +445,43 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
if key >= curr_level:
self.parents[key] = None
self.parents[curr_level] = doc.add_heading(
parent=self.parents[curr_level - 1],
text=text,
level=curr_level,
)
current_level = curr_level
parent_level = curr_level - 1
add_level = curr_level
else:
self.parents[self.level] = doc.add_heading(
parent=self.parents[self.level - 1],
current_level = self.level
parent_level = self.level - 1
add_level = 1
if is_numbered_style:
if add_level in self.numbered_headers:
self.numbered_headers[add_level] += 1
else:
self.numbered_headers[add_level] = 1
text = f"{self.numbered_headers[add_level]} {text}"
# Reset deeper levels
next_level = add_level + 1
while next_level in self.numbered_headers:
self.numbered_headers[next_level] = 0
next_level += 1
# Scan upper levels
previous_level = add_level - 1
while previous_level in self.numbered_headers:
# MSWord convention: no empty sublevels
# I.e., sub-sub section (2.0.1) without a sub-section (2.1)
# is processed as 2.1.1
if self.numbered_headers[previous_level] == 0:
self.numbered_headers[previous_level] += 1
text = f"{self.numbered_headers[previous_level]}.{text}"
previous_level -= 1
self.parents[current_level] = doc.add_heading(
parent=self.parents[parent_level],
text=text,
level=1,
level=add_level,
)
return

docling/cli/main.py

@ -9,6 +9,7 @@ import warnings
from pathlib import Path
from typing import Annotated, Dict, Iterable, List, Optional, Type
import rich.table
import typer
from docling_core.types.doc import ImageRefMode
from docling_core.utils.file import resolve_source_to_path
@ -30,18 +31,22 @@ from docling.datamodel.pipeline_options import (
AcceleratorDevice,
AcceleratorOptions,
EasyOcrOptions,
OcrEngine,
OcrMacOptions,
OcrOptions,
PaginatedPipelineOptions,
PdfBackend,
PdfPipeline,
PdfPipelineOptions,
RapidOcrOptions,
TableFormerMode,
TesseractCliOcrOptions,
TesseractOcrOptions,
VlmModelType,
VlmPipelineOptions,
granite_vision_vlm_conversion_options,
smoldocling_vlm_conversion_options,
smoldocling_vlm_mlx_conversion_options,
)
from docling.datamodel.settings import settings
from docling.document_converter import DocumentConverter, FormatOption, PdfFormatOption
from docling.models.factories import get_ocr_factory
from docling.pipeline.vlm_pipeline import VlmPipeline
warnings.filterwarnings(action="ignore", category=UserWarning, module="pydantic|torch")
warnings.filterwarnings(action="ignore", category=FutureWarning, module="easyocr")
@ -49,8 +54,11 @@ warnings.filterwarnings(action="ignore", category=FutureWarning, module="easyocr
_log = logging.getLogger(__name__)
from rich.console import Console
console = Console()
err_console = Console(stderr=True)
ocr_factory_internal = get_ocr_factory(allow_external_plugins=False)
ocr_engines_enum_internal = ocr_factory_internal.get_enum()
app = typer.Typer(
name="Docling",
@ -78,6 +86,24 @@ def version_callback(value: bool):
raise typer.Exit()
def show_external_plugins_callback(value: bool):
if value:
ocr_factory_all = get_ocr_factory(allow_external_plugins=True)
table = rich.table.Table(title="Available OCR engines")
table.add_column("Name", justify="right")
table.add_column("Plugin")
table.add_column("Package")
for meta in ocr_factory_all.registered_meta.values():
if not meta.module.startswith("docling."):
table.add_row(
f"[bold]{meta.kind}[/bold]",
meta.plugin_name,
meta.module.split(".")[0],
)
rich.print(table)
raise typer.Exit()
def export_documents(
conv_results: Iterable[ConversionResult],
output_dir: Path,
@ -182,6 +208,14 @@ def convert(
help="Image export mode for the document (only in case of JSON, Markdown or HTML). With `placeholder`, only the position of the image is marked in the output. In `embedded` mode, the image is embedded as base64 encoded string. In `referenced` mode, the image is exported in PNG format and referenced from the main exported document.",
),
] = ImageRefMode.EMBEDDED,
pipeline: Annotated[
PdfPipeline,
typer.Option(..., help="Choose the pipeline to process PDF or image files."),
] = PdfPipeline.STANDARD,
vlm_model: Annotated[
VlmModelType,
typer.Option(..., help="Choose the VLM model to use with PDF or image files."),
] = VlmModelType.SMOLDOCLING,
ocr: Annotated[
bool,
typer.Option(
@ -196,8 +230,16 @@ def convert(
),
] = False,
ocr_engine: Annotated[
OcrEngine, typer.Option(..., help="The OCR engine to use.")
] = OcrEngine.EASYOCR,
str,
typer.Option(
...,
help=(
f"The OCR engine to use. When --allow-external-plugins is *not* set, the available values are: "
f"{', '.join((o.value for o in ocr_engines_enum_internal))}. "
f"Use the option --show-external-plugins to see the options allowed with external plugins."
),
),
] = EasyOcrOptions.kind,
ocr_lang: Annotated[
Optional[str],
typer.Option(
@ -241,6 +283,21 @@ def convert(
..., help="Must be enabled when using models connecting to remote services."
),
] = False,
allow_external_plugins: Annotated[
bool,
typer.Option(
..., help="Must be enabled for loading modules from third-party plugins."
),
] = False,
show_external_plugins: Annotated[
bool,
typer.Option(
...,
help="List the third-party plugins which are available when the option --allow-external-plugins is set.",
callback=show_external_plugins_callback,
is_eager=True,
),
] = False,
abort_on_error: Annotated[
bool,
typer.Option(
@ -368,25 +425,22 @@ def convert(
export_txt = OutputFormat.TEXT in to_formats
export_doctags = OutputFormat.DOCTAGS in to_formats
if ocr_engine == OcrEngine.EASYOCR:
ocr_options: OcrOptions = EasyOcrOptions(force_full_page_ocr=force_ocr)
elif ocr_engine == OcrEngine.TESSERACT_CLI:
ocr_options = TesseractCliOcrOptions(force_full_page_ocr=force_ocr)
elif ocr_engine == OcrEngine.TESSERACT:
ocr_options = TesseractOcrOptions(force_full_page_ocr=force_ocr)
elif ocr_engine == OcrEngine.OCRMAC:
ocr_options = OcrMacOptions(force_full_page_ocr=force_ocr)
elif ocr_engine == OcrEngine.RAPIDOCR:
ocr_options = RapidOcrOptions(force_full_page_ocr=force_ocr)
else:
raise RuntimeError(f"Unexpected OCR engine type {ocr_engine}")
ocr_factory = get_ocr_factory(allow_external_plugins=allow_external_plugins)
ocr_options: OcrOptions = ocr_factory.create_options( # type: ignore
kind=ocr_engine,
force_full_page_ocr=force_ocr,
)
ocr_lang_list = _split_list(ocr_lang)
if ocr_lang_list is not None:
ocr_options.lang = ocr_lang_list
accelerator_options = AcceleratorOptions(num_threads=num_threads, device=device)
pipeline_options: PaginatedPipelineOptions
if pipeline == PdfPipeline.STANDARD:
pipeline_options = PdfPipelineOptions(
allow_external_plugins=allow_external_plugins,
enable_remote_services=enable_remote_services,
accelerator_options=accelerator_options,
do_ocr=ocr,
@ -410,9 +464,6 @@ def convert(
)
pipeline_options.images_scale = 2
if artifacts_path is not None:
pipeline_options.artifacts_path = artifacts_path
backend: Type[PdfDocumentBackend]
if pdf_backend == PdfBackend.DLPARSE_V1:
backend = DoclingParseDocumentBackend
@ -429,6 +480,33 @@ def convert(
pipeline_options=pipeline_options,
backend=backend, # pdf_backend
)
elif pipeline == PdfPipeline.VLM:
pipeline_options = VlmPipelineOptions()
if vlm_model == VlmModelType.GRANITE_VISION:
pipeline_options.vlm_options = granite_vision_vlm_conversion_options
elif vlm_model == VlmModelType.SMOLDOCLING:
pipeline_options.vlm_options = smoldocling_vlm_conversion_options
if sys.platform == "darwin":
try:
import mlx_vlm
pipeline_options.vlm_options = (
smoldocling_vlm_mlx_conversion_options
)
except ImportError:
_log.warning(
"To run SmolDocling faster, please install mlx-vlm:\n"
"pip install mlx-vlm"
)
pdf_format_option = PdfFormatOption(
pipeline_cls=VlmPipeline, pipeline_options=pipeline_options
)
if artifacts_path is not None:
pipeline_options.artifacts_path = artifacts_path
format_options: Dict[InputFormat, FormatOption] = {
InputFormat.PDF: pdf_format_option,
InputFormat.IMAGE: pdf_format_option,

View File

@ -1,10 +1,9 @@
import logging
import os
import re
import warnings
from enum import Enum
from pathlib import Path
from typing import Annotated, Any, Dict, List, Literal, Optional, Union
from typing import Any, ClassVar, Dict, List, Literal, Optional, Union
from pydantic import (
AnyUrl,
@ -13,13 +12,8 @@ from pydantic import (
Field,
field_validator,
model_validator,
validator,
)
from pydantic_settings import (
BaseSettings,
PydanticBaseSettingsSource,
SettingsConfigDict,
)
from pydantic_settings import BaseSettings, SettingsConfigDict
from typing_extensions import deprecated
_log = logging.getLogger(__name__)
@ -83,6 +77,12 @@ class AcceleratorOptions(BaseSettings):
return data
class BaseOptions(BaseModel):
"""Base class for options."""
kind: ClassVar[str]
class TableFormerMode(str, Enum):
"""Modes for the TableFormer model."""
@ -102,10 +102,9 @@ class TableStructureOptions(BaseModel):
mode: TableFormerMode = TableFormerMode.ACCURATE
class OcrOptions(BaseModel):
class OcrOptions(BaseOptions):
"""OCR options."""
kind: str
lang: List[str]
force_full_page_ocr: bool = False # If enabled a full page OCR is always applied
bitmap_area_threshold: float = (
@ -116,7 +115,7 @@ class OcrOptions(BaseModel):
class RapidOcrOptions(OcrOptions):
"""Options for the RapidOCR engine."""
kind: Literal["rapidocr"] = "rapidocr"
kind: ClassVar[Literal["rapidocr"]] = "rapidocr"
# English and Chinese are the most commonly used models and have been tested with RapidOCR.
lang: List[str] = [
@ -155,7 +154,7 @@ class RapidOcrOptions(OcrOptions):
class EasyOcrOptions(OcrOptions):
"""Options for the EasyOCR engine."""
kind: Literal["easyocr"] = "easyocr"
kind: ClassVar[Literal["easyocr"]] = "easyocr"
lang: List[str] = ["fr", "de", "es", "en"]
use_gpu: Optional[bool] = None
@ -175,7 +174,7 @@ class EasyOcrOptions(OcrOptions):
class TesseractCliOcrOptions(OcrOptions):
"""Options for the TesseractCli engine."""
kind: Literal["tesseract"] = "tesseract"
kind: ClassVar[Literal["tesseract"]] = "tesseract"
lang: List[str] = ["fra", "deu", "spa", "eng"]
tesseract_cmd: str = "tesseract"
path: Optional[str] = None
@ -188,7 +187,7 @@ class TesseractCliOcrOptions(OcrOptions):
class TesseractOcrOptions(OcrOptions):
"""Options for the Tesseract engine."""
kind: Literal["tesserocr"] = "tesserocr"
kind: ClassVar[Literal["tesserocr"]] = "tesserocr"
lang: List[str] = ["fra", "deu", "spa", "eng"]
path: Optional[str] = None
@ -200,7 +199,7 @@ class TesseractOcrOptions(OcrOptions):
class OcrMacOptions(OcrOptions):
"""Options for the Mac OCR engine."""
kind: Literal["ocrmac"] = "ocrmac"
kind: ClassVar[Literal["ocrmac"]] = "ocrmac"
lang: List[str] = ["fr-FR", "de-DE", "es-ES", "en-US"]
recognition: str = "accurate"
framework: str = "vision"
@ -210,8 +209,7 @@ class OcrMacOptions(OcrOptions):
)
class PictureDescriptionBaseOptions(BaseModel):
kind: str
class PictureDescriptionBaseOptions(BaseOptions):
batch_size: int = 8
scale: float = 2
@ -221,7 +219,7 @@ class PictureDescriptionBaseOptions(BaseModel):
class PictureDescriptionApiOptions(PictureDescriptionBaseOptions):
kind: Literal["api"] = "api"
kind: ClassVar[Literal["api"]] = "api"
url: AnyUrl = AnyUrl("http://localhost:8000/v1/chat/completions")
headers: Dict[str, str] = {}
@ -233,7 +231,7 @@ class PictureDescriptionApiOptions(PictureDescriptionBaseOptions):
class PictureDescriptionVlmOptions(PictureDescriptionBaseOptions):
kind: Literal["vlm"] = "vlm"
kind: ClassVar[Literal["vlm"]] = "vlm"
repo_id: str
prompt: str = "Describe this image in a few sentences."
@ -265,6 +263,11 @@ class ResponseFormat(str, Enum):
MARKDOWN = "markdown"
class InferenceFramework(str, Enum):
MLX = "mlx"
TRANSFORMERS = "transformers"
class HuggingFaceVlmOptions(BaseVlmOptions):
kind: Literal["hf_model_options"] = "hf_model_options"
@ -273,6 +276,7 @@ class HuggingFaceVlmOptions(BaseVlmOptions):
llm_int8_threshold: float = 6.0
quantized: bool = False
inference_framework: InferenceFramework
response_format: ResponseFormat
@property
@ -280,10 +284,19 @@ class HuggingFaceVlmOptions(BaseVlmOptions):
return self.repo_id.replace("/", "--")
smoldocling_vlm_mlx_conversion_options = HuggingFaceVlmOptions(
repo_id="ds4sd/SmolDocling-256M-preview-mlx-bf16",
prompt="Convert this page to docling.",
response_format=ResponseFormat.DOCTAGS,
inference_framework=InferenceFramework.MLX,
)
smoldocling_vlm_conversion_options = HuggingFaceVlmOptions(
repo_id="ds4sd/SmolDocling-256M-preview",
prompt="Convert this page to docling.",
response_format=ResponseFormat.DOCTAGS,
inference_framework=InferenceFramework.TRANSFORMERS,
)
granite_vision_vlm_conversion_options = HuggingFaceVlmOptions(
@ -291,9 +304,15 @@ granite_vision_vlm_conversion_options = HuggingFaceVlmOptions(
# prompt="OCR the full page to markdown.",
prompt="OCR this image.",
response_format=ResponseFormat.MARKDOWN,
inference_framework=InferenceFramework.TRANSFORMERS,
)
class VlmModelType(str, Enum):
SMOLDOCLING = "smoldocling"
GRANITE_VISION = "granite_vision"
# Define an enum for the backend options
class PdfBackend(str, Enum):
"""Enum of valid PDF backends."""
@ -305,6 +324,7 @@ class PdfBackend(str, Enum):
# Define an enum for the ocr engines
@deprecated("Use ocr_factory.registered_enum")
class OcrEngine(str, Enum):
"""Enum of valid OCR engines."""
@ -324,16 +344,18 @@ class PipelineOptions(BaseModel):
document_timeout: Optional[float] = None
accelerator_options: AcceleratorOptions = AcceleratorOptions()
enable_remote_services: bool = False
allow_external_plugins: bool = False
class PaginatedPipelineOptions(PipelineOptions):
artifacts_path: Optional[Union[Path, str]] = None
images_scale: float = 1.0
generate_page_images: bool = False
generate_picture_images: bool = False
class VlmPipelineOptions(PaginatedPipelineOptions):
artifacts_path: Optional[Union[Path, str]] = None
generate_page_images: bool = True
force_backend_text: bool = (
@ -346,7 +368,6 @@ class VlmPipelineOptions(PaginatedPipelineOptions):
class PdfPipelineOptions(PaginatedPipelineOptions):
"""Options for the PDF pipeline."""
artifacts_path: Optional[Union[Path, str]] = None
do_table_structure: bool = True # True: perform table structure extraction
do_ocr: bool = True # True: perform OCR, replace programmatic PDF text
do_code_enrichment: bool = False # True: perform code OCR
@ -359,17 +380,10 @@ class PdfPipelineOptions(PaginatedPipelineOptions):
# If True, text from backend will be used instead of generated text
table_structure_options: TableStructureOptions = TableStructureOptions()
ocr_options: Union[
EasyOcrOptions,
TesseractCliOcrOptions,
TesseractOcrOptions,
OcrMacOptions,
RapidOcrOptions,
] = Field(EasyOcrOptions(), discriminator="kind")
picture_description_options: Annotated[
Union[PictureDescriptionApiOptions, PictureDescriptionVlmOptions],
Field(discriminator="kind"),
] = smolvlm_picture_description
ocr_options: OcrOptions = EasyOcrOptions()
picture_description_options: PictureDescriptionBaseOptions = (
smolvlm_picture_description
)
images_scale: float = 1.0
generate_page_images: bool = False
@ -384,3 +398,8 @@ class PdfPipelineOptions(PaginatedPipelineOptions):
)
generate_parsed_pages: bool = False
class PdfPipeline(str, Enum):
STANDARD = "standard"
VLM = "vlm"

View File

@ -1,3 +1,4 @@
import hashlib
import logging
import math
import sys
@ -181,7 +182,14 @@ class DocumentConverter:
)
for format in self.allowed_formats
}
self.initialized_pipelines: Dict[Type[BasePipeline], BasePipeline] = {}
self.initialized_pipelines: Dict[
Tuple[Type[BasePipeline], str], BasePipeline
] = {}
def _get_pipeline_options_hash(self, pipeline_options: PipelineOptions) -> str:
"""Generate a hash of pipeline options to use as part of the cache key."""
options_str = str(pipeline_options.model_dump())
return hashlib.md5(options_str.encode("utf-8")).hexdigest()
def initialize_pipeline(self, format: InputFormat):
"""Initialize the conversion pipeline for the selected format."""
@ -279,31 +287,36 @@ class DocumentConverter:
yield item
def _get_pipeline(self, doc_format: InputFormat) -> Optional[BasePipeline]:
"""Retrieve or initialize a pipeline, reusing instances based on class and options."""
fopt = self.format_to_options.get(doc_format)
if fopt is None:
if fopt is None or fopt.pipeline_options is None:
return None
else:
pipeline_class = fopt.pipeline_cls
pipeline_options = fopt.pipeline_options
options_hash = self._get_pipeline_options_hash(pipeline_options)
if pipeline_options is None:
return None
# TODO this will ignore if different options have been defined for the same pipeline class.
if (
pipeline_class not in self.initialized_pipelines
or self.initialized_pipelines[pipeline_class].pipeline_options
!= pipeline_options
):
self.initialized_pipelines[pipeline_class] = pipeline_class(
# Use a composite key to cache pipelines
cache_key = (pipeline_class, options_hash)
if cache_key not in self.initialized_pipelines:
_log.info(
f"Initializing pipeline for {pipeline_class.__name__} with options hash {options_hash}"
)
self.initialized_pipelines[cache_key] = pipeline_class(
pipeline_options=pipeline_options
)
return self.initialized_pipelines[pipeline_class]
else:
_log.debug(
f"Reusing cached pipeline for {pipeline_class.__name__} with options hash {options_hash}"
)
return self.initialized_pipelines[cache_key]
def _process_document(
self, in_doc: InputDocument, raises_on_error: bool
) -> ConversionResult:
valid = (
self.allowed_formats is not None and in_doc.format in self.allowed_formats
)
@ -345,7 +358,6 @@ class DocumentConverter:
else:
if raises_on_error:
raise ConversionError(f"Input document {in_doc.file} is not valid.")
else:
# invalid doc or not of desired format
conv_res = ConversionResult(

View File

@ -1,14 +1,22 @@
from abc import ABC, abstractmethod
from typing import Any, Generic, Iterable, Optional
from typing import Any, Generic, Iterable, Optional, Protocol, Type
from docling_core.types.doc import BoundingBox, DocItem, DoclingDocument, NodeItem
from typing_extensions import TypeVar
from docling.datamodel.base_models import ItemAndImageEnrichmentElement, Page
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import BaseOptions
from docling.datamodel.settings import settings
class BaseModelWithOptions(Protocol):
@classmethod
def get_options_type(cls) -> Type[BaseOptions]: ...
def __init__(self, *, options: BaseOptions, **kwargs): ...
class BasePageModel(ABC):
@abstractmethod
def __call__(

View File

@ -2,7 +2,7 @@ import copy
import logging
from abc import abstractmethod
from pathlib import Path
from typing import Iterable, List
from typing import Iterable, List, Optional, Type
import numpy as np
from docling_core.types.doc import BoundingBox, CoordOrigin
@ -13,15 +13,22 @@ from scipy.ndimage import binary_dilation, find_objects, label
from docling.datamodel.base_models import Page
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import OcrOptions
from docling.datamodel.pipeline_options import AcceleratorOptions, OcrOptions
from docling.datamodel.settings import settings
from docling.models.base_model import BasePageModel
from docling.models.base_model import BaseModelWithOptions, BasePageModel
_log = logging.getLogger(__name__)
class BaseOcrModel(BasePageModel):
def __init__(self, enabled: bool, options: OcrOptions):
class BaseOcrModel(BasePageModel, BaseModelWithOptions):
def __init__(
self,
*,
enabled: bool,
artifacts_path: Optional[Path],
options: OcrOptions,
accelerator_options: AcceleratorOptions,
):
self.enabled = enabled
self.options = options
@ -186,3 +193,8 @@ class BaseOcrModel(BasePageModel):
self, conv_res: ConversionResult, page_batch: Iterable[Page]
) -> Iterable[Page]:
pass
@classmethod
@abstractmethod
def get_options_type(cls) -> Type[OcrOptions]:
pass

View File

@ -2,7 +2,7 @@ import logging
import warnings
import zipfile
from pathlib import Path
from typing import Iterable, List, Optional
from typing import Iterable, List, Optional, Type
import numpy
from docling_core.types.doc import BoundingBox, CoordOrigin
@ -14,6 +14,7 @@ from docling.datamodel.pipeline_options import (
AcceleratorDevice,
AcceleratorOptions,
EasyOcrOptions,
OcrOptions,
)
from docling.datamodel.settings import settings
from docling.models.base_ocr_model import BaseOcrModel
@ -34,7 +35,12 @@ class EasyOcrModel(BaseOcrModel):
options: EasyOcrOptions,
accelerator_options: AcceleratorOptions,
):
super().__init__(enabled=enabled, options=options)
super().__init__(
enabled=enabled,
artifacts_path=artifacts_path,
options=options,
accelerator_options=accelerator_options,
)
self.options: EasyOcrOptions
self.scale = 3 # multiplier for 72 dpi == 216 dpi.
@ -180,3 +186,7 @@ class EasyOcrModel(BaseOcrModel):
self.draw_ocr_rects_and_cells(conv_res, page, ocr_rects)
yield page
@classmethod
def get_options_type(cls) -> Type[OcrOptions]:
return EasyOcrOptions

View File

@ -0,0 +1,27 @@
import logging
from functools import lru_cache
from docling.models.factories.ocr_factory import OcrFactory
from docling.models.factories.picture_description_factory import (
PictureDescriptionFactory,
)
logger = logging.getLogger(__name__)
@lru_cache()
def get_ocr_factory(allow_external_plugins: bool = False) -> OcrFactory:
factory = OcrFactory()
factory.load_from_plugins(allow_external_plugins=allow_external_plugins)
logger.info("Registered ocr engines: %r", factory.registered_kind)
return factory
@lru_cache()
def get_picture_description_factory(
allow_external_plugins: bool = False,
) -> PictureDescriptionFactory:
factory = PictureDescriptionFactory()
factory.load_from_plugins(allow_external_plugins=allow_external_plugins)
logger.info("Registered picture descriptions: %r", factory.registered_kind)
return factory
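A minimal usage sketch of the new factory helpers, mirroring how the updated CLI builds OCR options and how `StandardPdfPipeline.get_ocr_model()` instantiates the engine (the concrete values below are illustrative):

```python
from docling.datamodel.pipeline_options import AcceleratorOptions
from docling.models.factories import get_ocr_factory

# The factory is cached via lru_cache; pass allow_external_plugins=True to also
# load engines registered by third-party plugins.
factory = get_ocr_factory(allow_external_plugins=False)

# Build an options instance by kind; extra keyword arguments are forwarded to
# the matching options class (here EasyOcrOptions).
ocr_options = factory.create_options(kind="easyocr", force_full_page_ocr=True)

# Instantiate the registered model for those options, with the same keyword
# arguments the standard PDF pipeline passes in.
ocr_model = factory.create_instance(
    options=ocr_options,
    enabled=True,
    artifacts_path=None,
    accelerator_options=AcceleratorOptions(),
)
```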

View File

@ -0,0 +1,122 @@
import enum
import logging
from abc import ABCMeta
from typing import Generic, Optional, Type, TypeVar
from pluggy import PluginManager
from pydantic import BaseModel
from docling.datamodel.pipeline_options import BaseOptions
from docling.models.base_model import BaseModelWithOptions
A = TypeVar("A", bound=BaseModelWithOptions)
logger = logging.getLogger(__name__)
class FactoryMeta(BaseModel):
kind: str
plugin_name: str
module: str
class BaseFactory(Generic[A], metaclass=ABCMeta):
default_plugin_name = "docling"
def __init__(self, plugin_attr_name: str, plugin_name=default_plugin_name):
self.plugin_name = plugin_name
self.plugin_attr_name = plugin_attr_name
self._classes: dict[Type[BaseOptions], Type[A]] = {}
self._meta: dict[Type[BaseOptions], FactoryMeta] = {}
@property
def registered_kind(self) -> list[str]:
return list(opt.kind for opt in self._classes.keys())
def get_enum(self) -> enum.Enum:
return enum.Enum(
self.plugin_attr_name + "_enum",
names={kind: kind for kind in self.registered_kind},
type=str,
module=__name__,
)
@property
def classes(self):
return self._classes
@property
def registered_meta(self):
return self._meta
def create_instance(self, options: BaseOptions, **kwargs) -> A:
try:
_cls = self._classes[type(options)]
return _cls(options=options, **kwargs)
except KeyError:
raise RuntimeError(self._err_msg_on_class_not_found(options.kind))
def create_options(self, kind: str, *args, **kwargs) -> BaseOptions:
for opt_cls, _ in self._classes.items():
if opt_cls.kind == kind:
return opt_cls(*args, **kwargs)
raise RuntimeError(self._err_msg_on_class_not_found(kind))
def _err_msg_on_class_not_found(self, kind: str):
msg = []
for opt, cls in self._classes.items():
msg.append(f"\t{opt.kind!r} => {cls!r}")
msg_str = "\n".join(msg)
return f"No class found with the name {kind!r}, known classes are:\n{msg_str}"
def register(self, cls: Type[A], plugin_name: str, plugin_module_name: str):
opt_type = cls.get_options_type()
if opt_type in self._classes:
raise ValueError(
f"{opt_type.kind!r} already registered to class {self._classes[opt_type]!r}"
)
self._classes[opt_type] = cls
self._meta[opt_type] = FactoryMeta(
kind=opt_type.kind, plugin_name=plugin_name, module=plugin_module_name
)
def load_from_plugins(
self, plugin_name: Optional[str] = None, allow_external_plugins: bool = False
):
plugin_name = plugin_name or self.plugin_name
plugin_manager = PluginManager(plugin_name)
plugin_manager.load_setuptools_entrypoints(plugin_name)
for plugin_name, plugin_module in plugin_manager.list_name_plugin():
plugin_module_name = str(plugin_module.__name__) # type: ignore
if not allow_external_plugins and not plugin_module_name.startswith(
"docling."
):
logger.warning(
f"The plugin {plugin_name} will not be loaded because Docling is being executed with allow_external_plugins=false."
)
continue
attr = getattr(plugin_module, self.plugin_attr_name, None)
if callable(attr):
logger.info("Loading plugin %r", plugin_name)
config = attr()
self.process_plugin(config, plugin_name, plugin_module_name)
def process_plugin(self, config, plugin_name: str, plugin_module_name: str):
for item in config[self.plugin_attr_name]:
try:
self.register(item, plugin_name, plugin_module_name)
except ValueError:
logger.warning("%r already registered", item)

View File

@ -0,0 +1,11 @@
import logging
from docling.models.base_ocr_model import BaseOcrModel
from docling.models.factories.base_factory import BaseFactory
logger = logging.getLogger(__name__)
class OcrFactory(BaseFactory[BaseOcrModel]):
def __init__(self, *args, **kwargs):
super().__init__("ocr_engines", *args, **kwargs)

View File

@ -0,0 +1,11 @@
import logging
from docling.models.factories.base_factory import BaseFactory
from docling.models.picture_description_base_model import PictureDescriptionBaseModel
logger = logging.getLogger(__name__)
class PictureDescriptionFactory(BaseFactory[PictureDescriptionBaseModel]):
def __init__(self, *args, **kwargs):
super().__init__("picture_description", *args, **kwargs)

View File

@ -0,0 +1,137 @@
import logging
import time
from pathlib import Path
from typing import Iterable, List, Optional
from docling.datamodel.base_models import Page, VlmPrediction
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import (
AcceleratorDevice,
AcceleratorOptions,
HuggingFaceVlmOptions,
)
from docling.datamodel.settings import settings
from docling.models.base_model import BasePageModel
from docling.utils.accelerator_utils import decide_device
from docling.utils.profiling import TimeRecorder
_log = logging.getLogger(__name__)
class HuggingFaceMlxModel(BasePageModel):
def __init__(
self,
enabled: bool,
artifacts_path: Optional[Path],
accelerator_options: AcceleratorOptions,
vlm_options: HuggingFaceVlmOptions,
):
self.enabled = enabled
self.vlm_options = vlm_options
if self.enabled:
try:
from mlx_vlm import generate, load # type: ignore
from mlx_vlm.prompt_utils import apply_chat_template # type: ignore
from mlx_vlm.utils import load_config, stream_generate # type: ignore
except ImportError:
raise ImportError(
"mlx-vlm is not installed. Please install it via `pip install mlx-vlm` to use MLX VLM models."
)
repo_cache_folder = vlm_options.repo_id.replace("/", "--")
self.apply_chat_template = apply_chat_template
self.stream_generate = stream_generate
# PARAMETERS:
if artifacts_path is None:
artifacts_path = self.download_models(self.vlm_options.repo_id)
elif (artifacts_path / repo_cache_folder).exists():
artifacts_path = artifacts_path / repo_cache_folder
self.param_question = vlm_options.prompt # "Perform Layout Analysis."
## Load the model
self.vlm_model, self.processor = load(artifacts_path)
self.config = load_config(artifacts_path)
@staticmethod
def download_models(
repo_id: str,
local_dir: Optional[Path] = None,
force: bool = False,
progress: bool = False,
) -> Path:
from huggingface_hub import snapshot_download
from huggingface_hub.utils import disable_progress_bars
if not progress:
disable_progress_bars()
download_path = snapshot_download(
repo_id=repo_id,
force_download=force,
local_dir=local_dir,
# revision="v0.0.1",
)
return Path(download_path)
def __call__(
self, conv_res: ConversionResult, page_batch: Iterable[Page]
) -> Iterable[Page]:
for page in page_batch:
assert page._backend is not None
if not page._backend.is_valid():
yield page
else:
with TimeRecorder(conv_res, "vlm"):
assert page.size is not None
hi_res_image = page.get_image(scale=2.0) # 144dpi
# hi_res_image = page.get_image(scale=1.0) # 72dpi
if hi_res_image is not None:
im_width, im_height = hi_res_image.size
# populate page_tags with predicted doc tags
page_tags = ""
if hi_res_image:
if hi_res_image.mode != "RGB":
hi_res_image = hi_res_image.convert("RGB")
prompt = self.apply_chat_template(
self.processor, self.config, self.param_question, num_images=1
)
start_time = time.time()
# Call model to generate:
output = ""
for token in self.stream_generate(
self.vlm_model,
self.processor,
prompt,
[hi_res_image],
max_tokens=4096,
verbose=False,
):
output += token.text
if "</doctag>" in token.text:
break
generation_time = time.time() - start_time
page_tags = output
# inference_time = time.time() - start_time
# tokens_per_second = num_tokens / generation_time
# print("")
# print(f"Page Inference Time: {inference_time:.2f} seconds")
# print(f"Total tokens on page: {num_tokens:.2f}")
# print(f"Tokens/sec: {tokens_per_second:.2f}")
# print("")
page.predictions.vlm_response = VlmPrediction(text=page_tags)
yield page

View File

@ -1,13 +1,19 @@
import logging
import sys
import tempfile
from typing import Iterable, Optional, Tuple
from pathlib import Path
from typing import Iterable, Optional, Tuple, Type
from docling_core.types.doc import BoundingBox, CoordOrigin
from docling_core.types.doc.page import BoundingRectangle, TextCell
from docling.datamodel.base_models import Page
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import OcrMacOptions
from docling.datamodel.pipeline_options import (
AcceleratorOptions,
OcrMacOptions,
OcrOptions,
)
from docling.datamodel.settings import settings
from docling.models.base_ocr_model import BaseOcrModel
from docling.utils.profiling import TimeRecorder
@ -16,13 +22,26 @@ _log = logging.getLogger(__name__)
class OcrMacModel(BaseOcrModel):
def __init__(self, enabled: bool, options: OcrMacOptions):
super().__init__(enabled=enabled, options=options)
def __init__(
self,
enabled: bool,
artifacts_path: Optional[Path],
options: OcrMacOptions,
accelerator_options: AcceleratorOptions,
):
super().__init__(
enabled=enabled,
artifacts_path=artifacts_path,
options=options,
accelerator_options=accelerator_options,
)
self.options: OcrMacOptions
self.scale = 3 # multiplier for 72 dpi == 216 dpi.
if self.enabled:
if "darwin" != sys.platform:
raise RuntimeError(f"OcrMac is only supported on Mac.")
install_errmsg = (
"ocrmac is not correctly installed. "
"Please install it via `pip install ocrmac` to use this OCR engine. "
@ -121,3 +140,7 @@ class OcrMacModel(BaseOcrModel):
self.draw_ocr_rects_and_cells(conv_res, page, ocr_rects)
yield page
@classmethod
def get_options_type(cls) -> Type[OcrOptions]:
return OcrMacOptions

View File

@ -63,7 +63,13 @@ class PagePreprocessingModel(BasePageModel):
def draw_text_boxes(image, cells, show: bool = False):
draw = ImageDraw.Draw(image)
for c in cells:
x0, y0, x1, y1 = c.bbox.as_tuple()
x0, y0, x1, y1 = (
c.to_bounding_box().l,
c.to_bounding_box().t,
c.to_bounding_box().r,
c.to_bounding_box().b,
)
draw.rectangle([(x0, y0), (x1, y1)], outline="red")
if show:
image.show()

View File

@ -1,13 +1,18 @@
import base64
import io
import logging
from typing import Iterable, List, Optional
from pathlib import Path
from typing import Iterable, List, Optional, Type, Union
import requests
from PIL import Image
from pydantic import BaseModel, ConfigDict
from docling.datamodel.pipeline_options import PictureDescriptionApiOptions
from docling.datamodel.pipeline_options import (
AcceleratorOptions,
PictureDescriptionApiOptions,
PictureDescriptionBaseOptions,
)
from docling.exceptions import OperationNotAllowed
from docling.models.picture_description_base_model import PictureDescriptionBaseModel
@ -46,13 +51,25 @@ class ApiResponse(BaseModel):
class PictureDescriptionApiModel(PictureDescriptionBaseModel):
# elements_batch_size = 4
@classmethod
def get_options_type(cls) -> Type[PictureDescriptionBaseOptions]:
return PictureDescriptionApiOptions
def __init__(
self,
enabled: bool,
enable_remote_services: bool,
artifacts_path: Optional[Union[Path, str]],
options: PictureDescriptionApiOptions,
accelerator_options: AcceleratorOptions,
):
super().__init__(enabled=enabled, options=options)
super().__init__(
enabled=enabled,
enable_remote_services=enable_remote_services,
artifacts_path=artifacts_path,
options=options,
accelerator_options=accelerator_options,
)
self.options: PictureDescriptionApiOptions
if self.enabled:

View File

@ -1,6 +1,7 @@
import logging
from abc import abstractmethod
from pathlib import Path
from typing import Any, Iterable, List, Optional, Union
from typing import Any, Iterable, List, Optional, Type, Union
from docling_core.types.doc import (
DoclingDocument,
@ -13,20 +14,30 @@ from docling_core.types.doc.document import ( # TODO: move import to docling_co
)
from PIL import Image
from docling.datamodel.pipeline_options import PictureDescriptionBaseOptions
from docling.datamodel.pipeline_options import (
AcceleratorOptions,
PictureDescriptionBaseOptions,
)
from docling.models.base_model import (
BaseItemAndImageEnrichmentModel,
BaseModelWithOptions,
ItemAndImageEnrichmentElement,
)
class PictureDescriptionBaseModel(BaseItemAndImageEnrichmentModel):
class PictureDescriptionBaseModel(
BaseItemAndImageEnrichmentModel, BaseModelWithOptions
):
images_scale: float = 2.0
def __init__(
self,
*,
enabled: bool,
enable_remote_services: bool,
artifacts_path: Optional[Union[Path, str]],
options: PictureDescriptionBaseOptions,
accelerator_options: AcceleratorOptions,
):
self.enabled = enabled
self.options = options
@ -62,3 +73,8 @@ class PictureDescriptionBaseModel(BaseItemAndImageEnrichmentModel):
PictureDescriptionData(text=output, provenance=self.provenance)
)
yield item
@classmethod
@abstractmethod
def get_options_type(cls) -> Type[PictureDescriptionBaseOptions]:
pass

View File

@ -1,10 +1,11 @@
from pathlib import Path
from typing import Iterable, Optional, Union
from typing import Iterable, Optional, Type, Union
from PIL import Image
from docling.datamodel.pipeline_options import (
AcceleratorOptions,
PictureDescriptionBaseOptions,
PictureDescriptionVlmOptions,
)
from docling.models.picture_description_base_model import PictureDescriptionBaseModel
@ -13,14 +14,25 @@ from docling.utils.accelerator_utils import decide_device
class PictureDescriptionVlmModel(PictureDescriptionBaseModel):
@classmethod
def get_options_type(cls) -> Type[PictureDescriptionBaseOptions]:
return PictureDescriptionVlmOptions
def __init__(
self,
enabled: bool,
enable_remote_services: bool,
artifacts_path: Optional[Union[Path, str]],
options: PictureDescriptionVlmOptions,
accelerator_options: AcceleratorOptions,
):
super().__init__(enabled=enabled, options=options)
super().__init__(
enabled=enabled,
enable_remote_services=enable_remote_services,
artifacts_path=artifacts_path,
options=options,
accelerator_options=accelerator_options,
)
self.options: PictureDescriptionVlmOptions
if self.enabled:

View File

View File

@ -0,0 +1,28 @@
from docling.models.easyocr_model import EasyOcrModel
from docling.models.ocr_mac_model import OcrMacModel
from docling.models.picture_description_api_model import PictureDescriptionApiModel
from docling.models.picture_description_vlm_model import PictureDescriptionVlmModel
from docling.models.rapid_ocr_model import RapidOcrModel
from docling.models.tesseract_ocr_cli_model import TesseractOcrCliModel
from docling.models.tesseract_ocr_model import TesseractOcrModel
def ocr_engines():
return {
"ocr_engines": [
EasyOcrModel,
OcrMacModel,
RapidOcrModel,
TesseractOcrModel,
TesseractOcrCliModel,
]
}
def picture_description():
return {
"picture_description": [
PictureDescriptionVlmModel,
PictureDescriptionApiModel,
]
}
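Third-party packages can contribute engines through the same hook. A hedged sketch of what such a plugin could look like (the package, module, and class names are hypothetical); plugins whose module does not live under `docling.` are only loaded when `allow_external_plugins` is enabled, e.g. via the new `--allow-external-plugins` CLI flag:

```python
# my_docling_plugin/plugin.py -- hypothetical third-party package.
# It is discovered through a "docling" setuptools entry point, declared in the
# plugin's own pyproject.toml, for example:
#   [tool.poetry.plugins."docling"]
#   "my_docling_plugin" = "my_docling_plugin.plugin"
from my_docling_plugin.ocr import MyOcrModel  # a BaseOcrModel subclass implementing get_options_type()


def ocr_engines():
    # Same structure as the ocr_engines() hook in docling.models.plugins.defaults above.
    return {
        "ocr_engines": [
            MyOcrModel,
        ]
    }
```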

View File

@ -1,5 +1,6 @@
import logging
from typing import Iterable
from pathlib import Path
from typing import Iterable, Optional, Type
import numpy
from docling_core.types.doc import BoundingBox, CoordOrigin
@ -10,6 +11,7 @@ from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import (
AcceleratorDevice,
AcceleratorOptions,
OcrOptions,
RapidOcrOptions,
)
from docling.datamodel.settings import settings
@ -24,10 +26,16 @@ class RapidOcrModel(BaseOcrModel):
def __init__(
self,
enabled: bool,
artifacts_path: Optional[Path],
options: RapidOcrOptions,
accelerator_options: AcceleratorOptions,
):
super().__init__(enabled=enabled, options=options)
super().__init__(
enabled=enabled,
artifacts_path=artifacts_path,
options=options,
accelerator_options=accelerator_options,
)
self.options: RapidOcrOptions
self.scale = 3 # multiplier for 72 dpi == 216 dpi.
@ -135,3 +143,7 @@ class RapidOcrModel(BaseOcrModel):
self.draw_ocr_rects_and_cells(conv_res, page, ocr_rects)
yield page
@classmethod
def get_options_type(cls) -> Type[OcrOptions]:
return RapidOcrOptions

View File

@ -3,8 +3,9 @@ import io
import logging
import os
import tempfile
from pathlib import Path
from subprocess import DEVNULL, PIPE, Popen
from typing import Iterable, List, Optional, Tuple
from typing import Iterable, List, Optional, Tuple, Type
import pandas as pd
from docling_core.types.doc import BoundingBox, CoordOrigin
@ -12,7 +13,11 @@ from docling_core.types.doc.page import BoundingRectangle, TextCell
from docling.datamodel.base_models import Page
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import TesseractCliOcrOptions
from docling.datamodel.pipeline_options import (
AcceleratorOptions,
OcrOptions,
TesseractCliOcrOptions,
)
from docling.datamodel.settings import settings
from docling.models.base_ocr_model import BaseOcrModel
from docling.utils.ocr_utils import map_tesseract_script
@ -22,8 +27,19 @@ _log = logging.getLogger(__name__)
class TesseractOcrCliModel(BaseOcrModel):
def __init__(self, enabled: bool, options: TesseractCliOcrOptions):
super().__init__(enabled=enabled, options=options)
def __init__(
self,
enabled: bool,
artifacts_path: Optional[Path],
options: TesseractCliOcrOptions,
accelerator_options: AcceleratorOptions,
):
super().__init__(
enabled=enabled,
artifacts_path=artifacts_path,
options=options,
accelerator_options=accelerator_options,
)
self.options: TesseractCliOcrOptions
self.scale = 3 # multiplier for 72 dpi == 216 dpi.
@ -257,3 +273,7 @@ class TesseractOcrCliModel(BaseOcrModel):
self.draw_ocr_rects_and_cells(conv_res, page, ocr_rects)
yield page
@classmethod
def get_options_type(cls) -> Type[OcrOptions]:
return TesseractCliOcrOptions

View File

@ -1,12 +1,17 @@
import logging
from typing import Iterable
from pathlib import Path
from typing import Iterable, Optional, Type
from docling_core.types.doc import BoundingBox, CoordOrigin
from docling_core.types.doc.page import BoundingRectangle, TextCell
from docling.datamodel.base_models import Page
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import TesseractOcrOptions
from docling.datamodel.pipeline_options import (
AcceleratorOptions,
OcrOptions,
TesseractOcrOptions,
)
from docling.datamodel.settings import settings
from docling.models.base_ocr_model import BaseOcrModel
from docling.utils.ocr_utils import map_tesseract_script
@ -16,8 +21,19 @@ _log = logging.getLogger(__name__)
class TesseractOcrModel(BaseOcrModel):
def __init__(self, enabled: bool, options: TesseractOcrOptions):
super().__init__(enabled=enabled, options=options)
def __init__(
self,
enabled: bool,
artifacts_path: Optional[Path],
options: TesseractOcrOptions,
accelerator_options: AcceleratorOptions,
):
super().__init__(
enabled=enabled,
artifacts_path=artifacts_path,
options=options,
accelerator_options=accelerator_options,
)
self.options: TesseractOcrOptions
self.scale = 3 # multiplier for 72 dpi == 216 dpi.
@ -200,3 +216,7 @@ class TesseractOcrModel(BaseOcrModel):
self.draw_ocr_rects_and_cells(conv_res, page, ocr_rects)
yield page
@classmethod
def get_options_type(cls) -> Type[OcrOptions]:
return TesseractOcrOptions

View File

@ -10,16 +10,7 @@ from docling.backend.abstract_backend import AbstractDocumentBackend
from docling.backend.pdf_backend import PdfDocumentBackend
from docling.datamodel.base_models import AssembledUnit, Page
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import (
EasyOcrOptions,
OcrMacOptions,
PdfPipelineOptions,
PictureDescriptionApiOptions,
PictureDescriptionVlmOptions,
RapidOcrOptions,
TesseractCliOcrOptions,
TesseractOcrOptions,
)
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.settings import settings
from docling.models.base_ocr_model import BaseOcrModel
from docling.models.code_formula_model import CodeFormulaModel, CodeFormulaModelOptions
@ -27,22 +18,16 @@ from docling.models.document_picture_classifier import (
DocumentPictureClassifier,
DocumentPictureClassifierOptions,
)
from docling.models.easyocr_model import EasyOcrModel
from docling.models.factories import get_ocr_factory, get_picture_description_factory
from docling.models.layout_model import LayoutModel
from docling.models.ocr_mac_model import OcrMacModel
from docling.models.page_assemble_model import PageAssembleModel, PageAssembleOptions
from docling.models.page_preprocessing_model import (
PagePreprocessingModel,
PagePreprocessingOptions,
)
from docling.models.picture_description_api_model import PictureDescriptionApiModel
from docling.models.picture_description_base_model import PictureDescriptionBaseModel
from docling.models.picture_description_vlm_model import PictureDescriptionVlmModel
from docling.models.rapid_ocr_model import RapidOcrModel
from docling.models.readingorder_model import ReadingOrderModel, ReadingOrderOptions
from docling.models.table_structure_model import TableStructureModel
from docling.models.tesseract_ocr_cli_model import TesseractOcrCliModel
from docling.models.tesseract_ocr_model import TesseractOcrModel
from docling.pipeline.base_pipeline import PaginatedPipeline
from docling.utils.model_downloader import download_models
from docling.utils.profiling import ProfilingScope, TimeRecorder
@ -78,10 +63,7 @@ class StandardPdfPipeline(PaginatedPipeline):
self.glm_model = ReadingOrderModel(options=ReadingOrderOptions())
if (ocr_model := self.get_ocr_model(artifacts_path=artifacts_path)) is None:
raise RuntimeError(
f"The specified OCR kind is not supported: {pipeline_options.ocr_options.kind}."
)
ocr_model = self.get_ocr_model(artifacts_path=artifacts_path)
self.build_pipe = [
# Pre-processing
@ -164,66 +146,30 @@ class StandardPdfPipeline(PaginatedPipeline):
output_dir = download_models(output_dir=local_dir, force=force, progress=False)
return output_dir
def get_ocr_model(
self, artifacts_path: Optional[Path] = None
) -> Optional[BaseOcrModel]:
if isinstance(self.pipeline_options.ocr_options, EasyOcrOptions):
return EasyOcrModel(
def get_ocr_model(self, artifacts_path: Optional[Path] = None) -> BaseOcrModel:
factory = get_ocr_factory(
allow_external_plugins=self.pipeline_options.allow_external_plugins
)
return factory.create_instance(
options=self.pipeline_options.ocr_options,
enabled=self.pipeline_options.do_ocr,
artifacts_path=artifacts_path,
options=self.pipeline_options.ocr_options,
accelerator_options=self.pipeline_options.accelerator_options,
)
elif isinstance(self.pipeline_options.ocr_options, TesseractCliOcrOptions):
return TesseractOcrCliModel(
enabled=self.pipeline_options.do_ocr,
options=self.pipeline_options.ocr_options,
)
elif isinstance(self.pipeline_options.ocr_options, TesseractOcrOptions):
return TesseractOcrModel(
enabled=self.pipeline_options.do_ocr,
options=self.pipeline_options.ocr_options,
)
elif isinstance(self.pipeline_options.ocr_options, RapidOcrOptions):
return RapidOcrModel(
enabled=self.pipeline_options.do_ocr,
options=self.pipeline_options.ocr_options,
accelerator_options=self.pipeline_options.accelerator_options,
)
elif isinstance(self.pipeline_options.ocr_options, OcrMacOptions):
if "darwin" != sys.platform:
raise RuntimeError(
f"The specified OCR type is only supported on Mac: {self.pipeline_options.ocr_options.kind}."
)
return OcrMacModel(
enabled=self.pipeline_options.do_ocr,
options=self.pipeline_options.ocr_options,
)
return None
def get_picture_description_model(
self, artifacts_path: Optional[Path] = None
) -> Optional[PictureDescriptionBaseModel]:
if isinstance(
self.pipeline_options.picture_description_options,
PictureDescriptionApiOptions,
):
return PictureDescriptionApiModel(
factory = get_picture_description_factory(
allow_external_plugins=self.pipeline_options.allow_external_plugins
)
return factory.create_instance(
options=self.pipeline_options.picture_description_options,
enabled=self.pipeline_options.do_picture_description,
enable_remote_services=self.pipeline_options.enable_remote_services,
options=self.pipeline_options.picture_description_options,
)
elif isinstance(
self.pipeline_options.picture_description_options,
PictureDescriptionVlmOptions,
):
return PictureDescriptionVlmModel(
enabled=self.pipeline_options.do_picture_description,
artifacts_path=artifacts_path,
options=self.pipeline_options.picture_description_options,
accelerator_options=self.pipeline_options.accelerator_options,
)
return None
def initialize_page(self, conv_res: ConversionResult, page: Page) -> Page:
with TimeRecorder(conv_res, "page_init"):

View File

@ -1,30 +1,13 @@
import itertools
import logging
import re
import warnings
from io import BytesIO
# from io import BytesIO
from pathlib import Path
from typing import Optional
from typing import List, Optional, Union, cast
from docling_core.types import DoclingDocument
from docling_core.types.doc import (
BoundingBox,
DocItem,
DocItemLabel,
DoclingDocument,
GroupLabel,
ImageRef,
ImageRefMode,
PictureItem,
ProvenanceItem,
Size,
TableCell,
TableData,
TableItem,
)
from docling_core.types.doc.tokens import DocumentToken, TableToken
# from docling_core.types import DoclingDocument
from docling_core.types.doc import BoundingBox, DocItem, ImageRef, PictureItem, TextItem
from docling_core.types.doc.document import DocTagsDocument
from PIL import Image as PILImage
from docling.backend.abstract_backend import AbstractDocumentBackend
from docling.backend.md_backend import MarkdownDocumentBackend
@ -32,11 +15,12 @@ from docling.backend.pdf_backend import PdfDocumentBackend
from docling.datamodel.base_models import InputFormat, Page
from docling.datamodel.document import ConversionResult, InputDocument
from docling.datamodel.pipeline_options import (
PdfPipelineOptions,
InferenceFramework,
ResponseFormat,
VlmPipelineOptions,
)
from docling.datamodel.settings import settings
from docling.models.hf_mlx_model import HuggingFaceMlxModel
from docling.models.hf_vlm_model import HuggingFaceVlmModel
from docling.pipeline.base_pipeline import PaginatedPipeline
from docling.utils.profiling import ProfilingScope, TimeRecorder
@ -50,12 +34,6 @@ class VlmPipeline(PaginatedPipeline):
super().__init__(pipeline_options)
self.keep_backend = True
warnings.warn(
"The VlmPipeline is currently experimental and may change in upcoming versions without notice.",
category=UserWarning,
stacklevel=2,
)
self.pipeline_options: VlmPipelineOptions
artifacts_path: Optional[Path] = None
@ -79,6 +57,19 @@ class VlmPipeline(PaginatedPipeline):
self.keep_images = self.pipeline_options.generate_page_images
if (
self.pipeline_options.vlm_options.inference_framework
== InferenceFramework.MLX
):
self.build_pipe = [
HuggingFaceMlxModel(
enabled=True,  # must always be enabled for this pipeline to make sense.
artifacts_path=artifacts_path,
accelerator_options=pipeline_options.accelerator_options,
vlm_options=self.pipeline_options.vlm_options,
),
]
else:
self.build_pipe = [
HuggingFaceVlmModel(
enabled=True,  # must always be enabled for this pipeline to make sense.
@ -100,6 +91,17 @@ class VlmPipeline(PaginatedPipeline):
return page
def extract_text_from_backend(
self, page: Page, bbox: Union[BoundingBox, None]
) -> str:
# The caller is expected to pass a bounding box already converted to page coordinates for cropping
text = ""
if bbox:
if page.size:
if page._backend:
text = page._backend.get_text_in_rect(bbox)
return text
def _assemble_document(self, conv_res: ConversionResult) -> ConversionResult:
with TimeRecorder(conv_res, "doc_assemble", scope=ProfilingScope.DOCUMENT):
@ -107,7 +109,45 @@ class VlmPipeline(PaginatedPipeline):
self.pipeline_options.vlm_options.response_format
== ResponseFormat.DOCTAGS
):
conv_res.document = self._turn_tags_into_doc(conv_res.pages)
doctags_list = []
image_list = []
for page in conv_res.pages:
predicted_doctags = ""
img = PILImage.new("RGB", (1, 1), "rgb(255,255,255)")
if page.predictions.vlm_response:
predicted_doctags = page.predictions.vlm_response.text
if page.image:
img = page.image
image_list.append(img)
doctags_list.append(predicted_doctags)
doctags_list_c = cast(List[Union[Path, str]], doctags_list)
image_list_c = cast(List[Union[Path, PILImage.Image]], image_list)
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs(
doctags_list_c, image_list_c
)
conv_res.document.load_from_doctags(doctags_doc)
# If forced backend text, replace model predicted text with backend one
if page.size:
if self.force_backend_text:
scale = self.pipeline_options.images_scale
for element, _level in conv_res.document.iterate_items():
if (
not isinstance(element, TextItem)
or len(element.prov) == 0
):
continue
crop_bbox = (
element.prov[0]
.bbox.scaled(scale=scale)
.to_top_left_origin(
page_height=page.size.height * scale
)
)
txt = self.extract_text_from_backend(page, crop_bbox)
element.text = txt
element.orig = txt
elif (
self.pipeline_options.vlm_options.response_format
== ResponseFormat.MARKDOWN
@ -165,366 +205,6 @@ class VlmPipeline(PaginatedPipeline):
)
return backend.convert()
def _turn_tags_into_doc(self, pages: list[Page]) -> DoclingDocument:
###############################################
# Tag definitions and color mappings
###############################################
# Maps the recognized tag to a Docling label.
# Code items will be given DocItemLabel.CODE
tag_to_doclabel = {
"title": DocItemLabel.TITLE,
"document_index": DocItemLabel.DOCUMENT_INDEX,
"otsl": DocItemLabel.TABLE,
"section_header_level_1": DocItemLabel.SECTION_HEADER,
"checkbox_selected": DocItemLabel.CHECKBOX_SELECTED,
"checkbox_unselected": DocItemLabel.CHECKBOX_UNSELECTED,
"text": DocItemLabel.TEXT,
"page_header": DocItemLabel.PAGE_HEADER,
"page_footer": DocItemLabel.PAGE_FOOTER,
"formula": DocItemLabel.FORMULA,
"caption": DocItemLabel.CAPTION,
"picture": DocItemLabel.PICTURE,
"list_item": DocItemLabel.LIST_ITEM,
"footnote": DocItemLabel.FOOTNOTE,
"code": DocItemLabel.CODE,
}
# Maps each tag to an associated bounding box color.
tag_to_color = {
"title": "blue",
"document_index": "darkblue",
"otsl": "green",
"section_header_level_1": "purple",
"checkbox_selected": "black",
"checkbox_unselected": "gray",
"text": "red",
"page_header": "orange",
"page_footer": "cyan",
"formula": "pink",
"caption": "magenta",
"picture": "yellow",
"list_item": "brown",
"footnote": "darkred",
"code": "lightblue",
}
def extract_bounding_box(text_chunk: str) -> Optional[BoundingBox]:
"""Extracts <loc_...> bounding box coords from the chunk, normalized by / 500."""
coords = re.findall(r"<loc_(\d+)>", text_chunk)
if len(coords) == 4:
l, t, r, b = map(float, coords)
return BoundingBox(l=l / 500, t=t / 500, r=r / 500, b=b / 500)
return None
def extract_inner_text(text_chunk: str) -> str:
"""Strips all <...> tags inside the chunk to get the raw text content."""
return re.sub(r"<.*?>", "", text_chunk, flags=re.DOTALL).strip()
def extract_text_from_backend(page: Page, bbox: BoundingBox | None) -> str:
# Convert bounding box normalized to 0-100 into page coordinates for cropping
text = ""
if bbox:
if page.size:
bbox.l = bbox.l * page.size.width
bbox.t = bbox.t * page.size.height
bbox.r = bbox.r * page.size.width
bbox.b = bbox.b * page.size.height
if page._backend:
text = page._backend.get_text_in_rect(bbox)
return text
def otsl_parse_texts(texts, tokens):
split_word = TableToken.OTSL_NL.value
split_row_tokens = [
list(y)
for x, y in itertools.groupby(tokens, lambda z: z == split_word)
if not x
]
table_cells = []
r_idx = 0
c_idx = 0
def count_right(tokens, c_idx, r_idx, which_tokens):
span = 0
c_idx_iter = c_idx
while tokens[r_idx][c_idx_iter] in which_tokens:
c_idx_iter += 1
span += 1
if c_idx_iter >= len(tokens[r_idx]):
return span
return span
def count_down(tokens, c_idx, r_idx, which_tokens):
span = 0
r_idx_iter = r_idx
while tokens[r_idx_iter][c_idx] in which_tokens:
r_idx_iter += 1
span += 1
if r_idx_iter >= len(tokens):
return span
return span
for i, text in enumerate(texts):
cell_text = ""
if text in [
TableToken.OTSL_FCEL.value,
TableToken.OTSL_ECEL.value,
TableToken.OTSL_CHED.value,
TableToken.OTSL_RHED.value,
TableToken.OTSL_SROW.value,
]:
row_span = 1
col_span = 1
right_offset = 1
if text != TableToken.OTSL_ECEL.value:
cell_text = texts[i + 1]
right_offset = 2
# Check next element(s) for lcel / ucel / xcel, set properly row_span, col_span
next_right_cell = ""
if i + right_offset < len(texts):
next_right_cell = texts[i + right_offset]
next_bottom_cell = ""
if r_idx + 1 < len(split_row_tokens):
if c_idx < len(split_row_tokens[r_idx + 1]):
next_bottom_cell = split_row_tokens[r_idx + 1][c_idx]
if next_right_cell in [
TableToken.OTSL_LCEL.value,
TableToken.OTSL_XCEL.value,
]:
# we have horisontal spanning cell or 2d spanning cell
col_span += count_right(
split_row_tokens,
c_idx + 1,
r_idx,
[TableToken.OTSL_LCEL.value, TableToken.OTSL_XCEL.value],
)
if next_bottom_cell in [
TableToken.OTSL_UCEL.value,
TableToken.OTSL_XCEL.value,
]:
# we have a vertical spanning cell or 2d spanning cell
row_span += count_down(
split_row_tokens,
c_idx,
r_idx + 1,
[TableToken.OTSL_UCEL.value, TableToken.OTSL_XCEL.value],
)
table_cells.append(
TableCell(
text=cell_text.strip(),
row_span=row_span,
col_span=col_span,
start_row_offset_idx=r_idx,
end_row_offset_idx=r_idx + row_span,
start_col_offset_idx=c_idx,
end_col_offset_idx=c_idx + col_span,
)
)
if text in [
TableToken.OTSL_FCEL.value,
TableToken.OTSL_ECEL.value,
TableToken.OTSL_CHED.value,
TableToken.OTSL_RHED.value,
TableToken.OTSL_SROW.value,
TableToken.OTSL_LCEL.value,
TableToken.OTSL_UCEL.value,
TableToken.OTSL_XCEL.value,
]:
c_idx += 1
if text == TableToken.OTSL_NL.value:
r_idx += 1
c_idx = 0
return table_cells, split_row_tokens
def otsl_extract_tokens_and_text(s: str):
# Pattern to match anything enclosed by < > (including the angle brackets themselves)
pattern = r"(<[^>]+>)"
# Find all tokens (e.g. "<otsl>", "<loc_140>", etc.)
tokens = re.findall(pattern, s)
# Remove any tokens that start with "<loc_"
tokens = [
token
for token in tokens
if not (
token.startswith(rf"<{DocumentToken.LOC.value}")
or token
in [
rf"<{DocumentToken.OTSL.value}>",
rf"</{DocumentToken.OTSL.value}>",
]
)
]
# Split the string by those tokens to get the in-between text
text_parts = re.split(pattern, s)
text_parts = [
token
for token in text_parts
if not (
token.startswith(rf"<{DocumentToken.LOC.value}")
or token
in [
rf"<{DocumentToken.OTSL.value}>",
rf"</{DocumentToken.OTSL.value}>",
]
)
]
# Remove any empty or purely whitespace strings from text_parts
text_parts = [part for part in text_parts if part.strip()]
return tokens, text_parts
def parse_table_content(otsl_content: str) -> TableData:
tokens, mixed_texts = otsl_extract_tokens_and_text(otsl_content)
table_cells, split_row_tokens = otsl_parse_texts(mixed_texts, tokens)
return TableData(
num_rows=len(split_row_tokens),
num_cols=(
max(len(row) for row in split_row_tokens) if split_row_tokens else 0
),
table_cells=table_cells,
)
doc = DoclingDocument(name="Document")
for pg_idx, page in enumerate(pages):
xml_content = ""
predicted_text = ""
if page.predictions.vlm_response:
predicted_text = page.predictions.vlm_response.text
image = page.image
page_no = pg_idx + 1
bounding_boxes = []
if page.size:
pg_width = page.size.width
pg_height = page.size.height
size = Size(width=pg_width, height=pg_height)
parent_page = doc.add_page(page_no=page_no, size=size)
"""
1. Finds all <tag>...</tag> blocks in the entire string (multi-line friendly) in the order they appear.
2. For each chunk, extracts bounding box (if any) and inner text.
3. Adds the item to a DoclingDocument structure with the right label.
4. Tracks bounding boxes + color in a separate list for later visualization.
"""
# Regex for all recognized tags
tag_pattern = (
rf"<(?P<tag>{DocItemLabel.TITLE}|{DocItemLabel.DOCUMENT_INDEX}|"
rf"{DocItemLabel.CHECKBOX_UNSELECTED}|{DocItemLabel.CHECKBOX_SELECTED}|"
rf"{DocItemLabel.TEXT}|{DocItemLabel.PAGE_HEADER}|"
rf"{DocItemLabel.PAGE_FOOTER}|{DocItemLabel.FORMULA}|"
rf"{DocItemLabel.CAPTION}|{DocItemLabel.PICTURE}|"
rf"{DocItemLabel.LIST_ITEM}|{DocItemLabel.FOOTNOTE}|{DocItemLabel.CODE}|"
rf"{DocItemLabel.SECTION_HEADER}_level_1|{DocumentToken.OTSL.value})>.*?</(?P=tag)>"
)
# DocumentToken.OTSL
pattern = re.compile(tag_pattern, re.DOTALL)
# Go through each match in order
for match in pattern.finditer(predicted_text):
full_chunk = match.group(0)
tag_name = match.group("tag")
bbox = extract_bounding_box(full_chunk)
doc_label = tag_to_doclabel.get(tag_name, DocItemLabel.PARAGRAPH)
color = tag_to_color.get(tag_name, "white")
# Store bounding box + color
if bbox:
bounding_boxes.append((bbox, color))
if tag_name == DocumentToken.OTSL.value:
table_data = parse_table_content(full_chunk)
bbox = extract_bounding_box(full_chunk)
if bbox:
prov = ProvenanceItem(
bbox=bbox.resize_by_scale(pg_width, pg_height),
charspan=(0, 0),
page_no=page_no,
)
doc.add_table(data=table_data, prov=prov)
else:
doc.add_table(data=table_data)
elif tag_name == DocItemLabel.PICTURE:
text_caption_content = extract_inner_text(full_chunk)
if image:
if bbox:
im_width, im_height = image.size
crop_box = (
int(bbox.l * im_width),
int(bbox.t * im_height),
int(bbox.r * im_width),
int(bbox.b * im_height),
)
cropped_image = image.crop(crop_box)
pic = doc.add_picture(
parent=None,
image=ImageRef.from_pil(image=cropped_image, dpi=72),
prov=(
ProvenanceItem(
bbox=bbox.resize_by_scale(pg_width, pg_height),
charspan=(0, 0),
page_no=page_no,
)
),
)
# If there is a caption to an image, add it as well
if len(text_caption_content) > 0:
caption_item = doc.add_text(
label=DocItemLabel.CAPTION,
text=text_caption_content,
parent=None,
)
pic.captions.append(caption_item.get_ref())
else:
if bbox:
# In case we don't have access to an binary of an image
doc.add_picture(
parent=None,
prov=ProvenanceItem(
bbox=bbox, charspan=(0, 0), page_no=page_no
),
)
# If there is a caption to an image, add it as well
if len(text_caption_content) > 0:
caption_item = doc.add_text(
label=DocItemLabel.CAPTION,
text=text_caption_content,
parent=None,
)
pic.captions.append(caption_item.get_ref())
else:
# For everything else, treat as text
if self.force_backend_text:
text_content = extract_text_from_backend(page, bbox)
else:
text_content = extract_inner_text(full_chunk)
doc.add_text(
label=doc_label,
text=text_content,
prov=(
ProvenanceItem(
bbox=bbox.resize_by_scale(pg_width, pg_height),
charspan=(0, len(text_content)),
page_no=page_no,
)
if bbox
else None
),
)
return doc
@classmethod
def get_default_options(cls) -> VlmPipelineOptions:
return VlmPipelineOptions()

View File

@ -154,7 +154,7 @@ def main():
conv_results = doc_converter.convert_all(
input_doc_paths,
raises_on_error=True, # to let conversion run through all and examine results at the end
raises_on_error=False, # to let conversion run through all and examine results at the end
)
success_count, partial_success_count, failure_count = export_documents(
conv_results, output_dir=Path("scratch")

View File

@ -10,13 +10,15 @@ from docling.datamodel.pipeline_options import (
VlmPipelineOptions,
granite_vision_vlm_conversion_options,
smoldocling_vlm_conversion_options,
smoldocling_vlm_mlx_conversion_options,
)
from docling.datamodel.settings import settings
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline
sources = [
"tests/data/2305.03393v1-pg9-img.png",
# "tests/data/2305.03393v1-pg9-img.png",
"tests/data/pdf/2305.03393v1-pg9.pdf",
]
## Use experimental VlmPipeline
@ -29,7 +31,10 @@ pipeline_options.force_backend_text = False
# pipeline_options.accelerator_options.cuda_use_flash_attention2 = True
## Pick a VLM model. We choose SmolDocling-256M by default
pipeline_options.vlm_options = smoldocling_vlm_conversion_options
# pipeline_options.vlm_options = smoldocling_vlm_conversion_options
## Pick a VLM model. Fast Apple Silicon friendly implementation for SmolDocling-256M via MLX
pipeline_options.vlm_options = smoldocling_vlm_mlx_conversion_options
## Alternative VLM models:
# pipeline_options.vlm_options = granite_vision_vlm_conversion_options
@ -63,9 +68,6 @@ for source in sources:
res = converter.convert(source)
print("------------------------------------------------")
print("MD:")
print("------------------------------------------------")
print("")
print(res.document.export_to_markdown())
@ -83,8 +85,17 @@ for source in sources:
with (out_path / f"{res.input.file.stem}.json").open("w") as fp:
fp.write(json.dumps(res.document.export_to_dict()))
pg_num = res.document.num_pages()
res.document.save_as_json(
out_path / f"{res.input.file.stem}.json",
image_mode=ImageRefMode.PLACEHOLDER,
)
res.document.save_as_markdown(
out_path / f"{res.input.file.stem}.md",
image_mode=ImageRefMode.PLACEHOLDER,
)
pg_num = res.document.num_pages()
print("")
inference_time = time.time() - start_time
print(

View File

@ -13,6 +13,7 @@
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![License MIT](https://img.shields.io/github/license/docling-project/docling)](https://opensource.org/licenses/MIT)
[![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
[![LF AI & Data](https://img.shields.io/badge/LF%20AI%20%26%20Data-003778?logo=linuxfoundation&logoColor=fff&color=0094ff&labelColor=003778)](https://lfaidata.foundation/projects/)
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
@ -25,12 +26,12 @@ Docling simplifies document processing, parsing diverse formats — including ad
* 🔒 Local execution capabilities for sensitive data and air-gapped environments
* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 Extensive OCR support for scanned PDFs and images
* 🥚 Support of Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview)) 🆕🔥
* 💻 Simple and convenient CLI
### Coming soon
* 📝 Metadata extraction, including title, authors, references & language
* 📝 Inclusion of Visual Language Models ([SmolDocling](https://huggingface.co/blog/smolervlm#smoldocling))
* 📝 Chart understanding (Barchart, Piechart, LinePlot, etc)
* 📝 Complex chemistry understanding (Molecular structures)
@ -43,9 +44,13 @@ Docling simplifies document processing, parsing diverse formats — including ad
<a href="reference/document_converter/" class="card"><b>Reference</b><br />See more API details</a>
</div>
## IBM ❤️ Open Source AI
## LF AI & Data
Docling has been brought to you by IBM.
Docling is hosted as a project in the [LF AI & Data Foundation](https://lfaidata.foundation/projects/).
### IBM ❤️ Open Source AI
The project was started by the AI for knowledge team at IBM Research Zurich.
[supported_formats]: ./usage/supported_formats.md
[docling_document]: ./concepts/docling_document.md

View File

@ -0,0 +1,35 @@
You can run Docling in the cloud without installation using the [Docling Actor][apify] on the Apify platform. Simply provide a document URL and get the processed result:
<a href="https://apify.com/vancura/docling?fpr=docling"><img src="https://apify.com/ext/run-on-apify.png" alt="Run Docling Actor on Apify" width="176" height="39" /></a>
```bash
apify call vancura/docling -i '{
"options": {
"to_formats": ["md", "json", "html", "text", "doctags"]
},
"http_sources": [
{"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
{"url": "https://arxiv.org/pdf/2408.09869"}
]
}'
```
The Actor stores results in:
* Processed document in key-value store (`OUTPUT_RESULT`)
* Processing logs (`DOCLING_LOG`)
* Dataset record with result URL and status
Read more about the [Docling Actor](.actor/README.md), including how to use it via the Apify API and CLI.
- 💻 [GitHub][github]
- 📖 [Docs][docs]
- 📦 [Docling Actor][apify]
[github]: https://github.com/docling-project/docling/tree/main/.actor/
[docs]: https://github.com/docling-project/docling/tree/main/.actor/README.md
[apify]: https://apify.com/vancura/docling?fpr=docling
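For completeness, the same Actor can also be invoked programmatically from Python. The following is a rough sketch using the `apify-client` package: the Actor name and record keys come from the text above, while the token handling and key-value-store lookup are assumptions about that client library rather than code from this repository.

```python
import os

from apify_client import ApifyClient

# An Apify API token is assumed to be available in the environment.
client = ApifyClient(os.environ["APIFY_TOKEN"])

# Start the Docling Actor with the same input shape as the CLI example
# above and wait for the run to finish.
run = client.actor("vancura/docling").call(
    run_input={
        "options": {"to_formats": ["md"]},
        "http_sources": [{"url": "https://arxiv.org/pdf/2408.09869"}],
    }
)

# The processed document is stored as OUTPUT_RESULT in the run's default
# key-value store; DOCLING_LOG holds the processing logs.
store = client.key_value_store(run["defaultKeyValueStoreId"])
record = store.get_record("OUTPUT_RESULT")
print(record["value"] if record else "no OUTPUT_RESULT record found")
```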

View File

@ -17,10 +17,15 @@ print(result.document.export_to_markdown()) # output: "### Docling Technical Re
You can also use Docling directly from your command line to convert individual files — be they local or by URL — or whole directories.
A simple example would look like this:
```console
docling https://arxiv.org/pdf/2206.01062
```
You can also use 🥚[SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview) and other VLMs via the Docling CLI:
```bash
docling --pipeline vlm --vlm-model smoldocling https://arxiv.org/pdf/2206.01062
```
This will use MLX acceleration on supported Apple Silicon hardware.
To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](../reference/cli.md).

View File

@ -111,6 +111,7 @@ nav:
- "LlamaIndex": integrations/llamaindex.md
- "txtai": integrations/txtai.md
- ⭐️ Featured:
- "Apify": integrations/apify.md
- "Data Prep Kit": integrations/data_prep_kit.md
- "InstructLab": integrations/instructlab.md
- "NVIDIA": integrations/nvidia.md

748
poetry.lock generated

File diff suppressed because it is too large

View File

@ -1,6 +1,6 @@
[tool.poetry]
name = "docling"
version = "2.26.0" # DO NOT EDIT, updated automatically
version = "2.28.0" # DO NOT EDIT, updated automatically
description = "SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications."
authors = [
"Christoph Auer <cau@zurich.ibm.com>",
@ -46,9 +46,9 @@ packages = [{ include = "docling" }]
######################
python = "^3.9"
pydantic = "^2.0.0"
docling-core = {extras = ["chunking"], version = "^2.23.0"}
docling-core = {extras = ["chunking"], version = "^2.23.1"}
docling-ibm-models = "^3.4.0"
docling-parse = "^4.0.0"
docling-parse = {git = "https://github.com/DS4SD/docling-parse", rev = "cau/line-sanitation-update"}
filetype = "^1.2.0"
pypdfium2 = "^4.30.0"
pydantic-settings = "^2.3.0"
@ -88,6 +88,7 @@ accelerate = [
]
pillow = ">=10.0.0,<12.0.0"
tqdm = "^4.65.0"
pluggy = "^1.0.0"
pylatexenc = "^2.10"
[tool.poetry.group.dev.dependencies]
@ -156,6 +157,9 @@ rapidocr = ["rapidocr-onnxruntime", "onnxruntime"]
docling = "docling.cli.main:app"
docling-tools = "docling.cli.tools:app"
[tool.poetry.plugins."docling"]
"docling_defaults" = "docling.models.plugins.defaults"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
@ -188,6 +192,7 @@ module = [
"docling_ibm_models.*",
"easyocr.*",
"ocrmac.*",
"mlx_vlm.*",
"lxml.*",
"huggingface_hub.*",
"transformers.*",

View File

@ -1,4 +1,5 @@
<document>
<paragraph><location><page_1><loc_3><loc_75><loc_6><loc_80></location>2022</paragraph>
<subtitle-level-1><location><page_1><loc_16><loc_85><loc_82><loc_86></location>TableFormer: Table Structure Understanding with Transformers.</subtitle-level-1>
<subtitle-level-1><location><page_1><loc_23><loc_78><loc_74><loc_81></location>Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar IBM Research</subtitle-level-1>
<paragraph><location><page_1><loc_34><loc_77><loc_62><loc_78></location>{ ahn,nli,mly,taa @zurich.ibm.com }</paragraph>
@ -211,16 +212,16 @@
<paragraph><location><page_9><loc_11><loc_85><loc_47><loc_90></location>- end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 , pages 213-229, Cham, 2020. Springer International Publishing. 5</paragraph>
<paragraph><location><page_9><loc_9><loc_81><loc_47><loc_85></location>- [2] Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling Mao. Complicated table structure recognition. arXiv preprint arXiv:1908.04729 , 2019. 3</paragraph>
<paragraph><location><page_9><loc_9><loc_77><loc_47><loc_81></location>- [3] Bertrand Couasnon and Aurelie Lemaitre. Recognition of Tables and Forms , pages 647-677. Springer London, London, 2014. 2</paragraph>
<paragraph><location><page_9><loc_9><loc_71><loc_47><loc_76></location>- [4] Herv´ e D´ ejean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2</paragraph>
<paragraph><location><page_9><loc_9><loc_71><loc_47><loc_76></location>- [4] Herv´ e D´jean, e Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2</paragraph>
<paragraph><location><page_9><loc_9><loc_66><loc_47><loc_71></location>- [5] Basilios Gatos, Dimitrios Danatsas, Ioannis Pratikakis, and Stavros J Perantonis. Automatic table detection in document images. In International Conference on Pattern Recognition and Image Analysis , pages 609-618. Springer, 2005. 2</paragraph>
<paragraph><location><page_9><loc_9><loc_60><loc_47><loc_65></location>- [6] MaxG¨ obel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2</paragraph>
<paragraph><location><page_9><loc_9><loc_60><loc_47><loc_65></location>- [6] MaxG¨bel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. o Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2</paragraph>
<paragraph><location><page_9><loc_9><loc_56><loc_47><loc_60></location>- [7] EA Green and M Krishnamoorthy. Recognition of tables using table grammars. procs. In Symposium on Document Analysis and Recognition (SDAIR'95) , pages 261-277. 2</paragraph>
<paragraph><location><page_9><loc_9><loc_49><loc_47><loc_56></location>- [8] Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal. Castabdetectors: Cascade network for table detection in document images with recursive feature pyramid and switchable atrous convolution. Journal of Imaging , 7(10), 2021. 1</paragraph>
<paragraph><location><page_9><loc_9><loc_45><loc_47><loc_49></location>- [9] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) , Oct 2017. 1</paragraph>
<paragraph><location><page_9><loc_8><loc_39><loc_47><loc_44></location>- [10] Yelin He, X. Qi, Jiaquan Ye, Peng Gao, Yihao Chen, Bingcong Li, Xin Tang, and Rong Xiao. Pingan-vcgroup's solution for icdar 2021 competition on scientific table image recognition to latex. ArXiv , abs/2105.01846, 2021. 2</paragraph>
<paragraph><location><page_9><loc_8><loc_32><loc_47><loc_39></location>- [11] Jianying Hu, Ramanujan S Kashi, Daniel P Lopresti, and Gordon Wilfong. Medium-independent table detection. In Document Recognition and Retrieval VII , volume 3967, pages 291-302. International Society for Optics and Photonics, 1999. 2</paragraph>
<paragraph><location><page_9><loc_8><loc_25><loc_47><loc_32></location>- [12] Matthew Hurst. A constraint-based approach to table structure derivation. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2 , ICDAR '03, page 911, USA, 2003. IEEE Computer Society. 2</paragraph>
<paragraph><location><page_9><loc_8><loc_18><loc_47><loc_25></location>- [13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ ement Chatelain, and Thierry Paquet. Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2</paragraph>
<paragraph><location><page_9><loc_8><loc_18><loc_47><loc_25></location>- [13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ment Chatelain, and Thierry Paquet. e Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2</paragraph>
<paragraph><location><page_9><loc_8><loc_14><loc_47><loc_18></location>- [14] Pratik Kayal, Mrinal Anand, Harsh Desai, and Mayank Singh. Icdar 2021 competition on scientific table image recognition to latex, 2021. 2</paragraph>
<paragraph><location><page_9><loc_8><loc_10><loc_47><loc_14></location>- [15] Harold WKuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6</paragraph>
<paragraph><location><page_9><loc_50><loc_82><loc_89><loc_90></location>- [16] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence , 35(12):2891-2903, 2013. 4</paragraph>
@ -229,7 +230,7 @@
<paragraph><location><page_9><loc_50><loc_59><loc_89><loc_67></location>- [19] Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, and Peter Staar. Robust pdf document conversion using recurrent neural networks. Proceedings of the AAAI Conference on Artificial Intelligence , 35(17):15137-15145, May 2021. 1</paragraph>
<paragraph><location><page_9><loc_50><loc_53><loc_89><loc_58></location>- [20] Rujiao Long, Wen Wang, Nan Xue, Feiyu Gao, Zhibo Yang, Yongpan Wang, and Gui-Song Xia. Parsing table structures in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 944-952, 2021. 2</paragraph>
<paragraph><location><page_9><loc_50><loc_45><loc_89><loc_53></location>- [21] Shubham Singh Paliwal, D Vishwanath, Rohit Rahul, Monika Sharma, and Lovekesh Vig. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 128-133. IEEE, 2019. 1</paragraph>
<paragraph><location><page_9><loc_50><loc_30><loc_89><loc_44></location>- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6</paragraph>
<paragraph><location><page_9><loc_50><loc_30><loc_89><loc_44></location>- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´-Buc, E. e Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6</paragraph>
<paragraph><location><page_9><loc_50><loc_21><loc_89><loc_29></location>- [23] Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. Cascadetabnet: An approach for end to end table detection and structure recognition from image-based documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages 572-573, 2020. 1</paragraph>
<paragraph><location><page_9><loc_50><loc_16><loc_89><loc_21></location>- [24] Shah Rukh Qasim, Hassan Mahmood, and Faisal Shafait. Rethinking table recognition using graph neural networks. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 142-147. IEEE, 2019. 3</paragraph>
<paragraph><location><page_9><loc_50><loc_10><loc_89><loc_15></location>- [25] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on</paragraph>

File diff suppressed because one or more lines are too long

View File

@ -1,3 +1,5 @@
2022
## TableFormer: Table Structure Understanding with Transformers.
## Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar IBM Research
@ -276,11 +278,11 @@ In this paper, we presented TableFormer an end-to-end transformer based approach
- [3] Bertrand Couasnon and Aurelie Lemaitre. Recognition of Tables and Forms , pages 647-677. Springer London, London, 2014. 2
- [4] Herv´ e D´ ejean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2
- [4] Herv´ e D´jean, e Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2
- [5] Basilios Gatos, Dimitrios Danatsas, Ioannis Pratikakis, and Stavros J Perantonis. Automatic table detection in document images. In International Conference on Pattern Recognition and Image Analysis , pages 609-618. Springer, 2005. 2
- [6] MaxG¨ obel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2
- [6] MaxG¨bel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. o Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2
- [7] EA Green and M Krishnamoorthy. Recognition of tables using table grammars. procs. In Symposium on Document Analysis and Recognition (SDAIR'95) , pages 261-277. 2
@ -294,7 +296,7 @@ In this paper, we presented TableFormer an end-to-end transformer based approach
- [12] Matthew Hurst. A constraint-based approach to table structure derivation. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2 , ICDAR '03, page 911, USA, 2003. IEEE Computer Society. 2
- [13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ ement Chatelain, and Thierry Paquet. Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2
- [13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ment Chatelain, and Thierry Paquet. e Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2
- [14] Pratik Kayal, Mrinal Anand, Harsh Desai, and Mayank Singh. Icdar 2021 competition on scientific table image recognition to latex, 2021. 2
@ -312,7 +314,7 @@ In this paper, we presented TableFormer an end-to-end transformer based approach
- [21] Shubham Singh Paliwal, D Vishwanath, Rohit Rahul, Monika Sharma, and Lovekesh Vig. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 128-133. IEEE, 2019. 1
- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6
- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´-Buc, E. e Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6
- [23] Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. Cascadetabnet: An approach for end to end table detection and structure recognition from image-based documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages 572-573, 2020. 1

File diff suppressed because one or more lines are too long

View File

@ -1,4 +1,5 @@
<document>
<paragraph><location><page_1><loc_3><loc_74><loc_6><loc_79></location>2022</paragraph>
<subtitle-level-1><location><page_1><loc_18><loc_85><loc_83><loc_89></location>DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis</subtitle-level-1>
<paragraph><location><page_1><loc_15><loc_77><loc_32><loc_83></location>Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com</paragraph>
<paragraph><location><page_1><loc_42><loc_77><loc_58><loc_83></location>Christoph Auer IBM Research Rueschlikon, Switzerland cau@zurich.ibm.com</paragraph>
@ -24,7 +25,7 @@
<paragraph><location><page_1><loc_52><loc_11><loc_91><loc_18></location>Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar. 2022. DocLayNet: A Large Human-Annotated Dataset for DocumentLayout Analysis. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), August 14-18, 2022, Washington, DC, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/ 3534678.3539043</paragraph>
<subtitle-level-1><location><page_2><loc_9><loc_88><loc_26><loc_89></location>1 INTRODUCTION</subtitle-level-1>
<paragraph><location><page_2><loc_9><loc_71><loc_50><loc_86></location>Despite the substantial improvements achieved with machine-learning (ML) approaches and deep neural networks in recent years, document conversion remains a challenging problem, as demonstrated by the numerous public competitions held on this topic [1-4]. The challenge originates from the huge variability in PDF documents regarding layout, language and formats (scanned, programmatic or a combination of both). Engineering a single ML model that can be applied on all types of documents and provides high-quality layout segmentation remains to this day extremely challenging [5]. To highlight the variability in document layouts, we show a few example documents from the DocLayNet dataset in Figure 1.</paragraph>
<paragraph><location><page_2><loc_9><loc_37><loc_48><loc_71></location>Akeyproblem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or L A T E X sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank is very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.</paragraph>
<paragraph><location><page_2><loc_9><loc_37><loc_48><loc_71></location>Akeyproblem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or LT E X A sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank is very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.</paragraph>
<paragraph><location><page_2><loc_9><loc_27><loc_48><loc_36></location>In this paper, we present the DocLayNet dataset. It provides pageby-page layout annotation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique document pages, of which a fraction carry double- or triple-annotations. DocLayNet is similar in spirit to PubLayNet and DocBank and will likewise be made available to the public 1 in order to stimulate the document-layout analysis community. It distinguishes itself in the following aspects:</paragraph>
<paragraph><location><page_2><loc_11><loc_22><loc_48><loc_26></location>- (1) Human Annotation : In contrast to PubLayNet and DocBank, we relied on human annotation instead of automation approaches to generate the data set.</paragraph>
<paragraph><location><page_2><loc_11><loc_20><loc_48><loc_22></location>- (2) Large Layout Variability : We include diverse and complex layouts from a large variety of public sources.</paragraph>
@ -168,7 +169,7 @@
</table>
<paragraph><location><page_7><loc_52><loc_47><loc_91><loc_58></location>lists in PubLayNet (grouped list-items) versus DocLayNet (separate list-items), the label set of size 4 is the closest to PubLayNet, in the assumption that the List is down-mapped to Text in PubLayNet. The results in Table 3 show that the prediction accuracy on the remaining class labels does not change significantly when other classes are merged into them. The overall macro-average improves by around 5%, in particular when Page-footer and Page-header are excluded.</paragraph>
<subtitle-level-1><location><page_7><loc_52><loc_45><loc_90><loc_46></location>Impact of Document Split in Train and Test Set</subtitle-level-1>
<paragraph><location><page_7><loc_52><loc_25><loc_91><loc_44></location>Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: pagewise splitting gains ˜ 0% in mAP over the document-wise splitting. 1 Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.</paragraph>
<paragraph><location><page_7><loc_52><loc_25><loc_91><loc_44></location>Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: pagewise splitting gains ˜ 10% in mAP over the document-wise splitting. Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.</paragraph>
<subtitle-level-1><location><page_7><loc_52><loc_22><loc_68><loc_23></location>Dataset Comparison</subtitle-level-1>
<paragraph><location><page_7><loc_52><loc_11><loc_91><loc_21></location>Throughout this paper, we claim that DocLayNet's wider variety of document layouts leads to more robust layout detection models. In Table 5, we provide evidence for that. We trained models on each of the available datasets (PubLayNet, DocBank and DocLayNet) and evaluated them on the test sets of the other datasets. Due to the different label sets and annotation styles, a direct comparison is not possible. Hence, we focussed on the common labels among the datasets. Between PubLayNet and DocLayNet, these are Picture ,</paragraph>
<paragraph><location><page_8><loc_9><loc_81><loc_48><loc_89></location>Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network across the PubLayNet, DocBank & DocLayNet data-sets. Byevaluating on commonlabel classes of each dataset, we observe that the DocLayNet-trained model has muchless pronounced variations in performance across all datasets.</paragraph>

File diff suppressed because one or more lines are too long

View File

@ -1,3 +1,5 @@
2022
## DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com
@ -43,7 +45,7 @@ Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staa
Despite the substantial improvements achieved with machine-learning (ML) approaches and deep neural networks in recent years, document conversion remains a challenging problem, as demonstrated by the numerous public competitions held on this topic [1-4]. The challenge originates from the huge variability in PDF documents regarding layout, language and formats (scanned, programmatic or a combination of both). Engineering a single ML model that can be applied on all types of documents and provides high-quality layout segmentation remains to this day extremely challenging [5]. To highlight the variability in document layouts, we show a few example documents from the DocLayNet dataset in Figure 1.
Akeyproblem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or L A T E X sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank is very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.
Akeyproblem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or LT E X A sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank is very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.
In this paper, we present the DocLayNet dataset. It provides pageby-page layout annotation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique document pages, of which a fraction carry double- or triple-annotations. DocLayNet is similar in spirit to PubLayNet and DocBank and will likewise be made available to the public 1 in order to stimulate the document-layout analysis community. It distinguishes itself in the following aspects:
@ -99,7 +101,7 @@ The annotation campaign was carried out in four phases. In phase one, we identif
Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row 'Total') in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.
| | | % of Total | % of Total | % of Total | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) |
|----------------|---------|--------------|--------------|--------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
|----------------|---------|--------------|--------------|--------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|
| class label | Count | Train | Test | Val | All | Fin | Man | Sci | Law | Pat | Ten |
| Caption | 22524 | 2.04 | 1.77 | 2.32 | 84-89 | 40-61 | 86-92 | 94-99 | 95-99 | 69-78 | n/a |
| Footnote | 6318 | 0.60 | 0.31 | 0.58 | 83-91 | n/a | 100 | 62-88 | 85-94 | n/a | 82-97 |
@ -239,7 +241,7 @@ lists in PubLayNet (grouped list-items) versus DocLayNet (separate list-items),
## Impact of Document Split in Train and Test Set
Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: pagewise splitting gains ˜ 0% in mAP over the document-wise splitting. 1 Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.
Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: pagewise splitting gains ˜ 10% in mAP over the document-wise splitting. Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.
## Dataset Comparison

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@ -1,4 +1,5 @@
<document>
<paragraph><location><page_1><loc_3><loc_74><loc_6><loc_79></location>2023</paragraph>
<subtitle-level-1><location><page_1><loc_22><loc_82><loc_79><loc_85></location>Optimized Table Tokenization for Table Structure Recognition</subtitle-level-1>
<paragraph><location><page_1><loc_23><loc_75><loc_78><loc_79></location>Maksym Lysak [0000 - 0002 - 3723 - 6960] , Ahmed Nassar [0000 - 0002 - 9468 - 0822] , Nikolaos Livathinos [0000 - 0001 - 8513 - 3491] , Christoph Auer [0000 - 0001 - 5761 - 0422] , [0000 - 0002 - 8088 - 0823]</paragraph>
<paragraph><location><page_1><loc_38><loc_74><loc_49><loc_75></location>and Peter Staar</paragraph>

File diff suppressed because one or more lines are too long

View File

@ -1,3 +1,5 @@
2023
## Optimized Table Tokenization for Table Structure Recognition
Maksym Lysak [0000 - 0002 - 3723 - 6960] , Ahmed Nassar [0000 - 0002 - 9468 - 0822] , Nikolaos Livathinos [0000 - 0001 - 8513 - 3491] , Christoph Auer [0000 - 0001 - 5761 - 0422] , [0000 - 0002 - 8088 - 0823]

File diff suppressed because one or more lines are too long

View File

@ -2,7 +2,7 @@
<paragraph><location><page_1><loc_12><loc_88><loc_52><loc_94></location>pulleys, provided the inner race of the bearing is clamped to the supporting structure by the nut and bolt. Plates must be attached to the structure in a positive manner to eliminate rotation or misalignment when tightening the bolts or screws.</paragraph>
<paragraph><location><page_1><loc_12><loc_77><loc_52><loc_86></location>The two general types of self-locking nuts currently in use are the all-metal type and the fiber lock type. For the sake of simplicity, only three typical kinds of self-locking nuts are considered in this handbook: the Boots self-locking and the stainless steel self-locking nuts, representing the all-metal types; andthe elastic stop nut, representing the fiber insert type.</paragraph>
<subtitle-level-1><location><page_1><loc_12><loc_74><loc_28><loc_75></location>Boots Self-Locking Nut</subtitle-level-1>
<paragraph><location><page_1><loc_12><loc_64><loc_52><loc_73></location>nut is of one piece, all-metal The Boots self-locking construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.</paragraph>
<paragraph><location><page_1><loc_12><loc_64><loc_52><loc_73></location>The Boots self-locking nut is of one piece, all-metal construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.</paragraph>
<paragraph><location><page_1><loc_12><loc_52><loc_52><loc_61></location>The spring keeps the locking and load-carrying sections such a distance apart that the two sets of threads are out of phase or spaced so that a bolt, which has been screwed through the load-carrying section, must push the locking section outward against the force of the spring to engage the threads of the locking section properly.</paragraph>
<paragraph><location><page_1><loc_12><loc_38><loc_52><loc_50></location>The spring, through the mediumofthe locking section, exerts a constant locking force on the bolt in the same direction as a force that would tighten the nut. In this nut, the load-carrying section has the thread strength of a standard nut of comparable size, while the locking section presses against the threads of the bolt and locks the nut firmly in position. Only a wrench applied to the nut loosens it. The nut can be removed and reused without impairing its efficiency.</paragraph>
<paragraph><location><page_1><loc_12><loc_33><loc_52><loc_36></location>Boots self-locking nuts are made with three different spring styles and in various shapes and sizes. The wing type that is</paragraph>
@ -10,8 +10,8 @@
<location><page_1><loc_12><loc_10><loc_52><loc_31></location>
<caption>Figure 7-26. Self-locking nuts.</caption>
</figure>
<paragraph><location><page_1><loc_54><loc_85><loc_94><loc_94></location>the most common ranges in size for No. 6 up to 1 4 inch, the / Rol-top ranges from 1 4 inch to / 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.</paragraph>
<paragraph><location><page_1><loc_54><loc_83><loc_55><loc_84></location>.</paragraph>
<paragraph><location><page_1><loc_54><loc_85><loc_94><loc_94></location>the most common ranges in size for No. 6 up to 1 / 4 inch, the Rol-top ranges from 1 / 4 inch to 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.</paragraph>
<paragraph><location><page_1><loc_54><loc_83><loc_54><loc_84></location>.</paragraph>
<subtitle-level-1><location><page_1><loc_54><loc_82><loc_76><loc_83></location>Stainless Steel Self-Locking Nut</subtitle-level-1>
<paragraph><location><page_1><loc_54><loc_54><loc_94><loc_81></location>The stainless steel self-locking nut may be spun on and off by hand as its locking action takes places only when the nut is seated against a solid surface and tightened. The nut consists of two parts: a case with a beveled locking shoulder and key and a thread insert with a locking shoulder and slotted keyway. Until the nut is tightened, it spins on the bolt easily, because the threaded insert is the proper size for the bolt. However, when the nut is seated against a solid surface and tightened, the locking shoulder of the insert is pulled downward and wedged against the locking shoulder of the case. This action compresses the threaded insert and causes it to clench the bolt tightly. The cross-sectional view in Figure 7-27 shows how the key of the case fits into the slotted keyway of the insert so that when the case is turned, the threaded insert is turned with it. Note that the slot is wider than the key. This permits the slot to be narrowed and the insert to be compressed when the nut is tightened.</paragraph>
<subtitle-level-1><location><page_1><loc_54><loc_51><loc_65><loc_52></location>Elastic Stop Nut</subtitle-level-1>

File diff suppressed because one or more lines are too long

View File

@ -4,7 +4,7 @@ The two general types of self-locking nuts currently in use are the all-metal ty
## Boots Self-Locking Nut
nut is of one piece, all-metal The Boots self-locking construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.
The Boots self-locking nut is of one piece, all-metal construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.
The spring keeps the locking and load-carrying sections such a distance apart that the two sets of threads are out of phase or spaced so that a bolt, which has been screwed through the load-carrying section, must push the locking section outward against the force of the spring to engage the threads of the locking section properly.
@ -15,7 +15,7 @@ Boots self-locking nuts are made with three different spring styles and in vario
Figure 7-26. Self-locking nuts.
<!-- image -->
the most common ranges in size for No. 6 up to 1 4 inch, the / Rol-top ranges from 1 4 inch to / 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.
the most common ranges in size for No. 6 up to 1 / 4 inch, the Rol-top ranges from 1 / 4 inch to 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.
.

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@ -3,7 +3,7 @@
<figure>
<location><page_1><loc_84><loc_93><loc_96><loc_97></location>
</figure>
<subtitle-level-1><location><page_1><loc_6><loc_79><loc_96><loc_89></location>Row and Column Access Control Support in IBM DB2 for i</subtitle-level-1>
<subtitle-level-1><location><page_1><loc_6><loc_79><loc_94><loc_89></location>RowandColumnAccessControl Support in IBM DB2 for i</subtitle-level-1>
<figure>
<location><page_1><loc_5><loc_11><loc_96><loc_63></location>
</figure>
@ -16,7 +16,7 @@
<figure>
<location><page_3><loc_23><loc_64><loc_29><loc_66></location>
</figure>
<subtitle-level-1><location><page_3><loc_24><loc_57><loc_31><loc_59></location>Highlights</subtitle-level-1>
<subtitle-level-1><location><page_3><loc_24><loc_57><loc_30><loc_59></location>Highlights</subtitle-level-1>
<paragraph><location><page_3><loc_24><loc_55><loc_40><loc_57></location>- /g115/g3 /g40/g81/g75/g68/g81/g70/g72/g3 /g87/g75/g72/g3 /g83/g72/g85/g73/g82/g85/g80/g68/g81/g70/g72/g3 /g82/g73/g3 /g92/g82/g88/g85/g3 /g71/g68/g87/g68/g69/g68/g86/g72/g3 /g82/g83/g72/g85/g68/g87/g76/g82/g81/g86</paragraph>
<paragraph><location><page_3><loc_24><loc_51><loc_42><loc_54></location>- /g115/g3 /g40/g68/g85/g81/g3 /g74/g85/g72/g68/g87/g72/g85/g3 /g85 /g72/g87/g88/g85/g81/g3 /g82/g81/g3 /g44/g55/g3 /g83/g85/g82/g77/g72/g70/g87/g86/g3 /g87/g75/g85 /g82/g88/g74/g75/g3 /g80/g82/g71/g72/g85/g81/g76/g93/g68/g87/g76/g82/g81/g3 /g82/g73/g3 /g71/g68/g87/g68/g69/g68/g86/g72/g3 /g68/g81/g71/g3 /g68/g83/g83/g79/g76/g70/g68/g87/g76/g82/g81/g86</paragraph>
<paragraph><location><page_3><loc_24><loc_48><loc_41><loc_50></location>- /g115/g3 /g53/g72/g79/g92/g3 /g82/g81/g3 /g44/g37/g48/g3 /g72/g91/g83/g72/g85/g87/g3 /g70/g82/g81/g86/g88/g79/g87/g76/g81/g74/g15/g3 /g86/g78/g76/g79/g79/g86/g3 /g86/g75/g68/g85/g76/g81/g74/g3 /g68/g81/g71/g3 /g85/g72/g81/g82/g90/g81/g3 /g86/g72/g85/g89/g76/g70/g72/g86</paragraph>
@ -25,10 +25,10 @@
<location><page_3><loc_10><loc_13><loc_42><loc_24></location>
</figure>
<paragraph><location><page_3><loc_75><loc_82><loc_83><loc_83></location>Power Services</paragraph>
<subtitle-level-1><location><page_3><loc_46><loc_65><loc_76><loc_71></location>DB2 for i Center of Excellence</subtitle-level-1>
<subtitle-level-1><location><page_3><loc_46><loc_65><loc_75><loc_71></location>DB2 for i Center of Excellence</subtitle-level-1>
<paragraph><location><page_3><loc_46><loc_64><loc_79><loc_65></location>Expert help to achieve your business requirements</paragraph>
<subtitle-level-1><location><page_3><loc_46><loc_59><loc_72><loc_60></location>We build confident, satisfied clients</subtitle-level-1>
<paragraph><location><page_3><loc_46><loc_56><loc_80><loc_59></location>No one else has the vast consulting experiences, skills sharing and renown service offerings to do what we can do for you.</paragraph>
<paragraph><location><page_3><loc_46><loc_56><loc_79><loc_59></location>No one else has the vast consulting experiences, skills sharing and renown service offerings to do what we can do for you.</paragraph>
<paragraph><location><page_3><loc_46><loc_54><loc_60><loc_55></location>Because no one else is IBM.</paragraph>
<paragraph><location><page_3><loc_46><loc_46><loc_82><loc_52></location>With combined experiences and direct access to development groups, we're the experts in IBM DB2® for i. The DB2 for i Center of Excellence (CoE) can help you achieve-perhaps reexamine and exceed-your business requirements and gain more confidence and satisfaction in IBM product data management products and solutions.</paragraph>
<subtitle-level-1><location><page_3><loc_46><loc_44><loc_71><loc_45></location>Who we are, some of what we do</subtitle-level-1>
@ -49,7 +49,7 @@
<figure>
<location><page_4><loc_23><loc_36><loc_41><loc_53></location>
</figure>
<paragraph><location><page_4><loc_43><loc_35><loc_89><loc_53></location>Jim Bainbridge is a senior DB2 consultant on the DB2 for i Center of Excellence team in the IBM Lab Services and Training organization. His primary role is training and implementation services for IBM DB2 Web Query for i and business analytics. Jim began his career with IBM 30 years ago in the IBM Rochester Development Lab, where he developed cooperative processing products that paired IBM PCs with IBM S/36 and AS/.400 systems. In the years since, Jim has held numerous technical roles, including independent software vendors technical support on a broad range of IBM technologies and products, and supporting customers in the IBM Executive Briefing Center and IBM Project Office.</paragraph>
<paragraph><location><page_4><loc_43><loc_35><loc_88><loc_53></location>Jim Bainbridge is a senior DB2 consultant on the DB2 for i Center of Excellence team in the IBM Lab Services and Training organization. His primary role is training and implementation services for IBM DB2 Web Query for i and business analytics. Jim began his career with IBM 30 years ago in the IBM Rochester Development Lab, where he developed cooperative processing products that paired IBM PCs with IBM S/36 and AS/.400 systems. In the years since, Jim has held numerous technical roles, including independent software vendors technical support on a broad range of IBM technologies and products, and supporting customers in the IBM Executive Briefing Center and IBM Project Office.</paragraph>
<figure>
<location><page_4><loc_24><loc_20><loc_41><loc_33></location>
</figure>
@ -60,7 +60,7 @@
</figure>
<paragraph><location><page_5><loc_82><loc_84><loc_85><loc_88></location>1</paragraph>
<paragraph><location><page_5><loc_13><loc_65><loc_19><loc_66></location>Chapter 1.</paragraph>
<subtitle-level-1><location><page_5><loc_22><loc_61><loc_90><loc_68></location>Securing and protecting IBM DB2 data</subtitle-level-1>
<subtitle-level-1><location><page_5><loc_22><loc_61><loc_89><loc_68></location>Securing and protecting IBM DB2 data</subtitle-level-1>
<paragraph><location><page_5><loc_22><loc_46><loc_89><loc_56></location>Recent news headlines are filled with reports of data breaches and cyber-attacks impacting global businesses of all sizes. The Identity Theft Resource Center 1 reports that almost 5000 data breaches have occurred since 2005, exposing over 600 million records of data. The financial cost of these data breaches is skyrocketing. Studies from the Ponemon Institute 2 revealed that the average cost of a data breach increased in 2013 by 15% globally and resulted in a brand equity loss of $9.4 million per attack. The average cost that is incurred for each lost record containing sensitive information increased more than 9% to $145 per record.</paragraph>
<paragraph><location><page_5><loc_22><loc_38><loc_86><loc_44></location>Businesses must make a serious effort to secure their data and recognize that securing information assets is a cost of doing business. In many parts of the world and in many industries, securing the data is required by law and subject to audits. Data security is no longer an option; it is a requirement.</paragraph>
<paragraph><location><page_5><loc_22><loc_34><loc_89><loc_37></location>This chapter describes how you can secure and protect data in DB2 for i. The following topics are covered in this chapter:</paragraph>
@ -71,14 +71,14 @@
<paragraph><location><page_6><loc_22><loc_84><loc_89><loc_87></location>Before reviewing database security techniques, there are two fundamental steps in securing information assets that must be described:</paragraph>
<paragraph><location><page_6><loc_22><loc_77><loc_89><loc_83></location>- /SM590000 First, and most important, is the definition of a company's security policy . Without a security policy, there is no definition of what are acceptable practices for using, accessing, and storing information by who, what, when, where, and how. A security policy should minimally address three things: confidentiality, integrity, and availability.</paragraph>
<paragraph><location><page_6><loc_25><loc_66><loc_89><loc_76></location>- The monitoring and assessment of adherence to the security policy determines whether your security strategy is working. Often, IBM security consultants are asked to perform security assessments for companies without regard to the security policy. Although these assessments can be useful for observing how the system is defined and how data is being accessed, they cannot determine the level of security without a security policy. Without a security policy, it really is not an assessment as much as it is a baseline for monitoring the changes in the security settings that are captured.</paragraph>
<paragraph><location><page_6><loc_25><loc_64><loc_89><loc_65></location>A security policy is what defines whether the system and its settings are secure (or not).</paragraph>
<paragraph><location><page_6><loc_25><loc_64><loc_88><loc_65></location>A security policy is what defines whether the system and its settings are secure (or not).</paragraph>
<paragraph><location><page_6><loc_22><loc_53><loc_89><loc_63></location>- /SM590000 The second fundamental in securing data assets is the use of resource security . If implemented properly, resource security prevents data breaches from both internal and external intrusions. Resource security controls are closely tied to the part of the security policy that defines who should have access to what information resources. A hacker might be good enough to get through your company firewalls and sift his way through to your system, but if they do not have explicit access to your database, the hacker cannot compromise your information assets.</paragraph>
<paragraph><location><page_6><loc_22><loc_48><loc_87><loc_51></location>With your eyes now open to the importance of securing information assets, the rest of this chapter reviews the methods that are available for securing database resources on IBM i.</paragraph>
<subtitle-level-1><location><page_6><loc_11><loc_43><loc_53><loc_45></location>1.2 Current state of IBM i security</subtitle-level-1>
<paragraph><location><page_6><loc_22><loc_35><loc_89><loc_41></location>Because of the inherently secure nature of IBM i, many clients rely on the default system settings to protect their business data that is stored in DB2 for i. In most cases, this means no data protection because the default setting for the Create default public authority (QCRTAUT) system value is *CHANGE.</paragraph>
<paragraph><location><page_6><loc_22><loc_26><loc_89><loc_33></location>Even more disturbing is that many IBM i clients remain in this state, despite the news headlines and the significant costs that are involved with databases being compromised. This default security configuration makes it quite challenging to implement basic security policies. A tighter implementation is required if you really want to protect one of your company's most valuable assets, which is the data.</paragraph>
<paragraph><location><page_6><loc_22><loc_14><loc_89><loc_24></location>Traditionally, IBM i applications have employed menu-based security to counteract this default configuration that gives all users access to the data. The theory is that data is protected by the menu options controlling what database operations that the user can perform. This approach is ineffective, even if the user profile is restricted from running interactive commands. The reason is that in today's connected world there are a multitude of interfaces into the system, from web browsers to PC clients, that bypass application menus. If there are no object-level controls, users of these newer interfaces have an open door to your data.</paragraph>
<paragraph><location><page_7><loc_22><loc_81><loc_89><loc_91></location>Many businesses are trying to limit data access to a need-to-know basis. This security goal means that users should be given access only to the minimum set of data that is required to perform their job. Often, users with object-level access are given access to row and column values that are beyond what their business task requires because that object-level security provides an all-or-nothing solution. For example, object-level controls allow a manager to access data about all employees. Most security policies limit a manager to accessing data only for the employees that they manage.</paragraph>
<paragraph><location><page_7><loc_22><loc_81><loc_88><loc_91></location>Many businesses are trying to limit data access to a need-to-know basis. This security goal means that users should be given access only to the minimum set of data that is required to perform their job. Often, users with object-level access are given access to row and column values that are beyond what their business task requires because that object-level security provides an all-or-nothing solution. For example, object-level controls allow a manager to access data about all employees. Most security policies limit a manager to accessing data only for the employees that they manage.</paragraph>
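The manager scenario above is exactly what a row permission expresses. A minimal sketch, with the schema, table, and manager column names assumed for illustration only:

```sql
-- Sketch only: restrict managers to the rows of employees they manage.
-- HR_SCHEMA.EMPLOYEES and MANAGER_OF_EMPLOYEE are assumed names.
CREATE PERMISSION HR_SCHEMA.PERM_MGR_OWN_EMPLOYEES
  ON HR_SCHEMA.EMPLOYEES
  FOR ROWS WHERE VERIFY_GROUP_FOR_USER(SESSION_USER, 'MGR') = 1
             AND MANAGER_OF_EMPLOYEE = SESSION_USER
  ENFORCED FOR ALL ACCESS
  ENABLE;
```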
<subtitle-level-1><location><page_7><loc_11><loc_77><loc_49><loc_78></location>1.3.1 Existing row and column control</subtitle-level-1>
<paragraph><location><page_7><loc_22><loc_68><loc_88><loc_75></location>Some IBM i clients have tried augmenting the all-or-nothing object-level security with SQL views (or logical files) and application logic, as shown in Figure 1-2. However, application-based logic is easy to bypass with all of the different data access interfaces that are provided by the IBM i operating system, such as Open Database Connectivity (ODBC) and System i Navigator.</paragraph>
<paragraph><location><page_7><loc_22><loc_60><loc_89><loc_66></location>Using SQL views to limit access to a subset of the data in a table also has its own set of challenges. First, there is the complexity of managing all of the SQL view objects that are used for securing data access. Second, scaling a view-based security solution can be difficult as the amount of data grows and the number of users increases.</paragraph>
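For contrast, the view-based approach described above typically looks like the following. This is a sketch with assumed names, shown only to illustrate why one view per audience quickly becomes hard to manage:

```sql
-- Traditional approach: a dedicated view per audience, plus application logic
-- to route users to the right view. All names are assumed for illustration.
CREATE VIEW HR_SCHEMA.EMPLOYEES_FOR_MANAGERS AS
  SELECT *
    FROM HR_SCHEMA.EMPLOYEES
   WHERE MANAGER_OF_EMPLOYEE = SESSION_USER;
```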
@ -92,10 +92,10 @@
<paragraph><location><page_8><loc_22><loc_84><loc_49><loc_86></location>- /SM590000 Work Function Usage ( WRKFCNUSG )</paragraph>
<paragraph><location><page_8><loc_22><loc_83><loc_51><loc_84></location>- /SM590000 Change Function Usage ( CHGFCNUSG )</paragraph>
<paragraph><location><page_8><loc_22><loc_81><loc_51><loc_83></location>- /SM590000 Display Function Usage ( DSPFCNUSG )</paragraph>
<paragraph><location><page_8><loc_22><loc_77><loc_84><loc_80></location>For example, the following CHGFCNUSG command shows granting authorization to user HBEDOYA to administer and manage RCAC rules:</paragraph>
<paragraph><location><page_8><loc_22><loc_77><loc_83><loc_80></location>For example, the following CHGFCNUSG command shows granting authorization to user HBEDOYA to administer and manage RCAC rules:</paragraph>
<paragraph><location><page_8><loc_22><loc_75><loc_72><loc_76></location>CHGFCNUSG FCNID(QIBM_DB_SECADM) USER(HBEDOYA) USAGE(*ALLOWED)</paragraph>
<subtitle-level-1><location><page_8><loc_11><loc_71><loc_89><loc_72></location>2.1.7 Verifying function usage IDs for RCAC with the FUNCTION_USAGE view</subtitle-level-1>
<paragraph><location><page_8><loc_22><loc_66><loc_85><loc_69></location>The FUNCTION_USAGE view contains function usage configuration details. Table 2-1 describes the columns in the FUNCTION_USAGE view.</paragraph>
<paragraph><location><page_8><loc_22><loc_66><loc_84><loc_69></location>The FUNCTION_USAGE view contains function usage configuration details. Table 2-1 describes the columns in the FUNCTION_USAGE view.</paragraph>
<table>
<location><page_8><loc_22><loc_44><loc_89><loc_63></location>
<caption>Table 2-1 FUNCTION_USAGE view</caption>
@ -108,21 +108,25 @@
<caption><location><page_8><loc_22><loc_64><loc_46><loc_65></location>Table 2-1 FUNCTION_USAGE view</caption>
<paragraph><location><page_8><loc_22><loc_40><loc_89><loc_43></location>To discover who has authorization to define and manage RCAC, you can use the query that is shown in Example 2-1.</paragraph>
<caption><location><page_8><loc_22><loc_38><loc_76><loc_39></location>Example 2-1 Query to determine who has authority to define and manage RCAC</caption>
<paragraph><location><page_8><loc_22><loc_35><loc_41><loc_36></location>SELECT function_id,</paragraph>
<paragraph><location><page_8><loc_22><loc_34><loc_39><loc_35></location>user_name,</paragraph>
<paragraph><location><page_8><loc_22><loc_32><loc_36><loc_33></location>usage,</paragraph>
<paragraph><location><page_8><loc_22><loc_31><loc_39><loc_32></location>user_type</paragraph>
<paragraph><location><page_8><loc_22><loc_29><loc_43><loc_30></location>FROM function_usage</paragraph>
<paragraph><location><page_8><loc_22><loc_28><loc_54><loc_29></location>WHERE function_id='QIBM_DB_SECADM'</paragraph>
<paragraph><location><page_8><loc_22><loc_26><loc_39><loc_27></location>ORDER BY user_name;</paragraph>
<paragraph><location><page_8><loc_22><loc_35><loc_27><loc_36></location>SELECT</paragraph>
<paragraph><location><page_8><loc_31><loc_35><loc_41><loc_36></location>function_id,</paragraph>
<paragraph><location><page_8><loc_31><loc_34><loc_39><loc_35></location>user_name,</paragraph>
<paragraph><location><page_8><loc_31><loc_32><loc_36><loc_33></location>usage,</paragraph>
<paragraph><location><page_8><loc_31><loc_31><loc_39><loc_32></location>user_type</paragraph>
<paragraph><location><page_8><loc_22><loc_29><loc_26><loc_30></location>FROM</paragraph>
<paragraph><location><page_8><loc_31><loc_29><loc_43><loc_30></location>function_usage</paragraph>
<paragraph><location><page_8><loc_22><loc_28><loc_26><loc_29></location>WHERE</paragraph>
<paragraph><location><page_8><loc_31><loc_28><loc_54><loc_29></location>function_id='QIBM_DB_SECADM'</paragraph>
<paragraph><location><page_8><loc_22><loc_26><loc_29><loc_27></location>ORDER BY</paragraph>
<paragraph><location><page_8><loc_31><loc_26><loc_39><loc_27></location>user_name;</paragraph>
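For readability, Example 2-1 as a single statement (the fixture splits it across several paragraph elements):

```sql
-- Example 2-1: who has authority to define and manage RCAC
SELECT function_id,
       user_name,
       usage,
       user_type
  FROM function_usage
 WHERE function_id = 'QIBM_DB_SECADM'
 ORDER BY user_name;
```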
<subtitle-level-1><location><page_8><loc_11><loc_20><loc_41><loc_22></location>2.2 Separation of duties</subtitle-level-1>
<paragraph><location><page_8><loc_22><loc_10><loc_89><loc_18></location>Separation of duties helps businesses comply with industry regulations or organizational requirements and simplifies the management of authorities. Separation of duties is commonly used to prevent fraudulent activities or errors by a single person. It provides the ability for administrative functions to be divided across individuals without overlapping responsibilities, so that one user does not possess unlimited authority, such as with the *ALLOBJ authority.</paragraph>
<paragraph><location><page_9><loc_22><loc_82><loc_89><loc_91></location>For example, assume that a business has assigned the duty to manage security on IBM i to Theresa. Before release IBM i 7.2, to grant privileges, Theresa had to have the same privileges Theresa was granting to others. Therefore, to grant *USE privileges to the PAYROLL table, Theresa had to have *OBJMGT and *USE authority (or a higher level of authority, such as *ALLOBJ). This requirement allowed Theresa to access the data in the PAYROLL table even though Theresa's job description was only to manage its security.</paragraph>
<paragraph><location><page_9><loc_22><loc_82><loc_88><loc_91></location>For example, assume that a business has assigned the duty to manage security on IBM i to Theresa. Before release IBM i 7.2, to grant privileges, Theresa had to have the same privileges Theresa was granting to others. Therefore, to grant *USE privileges to the PAYROLL table, Theresa had to have *OBJMGT and *USE authority (or a higher level of authority, such as *ALLOBJ). This requirement allowed Theresa to access the data in the PAYROLL table even though Theresa's job description was only to manage its security.</paragraph>
<paragraph><location><page_9><loc_22><loc_75><loc_89><loc_81></location>In IBM i 7.2, the QIBM_DB_SECADM function usage grants authorities, revokes authorities, changes ownership, or changes the primary group without giving access to the object or, in the case of a database table, to the data that is in the table or allowing other operations on the table.</paragraph>
<paragraph><location><page_9><loc_22><loc_71><loc_88><loc_73></location>QIBM_DB_SECADM function usage can be granted only by a user with *SECADM special authority and can be given to a user or a group.</paragraph>
<paragraph><location><page_9><loc_22><loc_65><loc_89><loc_69></location>QIBM_DB_SECADM also is responsible for administering RCAC, which restricts which rows a user is allowed to access in a table and whether a user is allowed to see information in certain columns of a table.</paragraph>
<paragraph><location><page_9><loc_22><loc_57><loc_88><loc_63></location>A preferred practice is that the RCAC administrator has the QIBM_DB_SECADM function usage ID, but absolutely no other data privileges. The result is that the RCAC administrator can deploy and maintain the RCAC constructs, but cannot grant themselves unauthorized access to data itself.</paragraph>
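In practice this means the security administrator can perform privilege maintenance without ever being able to read the rows. A minimal sketch, with the grantee profile and the PAYROLL table's schema assumed:

```sql
-- With QIBM_DB_SECADM (and no other data privileges), a security administrator
-- can grant and revoke table privileges but cannot SELECT from the table itself.
-- PAYROLL_CLERK and the HR_SCHEMA qualifier are assumed names.
GRANT SELECT ON HR_SCHEMA.PAYROLL TO PAYROLL_CLERK;
REVOKE SELECT ON HR_SCHEMA.PAYROLL FROM PAYROLL_CLERK;
```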
<paragraph><location><page_9><loc_22><loc_53><loc_89><loc_56></location>Table 2-2 shows a comparison of the different function usage IDs and *JOBCTL authority to the different CL commands and DB2 for i tools.</paragraph>
<paragraph><location><page_9><loc_22><loc_53><loc_88><loc_56></location>Table 2-2 shows a comparison of the different function usage IDs and *JOBCTL authority to the different CL commands and DB2 for i tools.</paragraph>
<table>
<location><page_9><loc_11><loc_9><loc_89><loc_50></location>
<caption>Table 2-2 Comparison of the different function usage IDs and *JOBCTL authority</caption>
@ -147,7 +151,7 @@
<caption>Figure 3-1 CREATE PERMISSION SQL statement</caption>
</figure>
<subtitle-level-1><location><page_10><loc_22><loc_43><loc_35><loc_44></location>Column mask</subtitle-level-1>
<paragraph><location><page_10><loc_22><loc_37><loc_89><loc_43></location>A column mask is a database object that manifests a column value access control rule for a specific column in a specific table. It uses a CASE expression that describes what you see when you access the column. For example, a teller can see only the last four digits of a tax identification number.</paragraph>
<paragraph><location><page_10><loc_22><loc_37><loc_88><loc_43></location>A column mask is a database object that manifests a column value access control rule for a specific column in a specific table. It uses a CASE expression that describes what you see when you access the column. For example, a teller can see only the last four digits of a tax identification number.</paragraph>
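The teller example corresponds to a CASE-based mask along these lines. This is a sketch; the object names follow the BANK_SCHEMA example that appears later in this fixture and are otherwise assumed:

```sql
-- Tellers see only the last four digits of the tax ID; other users in this
-- sketch see the full value. Names mirror the later BANK_SCHEMA example.
CREATE MASK BANK_SCHEMA.MASK_TAX_ID_ON_CUSTOMERS
  ON BANK_SCHEMA.CUSTOMERS AS C
  FOR COLUMN CUSTOMER_TAX_ID RETURN
    CASE
      WHEN QSYS2.VERIFY_GROUP_FOR_USER(SESSION_USER, 'TELLER') = 1
        THEN 'XXX-XX-' CONCAT QSYS2.SUBSTR(C.CUSTOMER_TAX_ID, 8, 4)
      ELSE C.CUSTOMER_TAX_ID
    END
  ENABLE;
```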
<caption><location><page_11><loc_22><loc_90><loc_67><loc_91></location>Table 3-1 summarizes these special registers and their values.</caption>
<table>
<location><page_11><loc_22><loc_74><loc_89><loc_87></location>
@ -169,7 +173,7 @@
<caption>Figure 3-5 Special registers and adopted authority</caption>
</figure>
<subtitle-level-1><location><page_11><loc_11><loc_20><loc_40><loc_21></location>3.2.2 Built-in global variables</subtitle-level-1>
<paragraph><location><page_11><loc_22><loc_15><loc_85><loc_18></location>Built-in global variables are provided with the database manager and are used in SQL statements to retrieve scalar values that are associated with the variables.</paragraph>
<paragraph><location><page_11><loc_22><loc_15><loc_84><loc_18></location>Built-in global variables are provided with the database manager and are used in SQL statements to retrieve scalar values that are associated with the variables.</paragraph>
<paragraph><location><page_11><loc_22><loc_9><loc_87><loc_13></location>IBM DB2 for i supports nine different built-in global variables that are read only and maintained by the system. These global variables can be used to identify attributes of the database connection and used as part of the RCAC logic.</paragraph>
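As an illustration of using such a variable inside RCAC logic, here is a minimal sketch. It assumes QSYS2.CLIENT_IPADDR is one of the available built-in global variables; all object names and the address range are invented for the example:

```sql
-- Expose rows only to connections from an assumed internal 10.x network.
-- QSYS2.CLIENT_IPADDR and all object names here are assumptions for the sketch.
CREATE PERMISSION HR_SCHEMA.PERM_INTERNAL_NETWORK_ONLY
  ON HR_SCHEMA.EMPLOYEES
  FOR ROWS WHERE QSYS2.CLIENT_IPADDR LIKE '10.%'
  ENFORCED FOR ALL ACCESS
  ENABLE;
```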
<paragraph><location><page_12><loc_22><loc_90><loc_56><loc_91></location>Table 3-2 lists the nine built-in global variables.</paragraph>
<table>
@ -193,11 +197,12 @@
<paragraph><location><page_12><loc_22><loc_36><loc_75><loc_38></location>Here is an example of using the VERIFY_GROUP_FOR_USER function:</paragraph>
<paragraph><location><page_12><loc_22><loc_34><loc_66><loc_35></location>- 1. There are user profiles for MGR, JANE, JUDY, and TONY.</paragraph>
<paragraph><location><page_12><loc_22><loc_32><loc_65><loc_33></location>- 2. The user profile JANE specifies a group profile of MGR.</paragraph>
<paragraph><location><page_12><loc_22><loc_28><loc_88><loc_31></location>- 3. If a user is connected to the server using user profile JANE, all of the following function invocations return a value of 1:</paragraph>
<paragraph><location><page_12><loc_25><loc_19><loc_74><loc_27></location>VERIFY_GROUP_FOR_USER (CURRENT_USER, 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR', 'STEVE') The following function invocation returns a value of 0: VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JUDY', 'TONY')</paragraph>
<paragraph><location><page_12><loc_22><loc_28><loc_87><loc_31></location>- 3. If a user is connected to the server using user profile JANE, all of the following function invocations return a value of 1:</paragraph>
<paragraph><location><page_12><loc_25><loc_19><loc_67><loc_27></location>VERIFY_GROUP_FOR_USER (CURRENT_USER, 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR', The following function invocation returns a value of 0: VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JUDY', 'TONY')</paragraph>
<paragraph><location><page_12><loc_67><loc_23><loc_74><loc_24></location>'STEVE')</paragraph>
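The invocation list above is split across elements ('STEVE' ends up on its own line). Reassembled, and wrapped in a SELECT so it can be run as-is, the example reads:

```sql
-- Connected as JANE (group profile MGR), the first three calls each return 1;
-- the last call returns 0.
SELECT VERIFY_GROUP_FOR_USER(CURRENT_USER, 'MGR')                  AS mgr_check,
       VERIFY_GROUP_FOR_USER(CURRENT_USER, 'JANE', 'MGR')          AS jane_mgr_check,
       VERIFY_GROUP_FOR_USER(CURRENT_USER, 'JANE', 'MGR', 'STEVE') AS jane_mgr_steve_check,
       VERIFY_GROUP_FOR_USER(CURRENT_USER, 'JUDY', 'TONY')         AS judy_tony_check
  FROM sysibm.sysdummy1;
```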
<paragraph><location><page_13><loc_22><loc_90><loc_27><loc_91></location>RETURN</paragraph>
<paragraph><location><page_13><loc_22><loc_88><loc_26><loc_89></location>CASE</paragraph>
<paragraph><location><page_13><loc_22><loc_67><loc_85><loc_88></location>WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR', 'EMP' ) = 1 THEN EMPLOYEES . DATE_OF_BIRTH WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER = EMPLOYEES . USER_ID THEN EMPLOYEES . DATE_OF_BIRTH WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER <> EMPLOYEES . USER_ID THEN ( 9999 || '-' || MONTH ( EMPLOYEES . DATE_OF_BIRTH ) || '-' || DAY (EMPLOYEES.DATE_OF_BIRTH )) ELSE NULL END ENABLE ;</paragraph>
<paragraph><location><page_13><loc_23><loc_67><loc_85><loc_88></location>WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR', 'EMP' ) = 1 THEN EMPLOYEES . DATE_OF_BIRTH WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER = EMPLOYEES . USER_ID THEN EMPLOYEES . DATE_OF_BIRTH WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER <> EMPLOYEES . USER_ID THEN ( 9999 || '-' || MONTH ( EMPLOYEES . DATE_OF_BIRTH ) || '-' || DAY (EMPLOYEES.DATE_OF_BIRTH )) ELSE NULL END ENABLE ;</paragraph>
<paragraph><location><page_13><loc_22><loc_63><loc_89><loc_65></location>- 2. The other column to mask in this example is the TAX_ID information. In this example, the rules to enforce include the following ones:</paragraph>
<paragraph><location><page_13><loc_25><loc_60><loc_77><loc_62></location>- -Human Resources can see the unmasked TAX_ID of the employees.</paragraph>
<paragraph><location><page_13><loc_25><loc_58><loc_66><loc_59></location>- -Employees can see only their own unmasked TAX_ID.</paragraph>
@ -215,16 +220,21 @@
<paragraph><location><page_14><loc_22><loc_67><loc_89><loc_71></location>Now that you have created the row permission and the two column masks, RCAC must be activated. The row permission and the two column masks are enabled (last clause in the scripts), but now you must activate RCAC on the table. To do so, complete the following steps:</paragraph>
<paragraph><location><page_14><loc_22><loc_65><loc_67><loc_66></location>- 1. Run the SQL statements that are shown in Example 3-10.</paragraph>
<subtitle-level-1><location><page_14><loc_22><loc_62><loc_61><loc_63></location>Example 3-10 Activating RCAC on the EMPLOYEES table</subtitle-level-1>
<paragraph><location><page_14><loc_22><loc_60><loc_62><loc_61></location>- /* Active Row Access Control (permissions) */</paragraph>
<paragraph><location><page_14><loc_22><loc_58><loc_62><loc_59></location>- /* Active Column Access Control (masks) */</paragraph>
<paragraph><location><page_14><loc_22><loc_57><loc_48><loc_58></location>ALTER TABLE HR_SCHEMA.EMPLOYEES</paragraph>
<paragraph><location><page_14><loc_22><loc_55><loc_44><loc_56></location>ACTIVATE ROW ACCESS CONTROL</paragraph>
<paragraph><location><page_14><loc_22><loc_60><loc_58><loc_61></location>- /* Active Row Access Control (permissions)</paragraph>
<paragraph><location><page_14><loc_60><loc_60><loc_62><loc_61></location>*/</paragraph>
<paragraph><location><page_14><loc_22><loc_58><loc_56><loc_59></location>- /* Active Column Access Control (masks)</paragraph>
<paragraph><location><page_14><loc_22><loc_57><loc_26><loc_58></location>ALTER</paragraph>
<paragraph><location><page_14><loc_27><loc_57><loc_48><loc_58></location>TABLE HR_SCHEMA.EMPLOYEES</paragraph>
<paragraph><location><page_14><loc_22><loc_55><loc_32><loc_56></location>ACTIVATE ROW</paragraph>
<paragraph><location><page_14><loc_33><loc_55><loc_38><loc_56></location>ACCESS</paragraph>
<paragraph><location><page_14><loc_39><loc_55><loc_44><loc_56></location>CONTROL</paragraph>
<paragraph><location><page_14><loc_22><loc_54><loc_48><loc_55></location>ACTIVATE COLUMN ACCESS CONTROL;</paragraph>
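Example 3-10 is split across several elements above; as a single statement it reads:

```sql
/* Activate Row Access Control (permissions) */
/* Activate Column Access Control (masks)    */
ALTER TABLE HR_SCHEMA.EMPLOYEES
  ACTIVATE ROW ACCESS CONTROL
  ACTIVATE COLUMN ACCESS CONTROL;
```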
<paragraph><location><page_14><loc_22><loc_48><loc_88><loc_52></location>- 2. Look at the definition of the EMPLOYEE table, as shown in Figure 3-11. To do this, from the main navigation pane of System i Navigator, click Schemas  HR_SCHEMA  Tables , right-click the EMPLOYEES table, and click Definition .</paragraph>
<figure>
<location><page_14><loc_10><loc_18><loc_87><loc_46></location>
<caption>Figure 3-11 Selecting the EMPLOYEES table from System i Navigator</caption>
</figure>
<paragraph><location><page_14><loc_60><loc_58><loc_62><loc_59></location>*/</paragraph>
<paragraph><location><page_15><loc_22><loc_87><loc_84><loc_91></location>- 2. Figure 4-68 shows the Visual Explain of the same SQL statement, but with RCAC enabled. It is clear that the implementation of the SQL statement is more complex because the row permission rule becomes part of the WHERE clause.</paragraph>
<paragraph><location><page_15><loc_22><loc_32><loc_89><loc_36></location>- 3. Compare the advised indexes that are provided by the Optimizer without RCAC and with RCAC enabled. Figure 4-69 shows the index advice for the SQL statement without RCAC enabled. The index being advised is for the ORDER BY clause.</paragraph>
<figure>
@ -235,14 +245,15 @@
<location><page_15><loc_11><loc_16><loc_83><loc_30></location>
<caption>Figure 4-69 Index advice with no RCAC</caption>
</figure>
<paragraph><location><page_16><loc_11><loc_11><loc_82><loc_91></location>THEN C . CUSTOMER_TAX_ID WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'TELLER' ) = 1 THEN ( 'XXX-XX-' CONCAT QSYS2 . SUBSTR ( C . CUSTOMER_TAX_ID , 8 , 4 ) ) WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_TAX_ID ELSE 'XXX-XX-XXXX' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_DRIVERS_LICENSE_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_DRIVERS_LICENSE_NUMBER RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'TELLER' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER ELSE '*************' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_LOGIN_ID_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_LOGIN_ID RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_LOGIN_ID WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_LOGIN_ID ELSE '*****' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_SECURITY_QUESTION_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_SECURITY_QUESTION RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION ELSE '*****' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_SECURITY_QUESTION_ANSWER_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_SECURITY_QUESTION_ANSWER RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION_ANSWER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION_ANSWER ELSE '*****' END ENABLE ; ALTER TABLE BANK_SCHEMA.CUSTOMERS ACTIVATE ROW ACCESS CONTROL ACTIVATE COLUMN ACCESS CONTROL ;</paragraph>
<paragraph><location><page_16><loc_11><loc_11><loc_80><loc_91></location>THEN C . CUSTOMER_TAX_ID WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'TELLER' ) = 1 THEN ( 'XXX-XX-' CONCAT QSYS2 . SUBSTR ( C . CUSTOMER_TAX_ID , 8 , 4 ) ) WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_TAX_ID ELSE 'XXX-XX-XXXX' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_DRIVERS_LICENSE_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_DRIVERS_LICENSE_NUMBER RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'TELLER' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER ELSE '*************' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_LOGIN_ID_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_LOGIN_ID RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_LOGIN_ID WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_LOGIN_ID ELSE '*****' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_SECURITY_QUESTION_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_SECURITY_QUESTION RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION ELSE '*****' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_SECURITY_QUESTION_ANSWER_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS FOR COLUMN CUSTOMER_SECURITY_QUESTION_ANSWER RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION_ANSWER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION_ANSWER ELSE '*****' END ENABLE ; ALTER TABLE BANK_SCHEMA.CUSTOMERS ACTIVATE ROW ACCESS CONTROL ACTIVATE COLUMN ACCESS CONTROL ;</paragraph>
<paragraph><location><page_16><loc_80><loc_29><loc_81><loc_30></location>C</paragraph>
<paragraph><location><page_18><loc_47><loc_94><loc_68><loc_96></location>Back cover</paragraph>
<subtitle-level-1><location><page_18><loc_4><loc_82><loc_73><loc_91></location>Row and Column Access Control Support in IBM DB2 for i</subtitle-level-1>
<paragraph><location><page_18><loc_4><loc_66><loc_21><loc_69></location>Implement roles and separation of duties</paragraph>
<paragraph><location><page_18><loc_4><loc_59><loc_20><loc_64></location>Leverage row permissions on the database</paragraph>
<paragraph><location><page_18><loc_25><loc_59><loc_68><loc_69></location>This IBM Redpaper publication provides information about the IBM i 7.2 feature of IBM DB2 for i Row and Column Access Control (RCAC). It offers a broad description of the function and advantages of controlling access to data in a comprehensive and transparent way. This publication helps you understand the capabilities of RCAC and provides examples of defining, creating, and implementing the row permissions and column masks in a relational database environment.</paragraph>
<paragraph><location><page_18><loc_4><loc_52><loc_20><loc_57></location>Protect columns by defining column masks</paragraph>
<paragraph><location><page_18><loc_25><loc_51><loc_68><loc_58></location>This paper is intended for database engineers, data-centric application developers, and security officers who want to design and implement RCAC as a part of their data control and governance policy. A solid background in IBM i object level security, DB2 for i relational database concepts, and SQL is assumed.</paragraph>
<subtitle-level-1><location><page_18><loc_4><loc_82><loc_72><loc_91></location>RowandColumnAccessControl Support in IBM DB2 for i</subtitle-level-1>
<paragraph><location><page_18><loc_4><loc_66><loc_20><loc_69></location>Implement roles and separation of duties</paragraph>
<paragraph><location><page_18><loc_4><loc_59><loc_19><loc_64></location>Leverage row permissions on the database</paragraph>
<paragraph><location><page_18><loc_25><loc_59><loc_67><loc_69></location>This IBM Redpaper publication provides information about the IBM i 7.2 feature of IBM DB2 for i Row and Column Access Control (RCAC). It offers a broad description of the function and advantages of controlling access to data in a comprehensive and transparent way. This publication helps you understand the capabilities of RCAC and provides examples of defining, creating, and implementing the row permissions and column masks in a relational database environment.</paragraph>
<paragraph><location><page_18><loc_4><loc_52><loc_19><loc_57></location>Protect columns by defining column masks</paragraph>
<paragraph><location><page_18><loc_25><loc_51><loc_67><loc_58></location>This paper is intended for database engineers, data-centric application developers, and security officers who want to design and implement RCAC as a part of their data control and governance policy. A solid background in IBM i object level security, DB2 for i relational database concepts, and SQL is assumed.</paragraph>
<figure>
<location><page_18><loc_79><loc_93><loc_93><loc_97></location>
</figure>

File diff suppressed because one or more lines are too long


@ -168,7 +168,9 @@ To discover who has authorization to define and manage RCAC, you can use the que
Example 2-1 Query to determine who has authority to define and manage RCAC
SELECT function_id,
SELECT
function_id,
user_name,
@ -176,11 +178,17 @@ usage,
user_type
FROM function_usage
FROM
WHERE function_id='QIBM_DB_SECADM'
function_usage
ORDER BY user_name;
WHERE
function_id='QIBM_DB_SECADM'
ORDER BY
user_name;
## 2.2 Separation of duties
@ -201,7 +209,7 @@ Table 2-2 shows a comparison of the different function usage IDs and *JOBCTL aut
Table 2-2 Comparison of the different function usage IDs and *JOBCTL authority
| User action | *JOBCTL | QIBM_DB_SECADM | QIBM_DB_SQLADM | QIBM_DB_SYSMON | No Authority |
|--------------------------------------------------------------------------------|-----------|------------------|------------------|------------------|----------------|
|-----------------------------------------------------------------------------|-----------|------------------|------------------|------------------|----------------|
| SET CURRENT DEGREE (SQL statement) | X | | X | | |
| CHGQRYA command targeting a different user's job | X | | X | | |
| STRDBMON or ENDDBMON commands targeting a different user's job | X | | X | | |
@ -229,7 +237,7 @@ Table 3-1 summarizes these special registers and their values.
Table 3-1 Special registers and their corresponding values
| Special register | Corresponding value |
|----------------------|---------------------------------------------------------------------------------------------------------------------------------------|
|----------------------|--------------------------------------------------------------------------------------------------------------------------------------|
| USER or SESSION_USER | The effective user of the thread excluding adopted authority. |
| CURRENT_USER | The effective user of the thread including adopted authority. When no adopted authority is present, this has the same value as USER. |
| SYSTEM_USER | The authorization ID that initiated the connection. |
@ -285,7 +293,9 @@ Here is an example of using the VERIFY_GROUP_FOR_USER function:
- 3. If a user is connected to the server using user profile JANE, all of the following function invocations return a value of 1:
VERIFY_GROUP_FOR_USER (CURRENT_USER, 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR', 'STEVE') The following function invocation returns a value of 0: VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JUDY', 'TONY')
VERIFY_GROUP_FOR_USER (CURRENT_USER, 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR') VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JANE', 'MGR', The following function invocation returns a value of 0: VERIFY_GROUP_FOR_USER (CURRENT_USER, 'JUDY', 'TONY')
'STEVE')
RETURN
@ -322,13 +332,21 @@ Now that you have created the row permission and the two column masks, RCAC must
## Example 3-10 Activating RCAC on the EMPLOYEES table
- /* Active Row Access Control (permissions) */
- /* Active Row Access Control (permissions)
- /* Active Column Access Control (masks) */
*/
ALTER TABLE HR_SCHEMA.EMPLOYEES
- /* Active Column Access Control (masks)
ACTIVATE ROW ACCESS CONTROL
ALTER
TABLE HR_SCHEMA.EMPLOYEES
ACTIVATE ROW
ACCESS
CONTROL
ACTIVATE COLUMN ACCESS CONTROL;
@ -337,6 +355,8 @@ ACTIVATE COLUMN ACCESS CONTROL;
Figure 3-11 Selecting the EMPLOYEES table from System i Navigator
<!-- image -->
*/
- 2. Figure 4-68 shows the Visual Explain of the same SQL statement, but with RCAC enabled. It is clear that the implementation of the SQL statement is more complex because the row permission rule becomes part of the WHERE clause.
- 3. Compare the advised indexes that are provided by the Optimizer without RCAC and with RCAC enabled. Figure 4-69 shows the index advice for the SQL statement without RCAC enabled. The index being advised is for the ORDER BY clause.
@ -347,7 +367,9 @@ Figure 4-68 Visual Explain with RCAC enabled
Figure 4-69 Index advice with no RCAC
<!-- image -->
THEN C . CUSTOMER_TAX_ID WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'TELLER' ) = 1 THEN ( 'XXX-XX-' CONCAT QSYS2 . SUBSTR ( C . CUSTOMER_TAX_ID , 8 , 4 ) ) WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_TAX_ID ELSE 'XXX-XX-XXXX' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_DRIVERS_LICENSE_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_DRIVERS_LICENSE_NUMBER RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'TELLER' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER ELSE '*************' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_LOGIN_ID_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_LOGIN_ID RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_LOGIN_ID WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_LOGIN_ID ELSE '*****' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_SECURITY_QUESTION_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_SECURITY_QUESTION RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION ELSE '*****' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_SECURITY_QUESTION_ANSWER_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_SECURITY_QUESTION_ANSWER RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION_ANSWER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION_ANSWER ELSE '*****' END ENABLE ; ALTER TABLE BANK_SCHEMA.CUSTOMERS ACTIVATE ROW ACCESS CONTROL ACTIVATE COLUMN ACCESS CONTROL ;
THEN C . CUSTOMER_TAX_ID WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'TELLER' ) = 1 THEN ( 'XXX-XX-' CONCAT QSYS2 . SUBSTR ( C . CUSTOMER_TAX_ID , 8 , 4 ) ) WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_TAX_ID ELSE 'XXX-XX-XXXX' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_DRIVERS_LICENSE_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_DRIVERS_LICENSE_NUMBER RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'TELLER' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_DRIVERS_LICENSE_NUMBER ELSE '*************' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_LOGIN_ID_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_LOGIN_ID RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_LOGIN_ID WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_LOGIN_ID ELSE '*****' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_SECURITY_QUESTION_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS C FOR COLUMN CUSTOMER_SECURITY_QUESTION RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION ELSE '*****' END ENABLE ; CREATE MASK BANK_SCHEMA.MASK_SECURITY_QUESTION_ANSWER_ON_CUSTOMERS ON BANK_SCHEMA.CUSTOMERS AS FOR COLUMN CUSTOMER_SECURITY_QUESTION_ANSWER RETURN CASE WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'ADMIN' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION_ANSWER WHEN QSYS2 . VERIFY_GROUP_FOR_USER ( SESSION_USER , 'CUSTOMER' ) = 1 THEN C . CUSTOMER_SECURITY_QUESTION_ANSWER ELSE '*****' END ENABLE ; ALTER TABLE BANK_SCHEMA.CUSTOMERS ACTIVATE ROW ACCESS CONTROL ACTIVATE COLUMN ACCESS CONTROL ;
C
Back cover

File diff suppressed because one or more lines are too long


@ -1,7 +1,7 @@
<document>
<subtitle-level-1><location><page_1><loc_37><loc_89><loc_85><loc_90></location>تحسين الإنتاجية وحل المشكلات من خلال البرمجة بلغة R و Python</subtitle-level-1>
<paragraph><location><page_1><loc_15><loc_80><loc_85><loc_87></location>تعتبر البرمجة بلغة R و Python من الأدوات القوية التي يمكن أن تعزز الإنتاجية وتساعد في إيجاد حلول فعالة للمشكلات. يمتلك كل من R و Python ميزات فريدة تجعلها مثالية لتحليل البيانات، مما يسهل على المحللين والعلماء إجراء تحليلات معقدة بطريقة سريعة وفعالة. إذا كان لديك عقلية تحليلية، فإن استخدام هذه اللغات يمكن أن يسهم بشكل كبير في تحسين نتائج العمل .</paragraph>
<paragraph><location><page_1><loc_16><loc_72><loc_85><loc_78></location>عندما يجتمع التفكير التحليلي مع مهارات البرمجة، يصبح من الممكن معالجة كميات هائلة من البيانات واستخراج الأنماط والتوجهات منها. يمكن للمبرمجين استخدام R و Python لتنفيذ عمليات تحليلية متقدمة، مثل النمذجة الإحصائية وتحليل البيانات الكبيرة. هذا ليس فقط يوفر الوقت، بل يمكن أن يؤدي أيضًا إلى اتخاذ قرارات أكثر دقة بناء ً على استنتاجات قائمة على البيانات .</paragraph>
<paragraph><location><page_1><loc_15><loc_63><loc_85><loc_69></location>علاوة على ذلك، توفر كل من R و Python مكتبات وأدوات غنية تدعم مجموعة واسعة م ن التطبيقات، من التحليل البياني إلى التعلم الآلي. يمكن للمستخدمين الاستفادة من هذه المكتبات لتطوير حلول مبتكرة للمشكلات المختلفة. على سبيل المثال، يمكن استخدام مكتبة pandas في Python لإدارة البيانات بكفاءة، بينما توفر R أدوات قوية للرسم البياني والتحليل الإ حصائي، مما يجعلها مثالية للباحثين والمحللين .</paragraph>
<paragraph><location><page_1><loc_17><loc_72><loc_85><loc_78></location>عندما يجتمع التفكير التحليلي مع مهارات البرمجة، يصبح من الممكن معالجة كميات هائلة من البيانات واستخراج الأنماط والتوجهات منها. يمكن للمبرمجين استخدام R و Python لتنفيذ عمليات تحليلية متقدمة، مثل النمذجة الإحصائية وتحليل البيانات الكبيرة. هذا ليس فقط يوفر الوقت، بل يمكن أن يؤدي أيضًا إلى اتخاذ قرارات أكثر دقة بناء ً على استنتاجات قائمة على البيانات .</paragraph>
<paragraph><location><page_1><loc_16><loc_63><loc_85><loc_69></location>علاوة على ذلك، توفر كل من R و Python مكتبات وأدوات غنية تدعم مجموعة واسعة م ن التطبيقات، من التحليل البياني إلى التعلم الآلي. يمكن للمستخدمين الاستفادة من هذه المكتبات لتطوير حلول مبتكرة للمشكلات المختلفة. على سبيل المثال، يمكن استخدام مكتبة pandas في Python لإدارة البيانات بكفاءة، بينما توفر R أدوات قوية للرسم الإ البياني والتحليل حصائي، مما يجعلها مثالية للباحثين والمحللين .</paragraph>
<paragraph><location><page_1><loc_16><loc_56><loc_85><loc_61></location>في النهاية، يمكن أن تؤدي البرمجة بلغة R و Python مع عقلية تحليلية إلى تحسين الإنتاجية وتوفير حلول مبتكرة للمشكلات المعقدة. إن القدرة على تحليل البيانات بشكل فعال وتطبيق الأساليب البرمجية المناسبة يمكن أن تكون له ا تأثيرات إيجابية بعيدة المدى على الأداء الشخصي والمهني .</paragraph>
</document>

File diff suppressed because one or more lines are too long


@ -4,6 +4,6 @@
عندما يجتمع التفكير التحليلي مع مهارات البرمجة، يصبح من الممكن معالجة كميات هائلة من البيانات واستخراج الأنماط والتوجهات منها. يمكن للمبرمجين استخدام R و Python لتنفيذ عمليات تحليلية متقدمة، مثل النمذجة الإحصائية وتحليل البيانات الكبيرة. هذا ليس فقط يوفر الوقت، بل يمكن أن يؤدي أيضًا إلى اتخاذ قرارات أكثر دقة بناء ً على استنتاجات قائمة على البيانات .
علاوة على ذلك، توفر كل من R و Python مكتبات وأدوات غنية تدعم مجموعة واسعة م ن التطبيقات، من التحليل البياني إلى التعلم الآلي. يمكن للمستخدمين الاستفادة من هذه المكتبات لتطوير حلول مبتكرة للمشكلات المختلفة. على سبيل المثال، يمكن استخدام مكتبة pandas في Python لإدارة البيانات بكفاءة، بينما توفر R أدوات قوية للرسم البياني والتحليل الإ حصائي، مما يجعلها مثالية للباحثين والمحللين .
علاوة على ذلك، توفر كل من R و Python مكتبات وأدوات غنية تدعم مجموعة واسعة م ن التطبيقات، من التحليل البياني إلى التعلم الآلي. يمكن للمستخدمين الاستفادة من هذه المكتبات لتطوير حلول مبتكرة للمشكلات المختلفة. على سبيل المثال، يمكن استخدام مكتبة pandas في Python لإدارة البيانات بكفاءة، بينما توفر R أدوات قوية للرسم الإ البياني والتحليل حصائي، مما يجعلها مثالية للباحثين والمحللين .
في النهاية، يمكن أن تؤدي البرمجة بلغة R و Python مع عقلية تحليلية إلى تحسين الإنتاجية وتوفير حلول مبتكرة للمشكلات المعقدة. إن القدرة على تحليل البيانات بشكل فعال وتطبيق الأساليب البرمجية المناسبة يمكن أن تكون له ا تأثيرات إيجابية بعيدة المدى على الأداء الشخصي والمهني .

File diff suppressed because one or more lines are too long


@ -1,9 +1,9 @@
<document>
<paragraph><location><page_1><loc_8><loc_3><loc_10><loc_4></location>11</paragraph>
<paragraph><location><page_1><loc_11><loc_50><loc_73><loc_75></location>وعليه، فإن الحكومة المصرية تضع صوو عييهاوخ لوال المر اوة الم باوة تكايف السيد رئيس الجماورية لاخ بخلعمل ياى تح يق يدد من الأ هودا ياى رعساخ : وضع ماف بهخء الإنسخن المصري ياى رعس قخئموة الأولويو خت، لخصووة فووا مجووخحت الصووحة واللعاوويل، العموول ياووى تح يووق معوودحت نمووو قويوووة ومسووولدامة و وووخماة فوووا عذاوووف ال لخيوووخت، و ووو ا الح وووخ ياوووى محوددات الأمون ال ووما المصوري فوا ضووء اللحوديخت الإقايميوة والدوليو ة، ومواصوواة واووود تلوووير الماووخر ة السيخسووية، واسوولمرار ملخبعووة ما و و خت الأمووون واحسووول رار ومكخفحوووة الإرهوووخ ، تلووووير ما وووخت ال خفوووة والوووويا الوووو،ها، والبلوووخ الوووديها المعلووودل ياوووى الهحوووو الووو ي يرسووو م وووخهيل الموا،هة والسام المجلمعا .</paragraph>
<paragraph><location><page_1><loc_13><loc_45><loc_74><loc_48></location>ووف ًخ لمخ سبق، يسولاد برنوخما الحكوموة المصورية لوال ال لور ( 2024 -2026 ) تح يق عربعة عهدا اسلراتيجية رئيسة، وها ياى الهحو الآتا :</paragraph>
<paragraph><location><page_1><loc_12><loc_37><loc_73><loc_40></location>مايــــــــة اممــــــــن القومي المصـر بنــــاء ا نســــا المصــــــــــــــــــــر بنـــــاء ا تصـــــاع تنابســــــــــــــــــــــي تحقيق اظستق رار السياســــــــــــــــــــــــي</paragraph>
<paragraph><location><page_1><loc_11><loc_23><loc_73><loc_31></location>تجدر الإ خر إلى عنه قد تل تحديد مسولادفخت البرنوخما بخحسولهخ د باوكل رئووويس ياوووى مسووولادفخت ر يوووة مصووور 2023 ، وتوصووويخت واسوووخت الحووووار الوو،ها، ومسولادفخت الوو ارات، والبرنوخما الوو،ها ليصوا خت الايك ايوة، ومبلاف احسلراتيجيخت الو،هية .</paragraph>
<paragraph><location><page_1><loc_12><loc_50><loc_73><loc_75></location>وعليه، اوة الم عييهاوخ لوال المر فإن الحكومة المصرية تضع صوو باوة يق يدد من الأ تكايف السيد رئيس الجماورية لاخ بخلعمل ياى تح هودا ياى رعساخ : وضع ماف بهخء الإنسخن المصري ياى رعس قخئموة الأولويو خت، يووق معوودحت نمووو لخصووة فووا مجووخحت الصووحة واللعاوويل، العموول ياووى تح ياوووى وووخ ا الح ووو لخيوووخت، و وووخماة فوووا عذاوووف ال قويوووة ومسووولدامة و ووما المصوري فوا ضووء اللحوديخت الإقايميوة والدوليو محوددات الأمون ال ة، و و ة السيخسووية، واسوولمرار ملخبعووة ما ومواصوواة واووود تلوووير الماووخر خت خفوووة والوووويا وووخت ال ، تلووووير ما رار ومكخفحوووة الإرهوووخ الأمووون واحسووول وووخهيل م ي يرسووو الووو الوووديها المعلووودل ياوووى الهحوووو الوووو،ها، والبلوووخ الموا،هة والسام المجلمعا .</paragraph>
<paragraph><location><page_1><loc_13><loc_45><loc_74><loc_48></location>لور برنوخما الحكوموة المصورية لوال ال ًخ لمخ سبق، يسولاد ووف ( 2024 -2026 ) اسلراتيجية رئيسة، وها ياى الهحو الآتا يق عربعة عهدا تح :</paragraph>
<paragraph><location><page_1><loc_12><loc_37><loc_73><loc_40></location>مايــــــــة اممــــــــن القومي المصـر نسـ ـــا بنــــاء ا المصـ ـــــــــــــــــــر تصـــــاع بنـــــاء ا تنابســــــــــــــــــــــي تحقيق اظستق رار السياســــــــــــــــــــــــي</paragraph>
<paragraph><location><page_1><loc_12><loc_23><loc_73><loc_31></location>إلى عنه قد تل تحديد مسولادفخت البرنوخما بخحسولهخ خر تجدر الإ د باوكل يوووة مصووور رئووويس ياوووى مسووولادفخت ر 2023 ، وتوصووويخت واسوووخت الحووووار خت الايك الوو،ها، ومسولادفخت الوو ارات، والبرنوخما الوو،ها ليصوا ايوة، ومبلاف احسلراتيجيخت الو،هية .</paragraph>
<figure>
<location><page_1><loc_75><loc_23><loc_100><loc_76></location>
</figure>

File diff suppressed because one or more lines are too long


@ -1,11 +1,11 @@
11
وعليه، فإن الحكومة المصرية تضع صوو عييهاوخ لوال المر اوة الم باوة تكايف السيد رئيس الجماورية لاخ بخلعمل ياى تح يق يدد من الأ هودا ياى رعساخ : وضع ماف بهخء الإنسخن المصري ياى رعس قخئموة الأولويو خت، لخصووة فووا مجووخحت الصووحة واللعاوويل، العموول ياووى تح يووق معوودحت نمووو قويوووة ومسووولدامة و وووخماة فوووا عذاوووف ال لخيوووخت، و ووو ا الح وووخ ياوووى محوددات الأمون ال ووما المصوري فوا ضووء اللحوديخت الإقايميوة والدوليو ة، ومواصوواة واووود تلوووير الماووخر ة السيخسووية، واسوولمرار ملخبعووة ما و و خت الأمووون واحسووول رار ومكخفحوووة الإرهوووخ ، تلووووير ما وووخت ال خفوووة والوووويا الوووو،ها، والبلوووخ الوووديها المعلووودل ياوووى الهحوووو الووو ي يرسووو م وووخهيل الموا،هة والسام المجلمعا .
وعليه، اوة الم عييهاوخ لوال المر فإن الحكومة المصرية تضع صوو باوة يق يدد من الأ تكايف السيد رئيس الجماورية لاخ بخلعمل ياى تح هودا ياى رعساخ : وضع ماف بهخء الإنسخن المصري ياى رعس قخئموة الأولويو خت، يووق معوودحت نمووو لخصووة فووا مجووخحت الصووحة واللعاوويل، العموول ياووى تح ياوووى وووخ ا الح ووو لخيوووخت، و وووخماة فوووا عذاوووف ال قويوووة ومسووولدامة و ووما المصوري فوا ضووء اللحوديخت الإقايميوة والدوليو محوددات الأمون ال ة، و و ة السيخسووية، واسوولمرار ملخبعووة ما ومواصوواة واووود تلوووير الماووخر خت خفوووة والوووويا وووخت ال ، تلووووير ما رار ومكخفحوووة الإرهوووخ الأمووون واحسووول وووخهيل م ي يرسووو الووو الوووديها المعلووودل ياوووى الهحوووو الوووو،ها، والبلوووخ الموا،هة والسام المجلمعا .
ووف ًخ لمخ سبق، يسولاد برنوخما الحكوموة المصورية لوال ال لور ( 2024 -2026 ) تح يق عربعة عهدا اسلراتيجية رئيسة، وها ياى الهحو الآتا :
لور برنوخما الحكوموة المصورية لوال ال ًخ لمخ سبق، يسولاد ووف ( 2024 -2026 ) اسلراتيجية رئيسة، وها ياى الهحو الآتا يق عربعة عهدا تح :
مايــــــــة اممــــــــن القومي المصـر بنــــاء ا نســــا المصــــــــــــــــــــر بنـــــاء ا تصـــــاع تنابســــــــــــــــــــــي تحقيق اظستق رار السياســــــــــــــــــــــــي
مايــــــــة اممــــــــن القومي المصـر نسـ ـــا بنــــاء ا المصـ ـــــــــــــــــــر تصـــــاع بنـــــاء ا تنابســــــــــــــــــــــي تحقيق اظستق رار السياســــــــــــــــــــــــي
تجدر الإ خر إلى عنه قد تل تحديد مسولادفخت البرنوخما بخحسولهخ د باوكل رئووويس ياوووى مسووولادفخت ر يوووة مصووور 2023 ، وتوصووويخت واسوووخت الحووووار الوو،ها، ومسولادفخت الوو ارات، والبرنوخما الوو،ها ليصوا خت الايك ايوة، ومبلاف احسلراتيجيخت الو،هية .
إلى عنه قد تل تحديد مسولادفخت البرنوخما بخحسولهخ خر تجدر الإ د باوكل يوووة مصووور رئووويس ياوووى مسووولادفخت ر 2023 ، وتوصووويخت واسوووخت الحووووار خت الايك الوو،ها، ومسولادفخت الوو ارات، والبرنوخما الوو،ها ليصوا ايوة، ومبلاف احسلراتيجيخت الو،هية .
<!-- image -->

File diff suppressed because one or more lines are too long


@ -5,28 +5,28 @@
</figure>
<subtitle-level-1><location><page_1><loc_63><loc_81><loc_81><loc_84></location>2-5 -استاندارد ک الا</subtitle-level-1>
<paragraph><location><page_1><loc_77><loc_79><loc_87><loc_81></location>نام استاندارد</paragraph>
<paragraph><location><page_1><loc_11><loc_75><loc_44><loc_81></location>شمشه و شمشال توليد شده به روش ريخته گری پيوسته مورد مصرف در فولادهای سازه ای - مطابق آناليز پيوست</paragraph>
<paragraph><location><page_1><loc_12><loc_75><loc_44><loc_81></location>ريخته گری به روش شده توليد و شمشال شمشه پيوسته مورد مصرف سازه ای فولادهای در - مطابق آناليز پيوست</paragraph>
<paragraph><location><page_1><loc_71><loc_72><loc_87><loc_74></location>شماره استاندارد ملی</paragraph>
<paragraph><location><page_1><loc_40><loc_73><loc_45><loc_74></location>20300</paragraph>
<paragraph><location><page_1><loc_68><loc_70><loc_87><loc_72></location>استاندارد اجباری است؟</paragraph>
<paragraph><location><page_1><loc_65><loc_67><loc_87><loc_69></location>مرجع صادرکننده استاندارد</paragraph>
<paragraph><location><page_1><loc_28><loc_67><loc_44><loc_69></location>سازمان ملی استاندارد ايران</paragraph>
<paragraph><location><page_1><loc_49><loc_62><loc_87><loc_66></location>آيا توليدکننده محصول، استاندارد مذکور را اخذ نموده است؟</paragraph>
<paragraph><location><page_1><loc_50><loc_62><loc_87><loc_66></location>آيا توليدکننده محصول، استاندارد مذکور را اخذ نموده است؟</paragraph>
<subtitle-level-1><location><page_1><loc_69><loc_56><loc_85><loc_58></location>3 -پذيرش در بورس</subtitle-level-1>
<paragraph><location><page_1><loc_68><loc_54><loc_83><loc_56></location>تاريخ ارائه مدارک</paragraph>
<paragraph><location><page_1><loc_69><loc_54><loc_83><loc_56></location>تاريخ ارائه مدارک</paragraph>
<paragraph><location><page_1><loc_23><loc_54><loc_32><loc_56></location>19 / 09 / 1403</paragraph>
<paragraph><location><page_1><loc_72><loc_51><loc_83><loc_53></location>تاريخ پذيرش</paragraph>
<paragraph><location><page_1><loc_23><loc_51><loc_32><loc_53></location>04 / 10 / 1403</paragraph>
<paragraph><location><page_1><loc_62><loc_48><loc_83><loc_50></location>شماره جلسه کميته عرضه</paragraph>
<paragraph><location><page_1><loc_26><loc_49><loc_29><loc_50></location>436</paragraph>
<paragraph><location><page_1><loc_67><loc_45><loc_83><loc_47></location>تاريخ درج اميدنامه</paragraph>
<paragraph><location><page_1><loc_68><loc_45><loc_83><loc_47></location>تاريخ درج اميدنامه</paragraph>
<paragraph><location><page_1><loc_23><loc_46><loc_32><loc_48></location>05 / 10 / 1403</paragraph>
<paragraph><location><page_1><loc_71><loc_43><loc_83><loc_45></location>مشاور پذيرش</paragraph>
<paragraph><location><page_1><loc_72><loc_43><loc_83><loc_45></location>مشاور پذيرش</paragraph>
<paragraph><location><page_1><loc_21><loc_43><loc_34><loc_45></location>کارگزاری آ رمون بورس</paragraph>
<paragraph><location><page_1><loc_47><loc_37><loc_83><loc_42></location>نحوة تعيين قيمت پايه پس از پذيرش کالا در بورس</paragraph>
<paragraph><location><page_1><loc_18><loc_40><loc_36><loc_42></location>بر اساس قيمت های جهانی</paragraph>
<paragraph><location><page_1><loc_48><loc_37><loc_83><loc_42></location>نحوة تعيين قيمت پايهپس از پذيرش کالا در بورس</paragraph>
<paragraph><location><page_1><loc_19><loc_40><loc_36><loc_42></location>بر اساس قيمت های جهانی</paragraph>
<paragraph><location><page_1><loc_45><loc_32><loc_83><loc_37></location>حداقل درصد عرضه از توليد / کل فروش / فروش داخلی</paragraph>
<paragraph><location><page_1><loc_14><loc_35><loc_40><loc_37></location>حداقل 50 % از توليد ساليانه يا 47.500 تن</paragraph>
<paragraph><location><page_1><loc_14><loc_35><loc_40><loc_37></location>حداقل 50 % يا از توليد ساليانه 47.500 تن</paragraph>
<paragraph><location><page_1><loc_68><loc_29><loc_83><loc_31></location>خطای مجاز تحويل</paragraph>
<paragraph><location><page_1><loc_18><loc_30><loc_37><loc_31></location>5% آخرين محموله قابل تحويل</paragraph>
</document>

File diff suppressed because one or more lines are too long


@ -6,7 +6,7 @@
نام استاندارد
شمشه و شمشال توليد شده به روش ريخته گری پيوسته مورد مصرف در فولادهای سازه ای - مطابق آناليز پيوست
ريخته گری به روش شده توليد و شمشال شمشه پيوسته مورد مصرف سازه ای فولادهای در - مطابق آناليز پيوست
شماره استاندارد ملی
@ -48,7 +48,7 @@
حداقل درصد عرضه از توليد / کل فروش / فروش داخلی
حداقل 50 % از توليد ساليانه يا 47.500 تن
حداقل 50 % يا از توليد ساليانه 47.500 تن
خطای مجاز تحويل

File diff suppressed because one or more lines are too long


@ -1,4 +1,5 @@
<doctag><page_header><loc_15><loc_101><loc_30><loc_354>arXiv:2203.01017v2 [cs.CV] 11 Mar 2022</page_header>
<doctag><page_header><loc_15><loc_133><loc_30><loc_354>arXiv:2203.01017v2 [cs.CV] 11 Mar</page_header>
<text><loc_15><loc_101><loc_30><loc_126>2022</text>
<section_header_level_1><loc_79><loc_68><loc_408><loc_76>TableFormer: Table Structure Understanding with Transformers.</section_header_level_1>
<section_header_level_1><loc_116><loc_93><loc_370><loc_108>Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar IBM Research</section_header_level_1>
<text><loc_170><loc_111><loc_309><loc_116>{ ahn,nli,mly,taa @zurich.ibm.com }</text>
@ -130,16 +131,16 @@
<list_item><loc_57><loc_48><loc_234><loc_74>end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 , pages 213-229, Cham, 2020. Springer International Publishing. 5</list_item>
<list_item><loc_45><loc_76><loc_234><loc_95>[2] Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling Mao. Complicated table structure recognition. arXiv preprint arXiv:1908.04729 , 2019. 3</list_item>
<list_item><loc_45><loc_97><loc_234><loc_116>[3] Bertrand Couasnon and Aurelie Lemaitre. Recognition of Tables and Forms , pages 647-677. Springer London, London, 2014. 2</list_item>
<list_item><loc_45><loc_118><loc_234><loc_143>[4] Herv´ e D´ ejean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2</list_item>
<list_item><loc_45><loc_118><loc_234><loc_143>[4] Herv´ e D´jean, e Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2</list_item>
<list_item><loc_45><loc_146><loc_234><loc_171>[5] Basilios Gatos, Dimitrios Danatsas, Ioannis Pratikakis, and Stavros J Perantonis. Automatic table detection in document images. In International Conference on Pattern Recognition and Image Analysis , pages 609-618. Springer, 2005. 2</list_item>
<list_item><loc_45><loc_174><loc_234><loc_199>[6] MaxG¨ obel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2</list_item>
<list_item><loc_45><loc_174><loc_234><loc_199>[6] MaxG¨bel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. o Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2</list_item>
<list_item><loc_45><loc_201><loc_234><loc_220>[7] EA Green and M Krishnamoorthy. Recognition of tables using table grammars. procs. In Symposium on Document Analysis and Recognition (SDAIR'95) , pages 261-277. 2</list_item>
<list_item><loc_45><loc_222><loc_234><loc_255>[8] Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal. Castabdetectors: Cascade network for table detection in document images with recursive feature pyramid and switchable atrous convolution. Journal of Imaging , 7(10), 2021. 1</list_item>
<list_item><loc_45><loc_257><loc_234><loc_276>[9] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) , Oct 2017. 1</list_item>
<list_item><loc_41><loc_278><loc_234><loc_304>[10] Yelin He, X. Qi, Jiaquan Ye, Peng Gao, Yihao Chen, Bingcong Li, Xin Tang, and Rong Xiao. Pingan-vcgroup's solution for icdar 2021 competition on scientific table image recognition to latex. ArXiv , abs/2105.01846, 2021. 2</list_item>
<list_item><loc_41><loc_306><loc_234><loc_339>[11] Jianying Hu, Ramanujan S Kashi, Daniel P Lopresti, and Gordon Wilfong. Medium-independent table detection. In Document Recognition and Retrieval VII , volume 3967, pages 291-302. International Society for Optics and Photonics, 1999. 2</list_item>
<list_item><loc_41><loc_341><loc_234><loc_373>[12] Matthew Hurst. A constraint-based approach to table structure derivation. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2 , ICDAR '03, page 911, USA, 2003. IEEE Computer Society. 2</list_item>
<list_item><loc_41><loc_376><loc_234><loc_408>[13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ ement Chatelain, and Thierry Paquet. Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2</list_item>
<list_item><loc_41><loc_376><loc_234><loc_408>[13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ment Chatelain, and Thierry Paquet. e Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2</list_item>
<list_item><loc_41><loc_410><loc_234><loc_429>[14] Pratik Kayal, Mrinal Anand, Harsh Desai, and Mayank Singh. Icdar 2021 competition on scientific table image recognition to latex, 2021. 2</list_item>
<list_item><loc_41><loc_431><loc_234><loc_450>[15] Harold WKuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6</list_item>
<list_item><loc_252><loc_48><loc_445><loc_88>[16] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence , 35(12):2891-2903, 2013. 4</list_item>
@ -148,7 +149,7 @@
<list_item><loc_252><loc_167><loc_445><loc_206>[19] Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, and Peter Staar. Robust pdf document conversion using recurrent neural networks. Proceedings of the AAAI Conference on Artificial Intelligence , 35(17):15137-15145, May 2021. 1</list_item>
<list_item><loc_252><loc_208><loc_445><loc_234>[20] Rujiao Long, Wen Wang, Nan Xue, Feiyu Gao, Zhibo Yang, Yongpan Wang, and Gui-Song Xia. Parsing table structures in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 944-952, 2021. 2</list_item>
<list_item><loc_252><loc_236><loc_445><loc_276>[21] Shubham Singh Paliwal, D Vishwanath, Rohit Rahul, Monika Sharma, and Lovekesh Vig. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 128-133. IEEE, 2019. 1</list_item>
<list_item><loc_252><loc_278><loc_445><loc_352>[22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6</list_item>
<list_item><loc_252><loc_278><loc_445><loc_352>[22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´-Buc, E. e Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6</list_item>
<list_item><loc_252><loc_355><loc_445><loc_394>[23] Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. Cascadetabnet: An approach for end to end table detection and structure recognition from image-based documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages 572-573, 2020. 1</list_item>
<list_item><loc_252><loc_396><loc_445><loc_422>[24] Shah Rukh Qasim, Hassan Mahmood, and Faisal Shafait. Rethinking table recognition using graph neural networks. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 142-147. IEEE, 2019. 3</list_item>
<list_item><loc_252><loc_424><loc_445><loc_450>[25] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on</list_item>

File diff suppressed because one or more lines are too long

@ -1,3 +1,5 @@
2022
## TableFormer: Table Structure Understanding with Transformers.
## Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar IBM Research
@ -282,16 +284,16 @@ In this paper, we presented TableFormer an end-to-end transformer based approach
- end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 , pages 213-229, Cham, 2020. Springer International Publishing. 5
- [2] Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling Mao. Complicated table structure recognition. arXiv preprint arXiv:1908.04729 , 2019. 3
- [3] Bertrand Couasnon and Aurelie Lemaitre. Recognition of Tables and Forms , pages 647-677. Springer London, London, 2014. 2
- [4] Herv´ e D´ ejean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2
- [4] Herv´ e D´jean, e Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2
- [5] Basilios Gatos, Dimitrios Danatsas, Ioannis Pratikakis, and Stavros J Perantonis. Automatic table detection in document images. In International Conference on Pattern Recognition and Image Analysis , pages 609-618. Springer, 2005. 2
- [6] MaxG¨ obel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2
- [6] MaxG¨bel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. o Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013. 2
- [7] EA Green and M Krishnamoorthy. Recognition of tables using table grammars. procs. In Symposium on Document Analysis and Recognition (SDAIR'95) , pages 261-277. 2
- [8] Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal. Castabdetectors: Cascade network for table detection in document images with recursive feature pyramid and switchable atrous convolution. Journal of Imaging , 7(10), 2021. 1
- [9] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) , Oct 2017. 1
- [10] Yelin He, X. Qi, Jiaquan Ye, Peng Gao, Yihao Chen, Bingcong Li, Xin Tang, and Rong Xiao. Pingan-vcgroup's solution for icdar 2021 competition on scientific table image recognition to latex. ArXiv , abs/2105.01846, 2021. 2
- [11] Jianying Hu, Ramanujan S Kashi, Daniel P Lopresti, and Gordon Wilfong. Medium-independent table detection. In Document Recognition and Retrieval VII , volume 3967, pages 291-302. International Society for Optics and Photonics, 1999. 2
- [12] Matthew Hurst. A constraint-based approach to table structure derivation. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2 , ICDAR '03, page 911, USA, 2003. IEEE Computer Society. 2
- [13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ ement Chatelain, and Thierry Paquet. Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2
- [13] Thotreingam Kasar, Philippine Barlas, Sebastien Adam, Cl´ment Chatelain, and Thierry Paquet. e Learning to detect tables in scanned document images using line information. In 2013 12th International Conference on Document Analysis and Recognition , pages 1185-1189. IEEE, 2013. 2
- [14] Pratik Kayal, Mrinal Anand, Harsh Desai, and Mayank Singh. Icdar 2021 competition on scientific table image recognition to latex, 2021. 2
- [15] Harold WKuhn. The hungarian method for the assignment problem. Naval research logistics quarterly , 2(1-2):83-97, 1955. 6
- [16] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence , 35(12):2891-2903, 2013. 4
@ -300,7 +302,7 @@ In this paper, we presented TableFormer an end-to-end transformer based approach
- [19] Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, and Peter Staar. Robust pdf document conversion using recurrent neural networks. Proceedings of the AAAI Conference on Artificial Intelligence , 35(17):15137-15145, May 2021. 1
- [20] Rujiao Long, Wen Wang, Nan Xue, Feiyu Gao, Zhibo Yang, Yongpan Wang, and Gui-Song Xia. Parsing table structures in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 944-952, 2021. 2
- [21] Shubham Singh Paliwal, D Vishwanath, Rohit Rahul, Monika Sharma, and Lovekesh Vig. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 128-133. IEEE, 2019. 1
- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6
- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´-Buc, E. e Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32 , pages 8024-8035. Curran Associates, Inc., 2019. 6
- [23] Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. Cascadetabnet: An approach for end to end table detection and structure recognition from image-based documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages 572-573, 2020. 1
- [24] Shah Rukh Qasim, Hassan Mahmood, and Faisal Shafait. Rethinking table recognition using graph neural networks. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 142-147. IEEE, 2019. 3
- [25] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on

File diff suppressed because one or more lines are too long

@ -1,4 +1,5 @@
<doctag><page_header><loc_15><loc_104><loc_30><loc_350>arXiv:2206.01062v1 [cs.CV] 2 Jun 2022</page_header>
<doctag><page_header><loc_15><loc_136><loc_30><loc_350>arXiv:2206.01062v1 [cs.CV] 2 Jun</page_header>
<text><loc_15><loc_104><loc_30><loc_129>2022</text>
<section_header_level_1><loc_88><loc_53><loc_413><loc_75>DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis</section_header_level_1>
<text><loc_74><loc_85><loc_158><loc_114>Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com</text>
<text><loc_208><loc_85><loc_292><loc_114>Christoph Auer IBM Research Rueschlikon, Switzerland cau@zurich.ibm.com</text>
@ -23,7 +24,7 @@
<page_header><loc_44><loc_38><loc_456><loc_43>KDD '22, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar</page_header>
<section_header_level_1><loc_44><loc_55><loc_128><loc_61>1 INTRODUCTION</section_header_level_1>
<text><loc_44><loc_70><loc_248><loc_144>Despite the substantial improvements achieved with machine-learning (ML) approaches and deep neural networks in recent years, document conversion remains a challenging problem, as demonstrated by the numerous public competitions held on this topic [1-4]. The challenge originates from the huge variability in PDF documents regarding layout, language and formats (scanned, programmatic or a combination of both). Engineering a single ML model that can be applied on all types of documents and provides high-quality layout segmentation remains to this day extremely challenging [5]. To highlight the variability in document layouts, we show a few example documents from the DocLayNet dataset in Figure 1.</text>
<text><loc_44><loc_146><loc_241><loc_317>Akeyproblem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or L A T E X sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank is very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.</text>
<text><loc_44><loc_146><loc_241><loc_317>Akeyproblem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or LT E X A sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank is very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.</text>
<text><loc_44><loc_319><loc_241><loc_366>In this paper, we present the DocLayNet dataset. It provides pageby-page layout annotation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique document pages, of which a fraction carry double- or triple-annotations. DocLayNet is similar in spirit to PubLayNet and DocBank and will likewise be made available to the public 1 in order to stimulate the document-layout analysis community. It distinguishes itself in the following aspects:</text>
<unordered_list><list_item><loc_53><loc_369><loc_241><loc_388>(1) Human Annotation : In contrast to PubLayNet and DocBank, we relied on human annotation instead of automation approaches to generate the data set.</list_item>
<list_item><loc_53><loc_390><loc_240><loc_402>(2) Large Layout Variability : We include diverse and complex layouts from a large variety of public sources.</list_item>
@ -110,7 +111,7 @@
<otsl><loc_288><loc_95><loc_427><loc_193><fcel>Class-count<ched>11<lcel><ched>5<lcel><nl><fcel>Split<ched>Doc<ched>Page<ched>Doc<ched>Page<nl><rhed>Caption<fcel>68<fcel>83<ecel><ecel><nl><rhed>Footnote<fcel>71<fcel>84<ecel><ecel><nl><rhed>Formula<fcel>60<fcel>66<ecel><ecel><nl><rhed>List-item<fcel>81<fcel>88<fcel>82<fcel>88<nl><rhed>Page-footer<fcel>62<fcel>89<ecel><ecel><nl><rhed>Page-header<fcel>72<fcel>90<ecel><ecel><nl><rhed>Picture<fcel>72<fcel>82<fcel>72<fcel>82<nl><rhed>Section-header<fcel>68<fcel>83<fcel>69<fcel>83<nl><rhed>Table<fcel>82<fcel>89<fcel>82<fcel>90<nl><rhed>Text<fcel>85<fcel>91<fcel>84<fcel>90<nl><rhed>Title<fcel>77<fcel>81<ecel><ecel><nl><rhed>All<fcel>72<fcel>84<fcel>78<fcel>87<nl></otsl>
<text><loc_260><loc_209><loc_457><loc_263>lists in PubLayNet (grouped list-items) versus DocLayNet (separate list-items), the label set of size 4 is the closest to PubLayNet, in the assumption that the List is down-mapped to Text in PubLayNet. The results in Table 3 show that the prediction accuracy on the remaining class labels does not change significantly when other classes are merged into them. The overall macro-average improves by around 5%, in particular when Page-footer and Page-header are excluded.</text>
<section_header_level_1><loc_260><loc_272><loc_449><loc_277>Impact of Document Split in Train and Test Set</section_header_level_1>
<text><loc_259><loc_281><loc_457><loc_376>Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: pagewise splitting gains ˜ 0% in mAP over the document-wise splitting. 1 Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.</text>
<text><loc_259><loc_281><loc_457><loc_376>Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: pagewise splitting gains ˜ 10% in mAP over the document-wise splitting. Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.</text>
<section_header_level_1><loc_260><loc_385><loc_342><loc_390>Dataset Comparison</section_header_level_1>
<text><loc_260><loc_394><loc_457><loc_447>Throughout this paper, we claim that DocLayNet's wider variety of document layouts leads to more robust layout detection models. In Table 5, we provide evidence for that. We trained models on each of the available datasets (PubLayNet, DocBank and DocLayNet) and evaluated them on the test sets of the other datasets. Due to the different label sets and annotation styles, a direct comparison is not possible. Hence, we focussed on the common labels among the datasets. Between PubLayNet and DocLayNet, these are Picture ,</text>
<page_break>

File diff suppressed because one or more lines are too long

@ -1,3 +1,5 @@
2022
## DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com
@ -44,7 +46,7 @@ Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staa
Despite the substantial improvements achieved with machine-learning (ML) approaches and deep neural networks in recent years, document conversion remains a challenging problem, as demonstrated by the numerous public competitions held on this topic [1-4]. The challenge originates from the huge variability in PDF documents regarding layout, language and formats (scanned, programmatic or a combination of both). Engineering a single ML model that can be applied on all types of documents and provides high-quality layout segmentation remains to this day extremely challenging [5]. To highlight the variability in document layouts, we show a few example documents from the DocLayNet dataset in Figure 1.
Akeyproblem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or L A T E X sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank is very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.
Akeyproblem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or LT E X A sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank is very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.
In this paper, we present the DocLayNet dataset. It provides pageby-page layout annotation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique document pages, of which a fraction carry double- or triple-annotations. DocLayNet is similar in spirit to PubLayNet and DocBank and will likewise be made available to the public 1 in order to stimulate the document-layout analysis community. It distinguishes itself in the following aspects:
@ -98,7 +100,7 @@ The annotation campaign was carried out in four phases. In phase one, we identif
Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row 'Total') in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.
| | | % of Total | % of Total | % of Total | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) | triple inter-annotator mAP @0.5-0.95 (%) |
|----------------|---------|--------------|--------------|--------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
|----------------|---------|--------------|--------------|--------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|--------------------------------------------|
| class label | Count | Train | Test | Val | All | Fin | Man | Sci | Law | Pat | Ten |
| Caption | 22524 | 2.04 | 1.77 | 2.32 | 84-89 | 40-61 | 86-92 | 94-99 | 95-99 | 69-78 | n/a |
| Footnote | 6318 | 0.60 | 0.31 | 0.58 | 83-91 | n/a | 100 | 62-88 | 85-94 | n/a | 82-97 |
@ -233,7 +235,7 @@ lists in PubLayNet (grouped list-items) versus DocLayNet (separate list-items),
## Impact of Document Split in Train and Test Set
Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: pagewise splitting gains ˜ 0% in mAP over the document-wise splitting. 1 Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.
Many documents in DocLayNet have a unique styling. In order to avoid overfitting on a particular style, we have split the train-, test- and validation-sets of DocLayNet on document boundaries, i.e. every document contributes pages to only one set. To the best of our knowledge, this was not considered in PubLayNet or DocBank. To quantify how this affects model performance, we trained and evaluated a Mask R-CNN R50 model on a modified dataset version. Here, the train-, test- and validation-sets were obtained by a randomised draw over the individual pages. As can be seen in Table 4, the difference in model performance is surprisingly large: pagewise splitting gains ˜ 10% in mAP over the document-wise splitting. Thus, random page-wise splitting of DocLayNet can easily lead to accidental overestimation of model performance and should be avoided.
## Dataset Comparison

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

@ -1,4 +1,5 @@
<doctag><page_header><loc_15><loc_104><loc_30><loc_350>arXiv:2305.03393v1 [cs.CV] 5 May 2023</page_header>
<doctag><page_header><loc_15><loc_136><loc_30><loc_350>arXiv:2305.03393v1 [cs.CV] 5 May</page_header>
<text><loc_15><loc_104><loc_30><loc_129>2023</text>
<section_header_level_1><loc_110><loc_73><loc_393><loc_92>Optimized Table Tokenization for Table Structure Recognition</section_header_level_1>
<text><loc_114><loc_107><loc_389><loc_126>Maksym Lysak [0000 - 0002 - 3723 - 6960] , Ahmed Nassar [0000 - 0002 - 9468 - 0822] , Nikolaos Livathinos [0000 - 0001 - 8513 - 3491] , Christoph Auer [0000 - 0001 - 5761 - 0422] , [0000 - 0002 - 8088 - 0823]</text>
<text><loc_188><loc_123><loc_244><loc_129>and Peter Staar</text>

File diff suppressed because one or more lines are too long

@ -1,3 +1,5 @@
2023
## Optimized Table Tokenization for Table Structure Recognition
Maksym Lysak [0000 - 0002 - 3723 - 6960] , Ahmed Nassar [0000 - 0002 - 9468 - 0822] , Nikolaos Livathinos [0000 - 0001 - 8513 - 3491] , Christoph Auer [0000 - 0001 - 5761 - 0422] , [0000 - 0002 - 8088 - 0823]

File diff suppressed because one or more lines are too long

@ -1,17 +1,17 @@
<doctag><text><loc_61><loc_30><loc_262><loc_59>pulleys, provided the inner race of the bearing is clamped to the supporting structure by the nut and bolt. Plates must be attached to the structure in a positive manner to eliminate rotation or misalignment when tightening the bolts or screws.</text>
<text><loc_61><loc_70><loc_262><loc_116>The two general types of self-locking nuts currently in use are the all-metal type and the fiber lock type. For the sake of simplicity, only three typical kinds of self-locking nuts are considered in this handbook: the Boots self-locking and the stainless steel self-locking nuts, representing the all-metal types; and the elastic stop nut, representing the fiber insert type.</text>
<section_header_level_1><loc_61><loc_127><loc_141><loc_132>Boots Self-Locking Nut</section_header_level_1>
<text><loc_61><loc_136><loc_262><loc_182>nut is of one piece, all-metal The Boots self-locking construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.</text>
<text><loc_61><loc_193><loc_262><loc_238>The spring keeps the locking and load-carrying sections such a distance apart that the two sets of threads are out of phase or spaced so that a bolt, which has been screwed through the load-carrying section, must push the locking section outward against the force of the spring to engage the threads of the locking section properly.</text>
<text><loc_61><loc_249><loc_262><loc_311>The spring, through the medium of the locking section, exerts a constant locking force on the bolt in the same direction as a force that would tighten the nut. In this nut, the load-carrying section has the thread strength of a standard nut of comparable size, while the locking section presses against the threads of the bolt and locks the nut firmly in position. Only a wrench applied to the nut loosens it. The nut can be removed and reused without impairing its efficiency.</text>
<text><loc_61><loc_322><loc_262><loc_335>Boots self-locking nuts are made with three different spring styles and in various shapes and sizes. The wing type that is</text>
<picture><loc_59><loc_343><loc_261><loc_449><caption><loc_61><loc_455><loc_155><loc_460>Figure 7-26. Self-locking nuts.</caption></picture>
<text><loc_270><loc_30><loc_472><loc_76>the most common ranges in size for No. 6 up to 1 4 inch, the / Rol-top ranges from 1 4 inch to / 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.</text>
<text><loc_270><loc_78><loc_274><loc_84>.</text>
<section_header_level_1><loc_270><loc_86><loc_380><loc_92>Stainless Steel Self-Locking Nut</section_header_level_1>
<text><loc_270><loc_96><loc_472><loc_230>The stainless steel self-locking nut may be spun on and off by hand as its locking action takes places only when the nut is seated against a solid surface and tightened. The nut consists of two parts: a case with a beveled locking shoulder and key and a thread insert with a locking shoulder and slotted keyway. Until the nut is tightened, it spins on the bolt easily, because the threaded insert is the proper size for the bolt. However, when the nut is seated against a solid surface and tightened, the locking shoulder of the insert is pulled downward and wedged against the locking shoulder of the case. This action compresses the threaded insert and causes it to clench the bolt tightly. The cross-sectional view in Figure 7-27 shows how the key of the case fits into the slotted keyway of the insert so that when the case is turned, the threaded insert is turned with it. Note that the slot is wider than the key. This permits the slot to be narrowed and the insert to be compressed when the nut is tightened.</text>
<section_header_level_1><loc_270><loc_241><loc_327><loc_246>Elastic Stop Nut</section_header_level_1>
<doctag><text><loc_61><loc_30><loc_260><loc_59>pulleys, provided the inner race of the bearing is clamped to the supporting structure by the nut and bolt. Plates must be attached to the structure in a positive manner to eliminate rotation or misalignment when tightening the bolts or screws.</text>
<text><loc_61><loc_70><loc_260><loc_116>The two general types of self-locking nuts currently in use are the all-metal type and the fiber lock type. For the sake of simplicity, only three typical kinds of self-locking nuts are considered in this handbook: the Boots self-locking and the stainless steel self-locking nuts, representing the all-metal types; andthe elastic stop nut, representing the fiber insert type.</text>
<section_header_level_1><loc_61><loc_127><loc_139><loc_132>Boots Self-Locking Nut</section_header_level_1>
<text><loc_61><loc_136><loc_260><loc_182>The Boots self-locking nut is of one piece, all-metal construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.</text>
<text><loc_61><loc_193><loc_260><loc_238>The spring keeps the locking and load-carrying sections such a distance apart that the two sets of threads are out of phase or spaced so that a bolt, which has been screwed through the load-carrying section, must push the locking section outward against the force of the spring to engage the threads of the locking section properly.</text>
<text><loc_61><loc_249><loc_260><loc_311>The spring, through the mediumofthe locking section, exerts a constant locking force on the bolt in the same direction as a force that would tighten the nut. In this nut, the load-carrying section has the thread strength of a standard nut of comparable size, while the locking section presses against the threads of the bolt and locks the nut firmly in position. Only a wrench applied to the nut loosens it. The nut can be removed and reused without impairing its efficiency.</text>
<text><loc_61><loc_322><loc_260><loc_335>Boots self-locking nuts are made with three different spring styles and in various shapes and sizes. The wing type that is</text>
<picture><loc_59><loc_343><loc_261><loc_449><caption><loc_61><loc_455><loc_153><loc_460>Figure 7-26. Self-locking nuts.</caption></picture>
<text><loc_270><loc_30><loc_470><loc_76>the most common ranges in size for No. 6 up to 1 / 4 inch, the Rol-top ranges from 1 / 4 inch to 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.</text>
<text><loc_270><loc_78><loc_272><loc_84>.</text>
<section_header_level_1><loc_270><loc_86><loc_378><loc_92>Stainless Steel Self-Locking Nut</section_header_level_1>
<text><loc_270><loc_96><loc_470><loc_230>The stainless steel self-locking nut may be spun on and off by hand as its locking action takes places only when the nut is seated against a solid surface and tightened. The nut consists of two parts: a case with a beveled locking shoulder and key and a thread insert with a locking shoulder and slotted keyway. Until the nut is tightened, it spins on the bolt easily, because the threaded insert is the proper size for the bolt. However, when the nut is seated against a solid surface and tightened, the locking shoulder of the insert is pulled downward and wedged against the locking shoulder of the case. This action compresses the threaded insert and causes it to clench the bolt tightly. The cross-sectional view in Figure 7-27 shows how the key of the case fits into the slotted keyway of the insert so that when the case is turned, the threaded insert is turned with it. Note that the slot is wider than the key. This permits the slot to be narrowed and the insert to be compressed when the nut is tightened.</text>
<section_header_level_1><loc_270><loc_241><loc_325><loc_246>Elastic Stop Nut</section_header_level_1>
<text><loc_270><loc_250><loc_470><loc_264>The elastic stop nut is a standard nut with the height increased to accommodate a fiber locking collar. This</text>
<picture><loc_270><loc_272><loc_470><loc_447><caption><loc_270><loc_454><loc_405><loc_459>Figure 7-27. Stainless steel self-locking nut.</caption></picture>
<page_footer><loc_453><loc_472><loc_472><loc_478>7-45</page_footer>
<picture><loc_270><loc_272><loc_470><loc_447><caption><loc_270><loc_454><loc_404><loc_459>Figure 7-27. Stainless steel self-locking nut.</caption></picture>
<page_footer><loc_453><loc_472><loc_470><loc_478>7-45</page_footer>
</doctag>

File diff suppressed because one or more lines are too long

Some files were not shown because too many files have changed in this diff.