Files
docling/docs/reading_order_patch.md
2025-11-12 13:07:43 +00:00

3.1 KiB
Vendored

Reading Order Patch Documentation

Overview

This document explains the monkey patch applied to fix a KeyError in the docling_ibm_models.reading_order.reading_order_rb module.

Problem Description

Users encountered a KeyError when converting certain PDF files using docling. The error occurred in the reading order prediction phase:

KeyError: 22
  File "docling_ibm_models/reading_order/reading_order_rb.py", line 366, in _init_ud_maps
    self.dn_map[i].append(j)

The error number (22 in this example) varied depending on the PDF being processed.

Root Cause

The _init_ud_maps method in docling_ibm_models.reading_order.reading_order_rb performs the following operations:

  1. Initializes dn_map and up_map dictionaries for all page elements (indices 0 to N-1)
  2. Iterates through elements to build spatial relationships
  3. When processing element j, it may follow a left-to-right mapping chain:
    while i in state.l2r_map:
        i = state.l2r_map[i]
    
  4. After following the chain, it accesses state.dn_map[i]

The KeyError occurs when:

  • The final value of i after following the l2r_map chain doesn't exist in dn_map
  • This can happen when maps are reinitialized with different element lists
  • Or when invalid mappings exist in r2l_map

Solution

A monkey patch was created in /docling/models/_reading_order_patch.py that:

  1. Defensive checks for map access: Before accessing dn_map[i] or up_map[j], the patch verifies the key exists
  2. Infinite loop prevention: Tracks the original value of i and breaks if a circular reference is detected
  3. Graceful degradation: Silently skips invalid mappings instead of crashing

Patch Application

The patch is automatically applied when the readingorder_model module is imported:

from docling.models import _reading_order_patch
_reading_order_patch.apply_patch()

This ensures the fix is transparent to users and doesn't require code changes.

Testing

Comprehensive tests were added in tests/test_reading_order_patch.py to verify:

  1. Patch application: Confirms the monkey patch is correctly applied
  2. Basic functionality: Verifies reading order model can be initialized
  3. Defensive checks: Tests that edge cases don't raise KeyError
  4. l2r_map chains: Validates proper handling of left-to-right mapping chains
  5. Invalid mappings: Ensures invalid r2l_map references are handled gracefully

All tests pass successfully.

Impact

  • Minimal changes: Only adds a monkey patch, no modifications to core docling code
  • Backward compatible: Existing functionality is preserved
  • Transparent: Applied automatically, no user action required
  • Safe: Adds defensive checks without modifying core logic

Future Work

This is a temporary workaround. The proper fix should be submitted to the upstream docling-ibm-models repository. Once the upstream fix is released and the dependency is updated, this monkey patch can be removed.

References

  • Issue: [Link to GitHub issue]
  • External package: docling-ibm-models (version 3.10.2)
  • Affected file: docling_ibm_models/reading_order/reading_order_rb.py