3.1 KiB
Vendored
Reading Order Patch Documentation
Overview
This document explains the monkey patch applied to fix a KeyError in the docling_ibm_models.reading_order.reading_order_rb module.
Problem Description
Users encountered a KeyError when converting certain PDF files using docling. The error occurred in the reading order prediction phase:
KeyError: 22
File "docling_ibm_models/reading_order/reading_order_rb.py", line 366, in _init_ud_maps
self.dn_map[i].append(j)
The error number (22 in this example) varied depending on the PDF being processed.
Root Cause
The _init_ud_maps method in docling_ibm_models.reading_order.reading_order_rb performs the following operations:
- Initializes
dn_mapandup_mapdictionaries for all page elements (indices 0 to N-1) - Iterates through elements to build spatial relationships
- When processing element j, it may follow a left-to-right mapping chain:
while i in state.l2r_map: i = state.l2r_map[i] - After following the chain, it accesses
state.dn_map[i]
The KeyError occurs when:
- The final value of
iafter following thel2r_mapchain doesn't exist indn_map - This can happen when maps are reinitialized with different element lists
- Or when invalid mappings exist in
r2l_map
Solution
A monkey patch was created in /docling/models/_reading_order_patch.py that:
- Defensive checks for map access: Before accessing
dn_map[i]orup_map[j], the patch verifies the key exists - Infinite loop prevention: Tracks the original value of
iand breaks if a circular reference is detected - Graceful degradation: Silently skips invalid mappings instead of crashing
Patch Application
The patch is automatically applied when the readingorder_model module is imported:
from docling.models import _reading_order_patch
_reading_order_patch.apply_patch()
This ensures the fix is transparent to users and doesn't require code changes.
Testing
Comprehensive tests were added in tests/test_reading_order_patch.py to verify:
- Patch application: Confirms the monkey patch is correctly applied
- Basic functionality: Verifies reading order model can be initialized
- Defensive checks: Tests that edge cases don't raise KeyError
- l2r_map chains: Validates proper handling of left-to-right mapping chains
- Invalid mappings: Ensures invalid
r2l_mapreferences are handled gracefully
All tests pass successfully.
Impact
- Minimal changes: Only adds a monkey patch, no modifications to core docling code
- Backward compatible: Existing functionality is preserved
- Transparent: Applied automatically, no user action required
- Safe: Adds defensive checks without modifying core logic
Future Work
This is a temporary workaround. The proper fix should be submitted to the upstream docling-ibm-models repository. Once the upstream fix is released and the dependency is updated, this monkey patch can be removed.
References
- Issue: [Link to GitHub issue]
- External package:
docling-ibm-models(version 3.10.2) - Affected file:
docling_ibm_models/reading_order/reading_order_rb.py