Files
docling/docs/reading_order_patch.md
2025-11-12 13:07:43 +00:00

84 lines
3.1 KiB
Markdown
Vendored

# Reading Order Patch Documentation
## Overview
This document explains the monkey patch applied to fix a KeyError in the `docling_ibm_models.reading_order.reading_order_rb` module.
## Problem Description
Users encountered a `KeyError` when converting certain PDF files using docling. The error occurred in the reading order prediction phase:
```
KeyError: 22
File "docling_ibm_models/reading_order/reading_order_rb.py", line 366, in _init_ud_maps
self.dn_map[i].append(j)
```
The error number (22 in this example) varied depending on the PDF being processed.
## Root Cause
The `_init_ud_maps` method in `docling_ibm_models.reading_order.reading_order_rb` performs the following operations:
1. Initializes `dn_map` and `up_map` dictionaries for all page elements (indices 0 to N-1)
2. Iterates through elements to build spatial relationships
3. When processing element j, it may follow a left-to-right mapping chain:
```python
while i in state.l2r_map:
i = state.l2r_map[i]
```
4. After following the chain, it accesses `state.dn_map[i]`
The KeyError occurs when:
- The final value of `i` after following the `l2r_map` chain doesn't exist in `dn_map`
- This can happen when maps are reinitialized with different element lists
- Or when invalid mappings exist in `r2l_map`
## Solution
A monkey patch was created in `/docling/models/_reading_order_patch.py` that:
1. **Defensive checks for map access**: Before accessing `dn_map[i]` or `up_map[j]`, the patch verifies the key exists
2. **Infinite loop prevention**: Tracks the original value of `i` and breaks if a circular reference is detected
3. **Graceful degradation**: Silently skips invalid mappings instead of crashing
### Patch Application
The patch is automatically applied when the `readingorder_model` module is imported:
```python
from docling.models import _reading_order_patch
_reading_order_patch.apply_patch()
```
This ensures the fix is transparent to users and doesn't require code changes.
## Testing
Comprehensive tests were added in `tests/test_reading_order_patch.py` to verify:
1. **Patch application**: Confirms the monkey patch is correctly applied
2. **Basic functionality**: Verifies reading order model can be initialized
3. **Defensive checks**: Tests that edge cases don't raise KeyError
4. **l2r_map chains**: Validates proper handling of left-to-right mapping chains
5. **Invalid mappings**: Ensures invalid `r2l_map` references are handled gracefully
All tests pass successfully.
## Impact
- **Minimal changes**: Only adds a monkey patch, no modifications to core docling code
- **Backward compatible**: Existing functionality is preserved
- **Transparent**: Applied automatically, no user action required
- **Safe**: Adds defensive checks without modifying core logic
## Future Work
This is a temporary workaround. The proper fix should be submitted to the upstream `docling-ibm-models` repository. Once the upstream fix is released and the dependency is updated, this monkey patch can be removed.
## References
- Issue: [Link to GitHub issue]
- External package: `docling-ibm-models` (version 3.10.2)
- Affected file: `docling_ibm_models/reading_order/reading_order_rb.py`