diff --git a/docs/reading_order_patch.md b/docs/reading_order_patch.md
new file mode 100644
index 00000000..080f96ec
--- /dev/null
+++ b/docs/reading_order_patch.md
@@ -0,0 +1,83 @@
+# Reading Order Patch Documentation
+
+## Overview
+
+This document explains the monkey patch applied to fix a `KeyError` in the `docling_ibm_models.reading_order.reading_order_rb` module.
+
+## Problem Description
+
+Users encountered a `KeyError` when converting certain PDF files using docling. The error occurred in the reading order prediction phase:
+
+```
+KeyError: 22
+  File "docling_ibm_models/reading_order/reading_order_rb.py", line 366, in _init_ud_maps
+    self.dn_map[i].append(j)
+```
+
+The missing key (22 in this example) varied depending on the PDF being processed.
+
+## Root Cause
+
+The `_init_ud_maps` method in `docling_ibm_models.reading_order.reading_order_rb` performs the following operations:
+
+1. Initializes `dn_map` and `up_map` dictionaries for all page elements (indices 0 to N-1)
+2. Iterates through elements to build spatial relationships
+3. When processing element `j`, it may follow a left-to-right mapping chain:
+   ```python
+   while i in state.l2r_map:
+       i = state.l2r_map[i]
+   ```
+4. After following the chain, it accesses `state.dn_map[i]`
+
+The `KeyError` occurs when the final value of `i`, reached by following the `l2r_map` chain, does not exist in `dn_map`.
+This can happen when:
+- The maps are reinitialized with a different element list
+- Invalid mappings exist in `r2l_map`
+
+## Solution
+
+A monkey patch was created in `/docling/models/_reading_order_patch.py` that adds:
+
+1. **Defensive checks for map access**: Before accessing `dn_map[i]` or `up_map[j]`, the patch verifies the key exists
+2. **Infinite loop prevention**: Tracks the original value of `i` and breaks if a circular reference is detected
+3. **Graceful degradation**: Silently skips invalid mappings instead of crashing
+
+### Patch Application
+
+The patch is automatically applied when the `readingorder_model` module is imported:
+
+```python
+from docling.models import _reading_order_patch
+_reading_order_patch.apply_patch()
+```
+
+This ensures the fix is transparent to users and doesn't require code changes.
+
+## Testing
+
+Tests were added in `tests/test_reading_order_patch.py` to verify:
+
+1. **Patch application**: Confirms the monkey patch is correctly applied
+2. **Basic functionality**: Verifies the reading order model can be initialized
+3. **Defensive checks**: Tests that edge cases don't raise `KeyError`
+4. **l2r_map chains**: Validates proper handling of left-to-right mapping chains
+5. **Invalid mappings**: Ensures invalid `r2l_map` references are handled gracefully
+
+All tests pass successfully.
+
+## Impact
+
+- **Minimal changes**: Only adds a monkey patch, no modifications to core docling code
+- **Backward compatible**: Existing functionality is preserved
+- **Transparent**: Applied automatically, no user action required
+- **Safe**: Adds defensive checks without modifying core logic
+
+## Future Work
+
+This is a temporary workaround. The proper fix should be submitted to the upstream `docling-ibm-models` repository. Once the upstream fix is released and the dependency is updated, this monkey patch can be removed.
+
+## References
+
+- Issue: [Link to GitHub issue]
+- External package: `docling-ibm-models` (version 3.10.2)
+- Affected file: `docling_ibm_models/reading_order/reading_order_rb.py`
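
Reviewer sketch (not part of the diff): the three behaviours listed under Solution could look roughly like the code below. This is an illustration only, not the contents of `_reading_order_patch.py`. The function name `safe_link` and its arguments are hypothetical, and the maps are assumed to be plain dicts of lists. It also uses a visited set instead of only remembering the original `i`, so it catches cycles that do not pass through the starting element.

```python
def safe_link(i, j, l2r_map, dn_map, up_map):
    """Record that element j sits below element i, defensively.

    Hypothetical helper: returns True if the link was recorded,
    False if the mapping was skipped.
    """
    seen = {i}
    # Follow the left-to-right chain, stopping if it loops back on itself.
    while i in l2r_map:
        i = l2r_map[i]
        if i in seen:
            return False  # circular reference: skip this mapping
        seen.add(i)

    # Only touch the maps when both keys exist (avoids the KeyError).
    if i not in dn_map or j not in up_map:
        return False

    dn_map[i].append(j)
    up_map[j].append(i)
    return True


dn_map = {0: [], 1: [], 2: []}
up_map = {0: [], 1: [], 2: []}

# Chain 0 -> 1 ends at a valid key, so the link is recorded on element 1.
print(safe_link(0, 2, {0: 1}, dn_map, up_map))   # True
# Chain 0 -> 22 ends at a key missing from dn_map, so the link is skipped.
print(safe_link(0, 2, {0: 22}, dn_map, up_map))  # False
```

Returning a boolean instead of skipping silently is a choice made for this sketch; it would let a caller log how many mappings were dropped for a given PDF.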