docling/docling
mohammed ahmed 8820b5558b perf: speed up function _parse_orientation (#1934)
* Speed up function `_parse_orientation` by 242%
Here’s how you should rewrite the code for **maximum speed** based on your profiler.

- The _bottleneck_ is the line  
  ```python
  orientations = df_osd.loc[df_osd["key"] == "Orientation in degrees"].value.tolist()
  ```
  This does a dataframe filtering (`loc`) and then materializes a list for every call, which is slow.

- We can **vectorize** this search (avoid repeated boolean masking and conversion).
    - Instead of `.loc[df_osd["key"] == ...].value.tolist()`, use `.at[idx, 'value']` where `idx` is the first index where key matches, or better, `.values[0]` after a fast boolean mask.  
    - Since you only use the *first* matching value, you don’t need the full filtered column.

- You can optimize `parse_tesseract_orientation` by:
    - Storing `CLIPPED_ORIENTATIONS` as a set for O(1) lookup if it isn't already (can't change the global so just memoize locally).
    - Remove unnecessary steps.
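
As a rough illustration of the set-lookup idea in the last bullet (this is not the code from this commit): the sketch below keeps a local `frozenset` of allowed angles so the membership test is O(1). The names `ALLOWED_ORIENTATIONS` and `parse_orientation_sketch` are hypothetical stand-ins, and the allowed values are an assumption based on Tesseract OSD reporting orientations in 90-degree steps.

```python
# Hypothetical stand-in for the module-level CLIPPED_ORIENTATIONS global;
# the concrete values are an assumption for this sketch.
ALLOWED_ORIENTATIONS = frozenset({0, 90, 180, 270})


def parse_orientation_sketch(orientation: str) -> int:
    """Parse an orientation string and validate it with an O(1) set lookup."""
    parsed = int(orientation)
    if parsed not in ALLOWED_ORIENTATIONS:
        raise ValueError(f"Unexpected orientation value: {parsed}")
    return parsed


print(parse_orientation_sketch("90"))
```

If the existing global is already a set or frozenset, a local copy like this brings no benefit; it only matters when the global is a list or tuple.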

**Here is your optimized code:**



**Why is this faster?**

- `_fast_get_orientation_value`:  
  - Avoids all index alignment overhead of `df.loc`.
  - Uses numpy arrays under the hood (thanks to `.values`) for direct boolean masking and fast nonzero lookup.
  - Fetches just the first match directly, skipping conversion to lists.
- Only fetches and processes the single cell you actually want.
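
The points above can be sketched as follows. This is a minimal illustration under stated assumptions, not the helper from this commit: `first_orientation_value` is a hypothetical name, and the sample `df_osd` contents are invented for the example. It builds a boolean mask on the raw NumPy array and reads only the first matching cell.

```python
import pandas as pd


def first_orientation_value(df_osd: pd.DataFrame) -> str:
    """Hypothetical helper: return the first 'Orientation in degrees' value."""
    # Compare on the underlying NumPy array to skip pandas index alignment.
    mask = df_osd["key"].to_numpy() == "Orientation in degrees"
    idx = mask.nonzero()[0]
    if idx.size == 0:
        raise ValueError("no 'Orientation in degrees' row found")
    # Read a single cell instead of materializing a filtered list.
    return df_osd["value"].to_numpy()[idx[0]]


# Invented sample data in the key/value layout the commit message implies.
df_osd = pd.DataFrame(
    {
        "key": ["Page number", "Orientation in degrees", "Rotate", "Script"],
        "value": ["0", "90", "270", "Latin"],
    }
)

# The original expression and the sketch agree on the first match.
slow = df_osd.loc[df_osd["key"] == "Orientation in degrees"].value.tolist()[0]
fast = first_orientation_value(df_osd)
assert slow == fast
```

Raising on an empty match is a design choice made for this sketch; the original expression would instead yield an empty list, so the caller's handling of that case would need to be checked.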

**If you’re sure there’s always exactly one match:**  
You can simplify `_fast_get_orientation_value` to:



Or, if the rows are always sorted and there is a single match:


---

- **No semantics changed.**
- **Comments unchanged unless part modified.**

This approach should noticeably reduce the time spent in `_parse_orientation()` (the measured speedup for this change is 242%), and the benefit should grow as the DataFrame grows.  
Let me know if you want further micro-optimizations (e.g., Cython, pre-fetched numpy conversions, etc.)!

* fix: pandas vet error

* DCO Remediation Commit for mohammed <mohammed18200118@gmail.com>

I, mohammed <mohammed18200118@gmail.com>, hereby add my Signed-off-by to this commit: d9824749bb

Signed-off-by: mohammed <mohammed18200118@gmail.com>

* Dummy commit to trigger CI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: mohammed <mohammed18200118@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-08-19 10:55:18 +02:00