* Prepare existing code for use with new multi-stage VLM pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add multithreaded VLM pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add VLM task interpreters
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add VLM task interpreters
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Remove prints
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix KeyboardInterrupt behaviour
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add VLLM backend support, optimize process_images
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Tweak defaults
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Implement proper batch inference for HuggingFaceTransformersVlmModel
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Small fixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Cleanup hf_transformers_model batching impl
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Adjust example instantiation of multi-stage VLM pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add GoT OCR 2.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Factor out changes without multi-stage pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Reset defaults for generation
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Cleanup
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add torch.compile, fix temperature setting in gen_kwargs
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Expose page_batch_size on CLI
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add torch_dtype bfloat16 to SMOLDOCLING and SMOLVLM model spec
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Clip off pad_token
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Notebook showing example on how to use docling transforms in DPK
Signed-off-by: Maroun Touma <touma@us.ibm.com>
* fix HF Token name
Signed-off-by: Maroun Touma <touma@us.ibm.com>
* use %pip instead of pip install jupyter lab
Signed-off-by: Maroun Touma <touma@us.ibm.com>
* run formatter
Signed-off-by: Maroun Touma <touma@us.ibm.com>
* add example to mkdocs and fix typo
Signed-off-by: Maroun Touma <touma@us.ibm.com>
---------
Signed-off-by: Maroun Touma <touma@us.ibm.com>
* fix(HTML): parse footer tag as a section in furniture
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(HTML): add test for body vs furniture in HTML parser.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* ⚡️ Speed up function `_parse_orientation` by 242%
Here’s how you should rewrite the code for **maximum speed** based on your profiler.
- The _bottleneck_ is the line
```python
orientations = df_osd.loc[df_osd["key"] == "Orientation in degrees"].value.tolist()
```
This filters the DataFrame (`loc`) and then materializes a list on every call, which is slow.
- We can **vectorize** this search (avoid repeated boolean masking and conversion).
- Instead of `.loc[df_osd["key"] == ...].value.tolist()`, use `.at[idx, 'value']` where `idx` is the first index where key matches, or better, `.values[0]` after a fast boolean mask.
- Since you only use the *first* matching value, you don’t need the full filtered column.
- You can optimize `parse_tesseract_orientation` by:
- Storing `CLIPPED_ORIENTATIONS` as a set for O(1) lookup if it isn't one already (the global can't be changed, so memoize it locally).
- Removing unnecessary steps.
**Here is your optimized code:**
**Why is this faster?**
- `_fast_get_orientation_value`:
- Avoids all index alignment overhead of `df.loc`.
- Uses numpy arrays under the hood (thanks to `.values`) for direct boolean masking and fast nonzero lookup.
- Fetches just the first match directly, skipping conversion to lists.
- Only fetches and processes the single cell you actually want.
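As an illustration of the points above, the following is a minimal sketch of a first-match lookup on the underlying NumPy arrays. It is not the code from this commit: the helper name `first_orientation_value` is hypothetical, and only the `key` and `value` column names and the `"Orientation in degrees"` label are taken from the snippet quoted earlier.

```python
import numpy as np
import pandas as pd


def first_orientation_value(df_osd: pd.DataFrame, key: str = "Orientation in degrees"):
    """Return the first `value` whose `key` matches, or None if nothing matches.

    Hypothetical helper for illustration; assumes `df_osd` has `key` and
    `value` columns, as in the quoted bottleneck line.
    """
    # Compare on the raw array to skip the index alignment done by `df.loc`.
    mask = df_osd["key"].to_numpy() == key
    # Positions of all matches; only the first one is used.
    positions = np.flatnonzero(mask)
    if positions.size == 0:
        return None
    # Fetch the single cell instead of building a list of all matches.
    return df_osd["value"].to_numpy()[positions[0]]
```

Returning `None` when there is no match is a choice made for this sketch; the original function may handle that case differently.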
**If you’re sure there’s always exactly one match:**
You can simplify `_fast_get_orientation_value` to.
Or, if always sorted and single.
---
- **No semantics changed.**
- **Comments unchanged unless part modified.**
This approach should substantially reduce the time spent in `_parse_orientation()` (the measured speedup was 242%), especially as the DataFrame grows.
Let me know if you want further micro-optimizations (e.g., Cython, pre-fetched numpy conversions, etc.)!
* fix: pandas vet error
* DCO Remediation Commit for mohammed <mohammed18200118@gmail.com>
I, mohammed <mohammed18200118@gmail.com>, hereby add my Signed-off-by to this commit: d9824749bb
Signed-off-by: mohammed <mohammed18200118@gmail.com>
* Dummy commit to trigger CI
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: mohammed <mohammed18200118@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
* re-implement links for html backend.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
* fix inline groups in list items. write specific test for find_parent_annotation of _extract_text_and_hyperlink_recursively.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
* implement hack for images.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
---------
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
* Add ability to preprocess VLM response
Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>
* Move response decoding to vlm options (requires inheritance to override). Per-page prompt formulation also moved to vlm options to keep api consistent.
Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>
---------
Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>
* feat: add convert_string to document-converter
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fix unsupported operand type(s) for |: type and NoneType
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added tests for convert_string
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
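For context on the `|` fix above: on Python versions before 3.10, an annotation such as `str | None` is evaluated at definition time and raises `TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'`. A minimal sketch of the portable spelling follows; the function `describe` is hypothetical and is not part of `convert_string`.

```python
from typing import Optional


def describe(name: Optional[str] = None) -> str:
    """Hypothetical example: `Optional[str]` works on every supported Python.

    Writing the annotation as `str | None` would fail at import time on
    Python < 3.10 unless `from __future__ import annotations` is used.
    """
    return "unnamed" if name is None else name
```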
* use re to stop at first non-digit
Signed-off-by: Maroun Touma <touma@us.ibm.com>
* Allow digit in first place followed by non numerical values
Signed-off-by: Maroun Touma <touma@us.ibm.com>
* refactor to match type checker
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Maroun Touma <touma@us.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
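The two parsing commits above describe reading digits with `re` and stopping at the first non-digit. A minimal sketch of that idea, assuming the goal is to extract a leading integer from a string; the helper name `leading_int` and the `None` fallback are assumptions, not the code from these commits.

```python
import re
from typing import Optional

# Matches one or more digits anchored at the start of the string.
_LEADING_DIGITS = re.compile(r"\d+")


def leading_int(text: str) -> Optional[int]:
    """Return the integer formed by the leading digits of `text`, if any."""
    # `match` only succeeds at the start, so parsing stops at the first non-digit.
    match = _LEADING_DIGITS.match(text.strip())
    return int(match.group()) if match else None
```

With this sketch, a value that starts with a digit and continues with non-numeric characters, such as `"3rd"`, still yields its leading number.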
fix(HTML): ensure correct concatenation of child strings in table cells and list items
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* Use device_map for transformer models
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add accelerate
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Relax accelerate min version
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Make pipeline cache+init thread-safe
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
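One common way to make a cache plus lazy initialization thread-safe is to guard both the lookup and the construction with a single lock. The sketch below illustrates that pattern only; the class `InitOnceCache` and its method names are hypothetical and do not reflect how the pipeline cache is implemented in this change.

```python
import threading
from typing import Any, Callable, Dict, Hashable


class InitOnceCache:
    """Hypothetical cache that builds each entry at most once across threads."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._items: Dict[Hashable, Any] = {}

    def get_or_create(self, key: Hashable, factory: Callable[[], Any]) -> Any:
        # Holding the lock during both the check and the construction prevents
        # two threads from initializing the same entry concurrently.
        with self._lock:
            if key not in self._items:
                self._items[key] = factory()
            return self._items[key]
```

A single coarse lock serializes initialization of unrelated keys as well; that trade-off keeps the sketch simple and is a design choice of this example.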
Fix a bug in parsing HTML tables in HTML backend.
Fix a bug in test file that prevented JATS backend tests.
Ensure that the JATS backend creates headings with the right level.
Remove unnecessary data files for testing JATS backend.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>