fix(HTML): ensure correct concatenation of child strings in table cells and list items
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* Adding plain latex equations to table cells
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Adding test files
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Initial plan
* Fix multi-page TIFF image support
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>
* add RGB conversion
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Remove pointless test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add multi-page TIFF test data and verification tests
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>
* Revert "Add multi-page TIFF test data and verification tests"
This reverts commit 130a10e2d9.
* Proper test for 2 page tiff file
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* DCO Remediation Commit for copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 420df478f3
I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: c1d722725f
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 6aa85cc933
I, copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 130a10e2d9
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: d571f36299
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 2aab66288b
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Proper test for 2 page tiff file (2)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
* A new HTML backend that handles styled HTML (the styling is ignored) as well as images.
Images are parsed as placeholders with a caption, if it exists.
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: vaaale <2428222+vaaale@users.noreply.github.com>
Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>
* tests(HTML): re-enable test_ordered_lists
Re-enable test_ordered_lists regression test for the HTML backend since
docling-core now supports ordered lists with custom start value.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Alexander Vaagan <alexander.vaagan@gmail.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: vaaale <2428222+vaaale@users.noreply.github.com>
Co-authored-by: Alexander Vaagan <2428222+vaaale@users.noreply.github.com>
* docs: add documentation for confidence scores
Signed-off-by: Fabiano Franz <contact@fabianofranz.com>
* Increase focus on confidence grades, scores are informational only
Signed-off-by: Fabiano Franz <contact@fabianofranz.com>
* Update confidence_scores.md
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
---------
Signed-off-by: Fabiano Franz <contact@fabianofranz.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
* Use device_map for transformer models
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add accelerate
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Relax accelerate min version
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Make pipeline cache+init thread-safe
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
I, codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 3b8deae9ce
I, mohammed <mohammed18200118@gmail.com>, hereby add my Signed-off-by to this commit: bd8b1c42d4
I, mohammed <mohammed18200118@gmail.com>, hereby add my Signed-off-by to this commit: 7b84668e63
I, mohammed <mohammed18200118@gmail.com>, hereby add my Signed-off-by to this commit: ad90f337bc
Signed-off-by: mohammed <mohammed18200118@gmail.com>
Fix a bug in parsing HTML tables in HTML backend.
Fix a bug in test file that prevented JATS backend tests.
Ensure that the JATS backend creates headings with the right level.
Remove unnecessary data files for testing JATS backend.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* Initial plan
* Fix granite vision model URL from preview to stable version
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>
* Update to granite vision 3.3
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
* Update to granite vision 3.3 (2)
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
---------
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>
* Update tests to use default PDF backend (DPv4)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* OCR tests use DPv1 until rotation bugs are fixed
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Establish layout_model spec and example instantiations
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updated naming
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Back to uppercase constants
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix deps issue with openai-whisper>numba>llvmlite
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pull v1 changed test GT from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Here are targeted optimizations based on the profiling output and the code.
### Major bottlenecks & optimization strategies
#### 1. `_process_special_clusters`:
- **Main bottleneck:**
- The nested loop: for each special cluster, loop through all regular clusters and compute `.bbox.intersection_over_self(special.bbox)`.
- This is `O(N*M)` for N special and M regular clusters and is by far the slowest part.
- **Optimization:**
- **Pre-index regular clusters by bounding box for fast containment:**
- Build a simple R-tree-like spatial grid (using bins, or just a fast bbox filtering pass) to filter out regular clusters that are definitely non-overlapping before running the expensive geometric calculation.
- **If spatial index unavailable:** Pre-filter regulars to those whose bbox intersects the special’s bbox (quick min/max bbox checks), greatly reducing pairwise calculations.
#### 2. `_handle_cross_type_overlaps`:
- **Similar bottleneck:** Again, checking every regular cluster for every wrapper.
- We can apply the same bbox quick-check.
#### 3. Miscellaneous.
- **`_deduplicate_cells`/`_sort_cells` optimizations:** Minor, but batch sort/unique patterns can help.
- **Avoid recomputation:** Avoid recomputing thresholds/constants in hot loops.
Below is the optimized code addressing the biggest O(N*M) loop, using fast bbox intersection check for quick rejection before expensive calculation.
We achieve this purely with local logic in the function (no external indices needed) and without introducing module-level classes.
Comments in the code indicate all changes.
**Summary of changes:**
- For both `_process_special_clusters` and `_handle_cross_type_overlaps`, we avoid unnecessary `.intersection_over_self` calculations by pre-filtering clusters based on simple bbox intersection conditions (`l < rx and r > lx and t < by and b > ty`).
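- This turns the expensive O(N*M) geometric checks into a two-stage filter: the cheap bbox comparison still runs for every pair, but the costly overlap computation runs only for pairs whose bboxes actually intersect, which is a small fraction for typical bbox distributions.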
- All hot-spot loops now use local variables rather than repeated attribute lookups.
- No changes are made to APIs, outputs, or major logic branches; only faster candidate filtering is introduced.
This should reduce total runtime of `_process_special_clusters` and `_handle_cross_type_overlaps` by an order of magnitude on large documents.
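The two-stage filter described above can be sketched as follows. This is a minimal illustration, not the code from this commit: the `BBox` dataclass, the `contained_regulars` helper, and the `threshold` value are hypothetical stand-ins, and `intersection_over_self` is assumed here to mean intersection area divided by the area of the box it is called on.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BBox:
    """Hypothetical stand-in for the real bounding-box type (top-left origin)."""
    l: float
    t: float
    r: float
    b: float

    def area(self) -> float:
        return max(0.0, self.r - self.l) * max(0.0, self.b - self.t)

    def intersection_over_self(self, other: "BBox") -> float:
        # Assumed semantics: intersection area divided by this box's own area.
        w = min(self.r, other.r) - max(self.l, other.l)
        h = min(self.b, other.b) - max(self.t, other.t)
        own_area = self.area()
        if w <= 0 or h <= 0 or own_area == 0:
            return 0.0
        return (w * h) / own_area


def contained_regulars(
    special: BBox, regulars: List[BBox], threshold: float = 0.8
) -> List[BBox]:
    """Return regular boxes that lie mostly inside the special box."""
    # Hoist the special box's coordinates into locals to avoid repeated
    # attribute lookups inside the hot loop.
    lx, ty, rx, by = special.l, special.t, special.r, special.b
    result = []
    for box in regulars:
        # Stage 1: cheap rejection of boxes that cannot overlap at all.
        if not (box.l < rx and box.r > lx and box.t < by and box.b > ty):
            continue
        # Stage 2: the costly overlap ratio, only for surviving candidates.
        if box.intersection_over_self(special) > threshold:
            result.append(box)
    return result


special = BBox(0, 0, 10, 10)
regulars = [BBox(1, 1, 3, 3), BBox(20, 20, 30, 30), BBox(8, 8, 14, 14)]
kept = contained_regulars(special, regulars)
```

In this sketch the second box is rejected by the stage 1 comparison alone, while the third passes stage 1 but falls below the overlap threshold in stage 2. The loop still visits every pair, so the gain comes from skipping the expensive call, not from changing the asymptotic complexity.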
* Unify temperature options for Vlm models
* Dynamic prompt support with example
* DCO Remediation Commit for Shkarupa Alex <shkarupa.alex@gmail.com>
I, Shkarupa Alex <shkarupa.alex@gmail.com>, hereby add my Signed-off-by to this commit: 34d446cb98
I, Shkarupa Alex <shkarupa.alex@gmail.com>, hereby add my Signed-off-by to this commit: 9c595d574f
Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>
* Replace Page with SegmentedPage
* Fix example HF repo link
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
* Sign-off
Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>
* DCO Remediation Commit for Shkarupa Alex <shkarupa.alex@gmail.com>
I, Shkarupa Alex <shkarupa.alex@gmail.com>, hereby add my Signed-off-by to this commit: 1a162066dd
Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>
* Use lmstudio-community model
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
* Swap inference engine to LM Studio
Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>
---------
Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
* Move expensive imports closer to usage
Signed-off-by: William Easton <bill.easton@elastic.co>
* DCO Remediation Commit for William Easton <bill.easton@elastic.co>
I, William Easton <bill.easton@elastic.co>, hereby add my Signed-off-by to this commit: 8a7412ce5b
Signed-off-by: William Easton <bill.easton@elastic.co>
* formatting fixes
Signed-off-by: William Easton <bill.easton@elastic.co>
* DCO Remediation Commit for William Easton <bill.easton@elastic.co>
I, William Easton <bill.easton@elastic.co>, hereby add my Signed-off-by to this commit: 8a7412ce5b
I, William Easton <bill.easton@elastic.co>, hereby add my Signed-off-by to this commit: 963e343250
Signed-off-by: William Easton <bill.easton@elastic.co>
* Fix baseocrmodel test issue
Signed-off-by: William Easton <bill.easton@elastic.co>
---------
Signed-off-by: William Easton <bill.easton@elastic.co>
* Integrate ListItemMarkerProcessor into document assembly
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update to final version
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update all test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Upgrade deps
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* updated granite model version
* DCO Remediation Commit for Miriyala Pranay <miriyalapranay146@gmail.com>
I, Miriyala Pranay <miriyalapranay146@gmail.com>, hereby add my Signed-off-by to this commit: 5de0d5034c
Signed-off-by: Miriyala Pranay <miriyalapranay146@gmail.com>
---------
Signed-off-by: Miriyala Pranay <miriyalapranay146@gmail.com>