* fix(xlsx): deal with chartsheets in workbooks
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* tests(xlsx): align test file names
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* Fix header support in rich tables in HTML
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* cleaning up
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Compatibility with older Python versions
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fixing the "furniture before the first heading" rule
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added minimalistic test case
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* added html for the test
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
* add table raw cells when no table structure model was used
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Add RichTableCell instance for tables with missing structure.
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* update test results
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
* Rich tables support for HTML backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Decoupling the JATS backend from the HTML backend; the way tables are created changed significantly
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* updated and added tests
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Refactored parse_table_data in html_backend into a few smaller functions
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Changing the scope of a few functions in html_backend.py, making them static where possible
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fix for HTML tables that have tbody and/or thead; these tables are now also properly supported
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
* feat: add a backend parser for WebVTT files
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* docs: update README with VTT support
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* docs: add description to supported formats
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore: upgrade docling-core to unescape WebVTT in markdown
Pin the new release of docling-core 2.48.2.
Do not escape HTML reserved characters when exporting WebVTT documents to markdown.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* test: add missing copyright notice
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* updated the backend and pyproject.toml
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the version and test files
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* forgot to add 1 updated test-file
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* feat: Switch default layout model to DOCLING_LAYOUT_HERON. Update the unit test data.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Use default layout model in model_downloader default args
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update docling-models tag for TableFormer
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test GT (from linux CPU)
Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
* fix: Ensure that the visualisations happen on copies of the page image
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Pin docling-ibm-models to the fix branch for the ReadingOrderPredictor
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Update uv.lock
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Update tests GT to match the Heron layout model and the improved reading order model in Linux
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Introduce the optional verify_doctags parameter in conversion tests to control whether a doctags
comparison should take place. Skip doctags comparisons for certain tests.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Generate tests GT on Mac
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Remove the pinning of the docling-ibm-models and use the release 3.9.1
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-30-253.eu-central-1.compute.internal>
* chore(html): refactor parser to leverage context managers
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(html): parse inline code snippets, also from list items
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore(html): remove hidden tags
Remove tags that are not meant to be displayed.
Add regression tests for code blocks, inline code, and hidden tags.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(HTML): parse footer tag as a section in furniture
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* fix(HTML): add test for body vs furniture in HTML parser.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* re-implement links for html backend.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
* fix inline groups in list items. write specific test for find_parent_annotation of _extract_text_and_hyperlink_recursively.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
* implement hack for images.
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
---------
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
fix(HTML): ensure correct concatenation of child strings in table cells and list items
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Fix a bug in parsing HTML tables in the HTML backend.
Fix a bug in the test file that prevented the JATS backend tests from running.
Ensure that the JATS backend creates headings with the right level.
Remove unnecessary data files for testing JATS backend.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* Update tests to use default PDF backend (DPv4)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* OCR tests use DPv1 until rotation bugs are fixed
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Establish layout_model spec and example instantiations
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updated naming
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Back to uppercase constants
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix deps issue with openai-whisper>numba>llvmlite
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pull v1 changed test GT from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Integrate ListItemMarkerProcessor into document assembly
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update to final version
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update all test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Upgrade deps
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
* test: add test file and case for fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
* test: added groundtruth test files for fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
* fix: extraneous empty paragraphs for test files
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
---------
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com>
Co-authored-by: Michael Krissgau <michael.krissgau@ibm.com>
The AsciiDoc backend should not create an ImageRef with Size equal to None; instead, it should use default size values.
Refactor static methods as such and add the staticmethod decorator.
Extend the regression test for this fix.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* Keep page.parsed_page.textline_cells and page.cells in sync, including OCR
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Make page.parsed_page the only source of truth for text cells
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Small fix
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Correctly compute PDF boxes from pymupdf
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Use different OCR engine order
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add type hints and fix mypy
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* One more test fix
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Remove with pypdfium2_lock from caller sites
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix typing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>