nuridol
6efa96c983
feat: add support for ocrmac OCR engine on macOS ( #276 )
...
* feat: add support for `ocrmac` OCR engine on macOS
- Integrates `ocrmac` as an OCR engine option for macOS users.
- Adds configuration options and dependencies for `ocrmac`.
- Updates documentation to reflect new engine support.
This change allows macOS users to utilize `ocrmac` for improved OCR performance and compatibility.
Signed-off-by: Suhwan Seo <nuridol@gmail.com >
* updated the poetry lock
Signed-off-by: Suhwan Seo <nuridol@gmail.com >
* Fix linting issues, update CLI docs, and add error for ocrmac use on non-Mac systems
- Resolved formatting and linting issues
- Updated `--ocr-engine` CLI option documentation for `ocrmac`
- Added RuntimeError for attempts to use `ocrmac` on non-Mac platforms
Signed-off-by: Suhwan Seo <nuridol@gmail.com >
* docs: update examples and installation for ocrmac support
- Added `OcrMacOptions` to `custom_convert.py` and `full_page_ocr.py` examples.
- Included usage comments and examples for `OcrMacOptions` in OCR pipelines.
- Updated installation guide to include instructions for installing `ocrmac`, noting macOS version requirements (10.15+).
- Highlighted that `ocrmac` leverages Apple's Vision framework as an OCR backend.
This enhances documentation for users working on macOS to leverage `ocrmac` effectively.
Signed-off-by: Suhwan Seo <nuridol@gmail.com >
* fix: update `ocrmac` dependency with macOS-specific marker
- Added `sys_platform == 'darwin'` marker to the `ocrmac` dependency in `pyproject.toml` to specify macOS compatibility.
- Updated the content hash in `poetry.lock` to reflect the changes.
This ensures the `ocrmac` dependency is only installed on macOS systems.
Signed-off-by: Suhwan Seo <nuridol@gmail.com >
---------
Signed-off-by: Suhwan Seo <nuridol@gmail.com >
Co-authored-by: Suhwan Seo <nuridol@gmail.com >
2024-11-20 12:51:19 +01:00
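The non-Mac guard described in the commit above can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper name `ensure_macos` and the error message are hypothetical, and only the `sys.platform == 'darwin'` check mirrors the marker quoted in the commit message.

```python
import sys


def ensure_macos(platform: str = sys.platform) -> None:
    """Raise RuntimeError unless the platform is macOS (hypothetical helper)."""
    # "darwin" is the value of sys.platform on macOS.
    if platform != "darwin":
        raise RuntimeError(
            f"ocrmac is only available on macOS, but the platform is {platform!r}"
        )


if __name__ == "__main__":
    try:
        ensure_macos()
        print("macOS detected: ocrmac can be selected")
    except RuntimeError as exc:
        print(f"cannot use ocrmac: {exc}")
```

Failing early with a descriptive error is usually friendlier than letting a later import fail on a system where the dependency was never installed.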
Michele Dolfi
32ebf55e33
fix: propagate document limits to converter ( #388 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-11-20 08:36:51 +01:00
Shubham Gupta
3f91e7d3f1
feat: added support for exporting DocItem to an image when page image is available ( #379 )
...
* Updated minimum docling-core version to 2.4.0
Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com >
* Deprecated the generate_table_images option
Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com >
* Updated examples to use get_image instead of element.image
Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com >
---------
Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com >
2024-11-19 16:28:52 +01:00
Michele Dolfi
ed785ea122
feat: expose ocr-lang in CLI ( #375 )
...
* feat: expose ocr-lang in CLI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* use regex for supporting multiple sep
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-11-19 15:58:49 +01:00
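Splitting a language list on more than one separator with a regex, as the commit above mentions, might look like the sketch below. The commit does not say which separators are accepted, so the comma and semicolon used here are assumptions, and `split_languages` is a hypothetical name.

```python
import re
from typing import List

# Assumed separators (not stated in the commit): comma and semicolon.
_SEPARATORS = re.compile(r"[,;]")


def split_languages(value: str) -> List[str]:
    """Split a language string on any assumed separator, dropping blanks."""
    return [part.strip() for part in _SEPARATORS.split(value) if part.strip()]


if __name__ == "__main__":
    print(split_languages("en,fr;de"))
```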
Peter W. J. Staar
926dfd29d5
feat: added excel backend ( #334 )
...
* feat: added excel backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* first msexcel backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added tooling for the cli
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* first working version for excel parsing of tables
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added proper typing for mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* refactor EXCEL to XLSX
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added the unit tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* ran poetry lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* adding images to output [WIP]
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the msexcel
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the msexcel (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added tests for merged cells in excel
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
2024-11-19 12:21:17 +01:00
Maxim Lysak
7a97d7119f
feat: Extracting picture data for raster images found in PPTX ( #349 )
...
* Added picture data for pptx pictures
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added tests for pptx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Inferring image DPI from pptx file
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-18 15:22:28 +01:00
Michele Dolfi
ca8524ecae
docs: add automatic generation of CLI reference ( #325 )
...
* docs: add automatic generation of CLI reference
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* install deps for building CLI ref
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-11-15 13:18:17 +01:00
Maxim Lysak
8533039b0c
fix: Fixing images in the input Word files ( #330 )
...
* Fixing images identification in the input Word files
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Populating extracted image data into docling picture for wordx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated tests
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* removed base64 dependency in msword_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-14 13:33:34 +01:00
Michele Dolfi
8b437adcde
fix: reduce logging by keeping option for more verbose ( #323 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-11-13 10:08:24 +01:00
Michele Dolfi
c9341bf22e
fix: skip glm model downloads ( #322 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-11-13 08:45:28 +01:00
Maxim Lysak
fb8ba861e2
fix: Handling of single-cell tables in DOCX backend ( #314 )
...
* Handling of single-cell tables in DOCX backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* returned try-catch on tables handling
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* cleaned
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* proceed with processing the content of a single-cell table as if it's just part of the body
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added example of tricky 1-cell table docx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-12 15:20:55 +01:00
Christoph Auer
5d4a10b121
fix: Configure env prefix for docling settings ( #315 )
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-11-12 10:57:16 +01:00
Nikos Livathinos
c6b3763ecb
feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning ( #290 )
...
- When the OCR is forced, any existing PDF cells are rejected.
- Introduce the force-ocr cmd parameter in docling CLI.
- Update unit tests.
- Add the full_page_ocr.py example in mkdocs.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
2024-11-12 09:46:14 +01:00
Maxim Lysak
81c8243a8b
fix: Added handling of grouped elements in pptx backend ( #307 )
...
* Added handling of grouped elements in pptx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* updated log.warn to warning
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-11 16:38:21 +01:00
Maxim Lysak
53bf2d1790
Added handling of code blocks in html with <pre> tag ( #302 )
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-11 15:00:11 +01:00
Michele Dolfi
97f214efdd
fix: allow mps usage for easyocr ( #286 )
...
* fix: allow mps usage for easyocr
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add example for cpu-only
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* comment out example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-11-10 14:26:17 +01:00
Nikos Livathinos
0eb065e9b6
fix(EasyOcrModel): Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr ( #282 )
...
fix(EasyOcrModel): Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr without GPU if MPS is available.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
2024-11-08 16:48:41 +01:00
Nikos Livathinos
704d792a79
fix(tesserocr): Raise Exception if tesserocr has not loaded any languages ( #279 )
...
fix(TesseractOcrModel): Raise Exception if tesserocr has not loaded any languages. Provide a descriptive error message.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
2024-11-08 13:03:09 +01:00
Ikko Eltociear Ashimine
c3098e3c12
chore: fix typo ( #241 )
...
* chore: update pypdfium2_backend.py
occured -> occurred
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com >
* chore: update docling_parse_backend.py
occured -> occurred
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com >
* chore: update docling_parse_v2_backend.py
occured -> occurred
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com >
---------
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com >
2024-11-05 16:20:04 +01:00
Michele Dolfi
40ad987303
feat: pdf backend, table mode as options and artifacts path ( #203 )
...
* feat: add more options in the CLI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update CLI docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* expose artifacts-path as argument
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-11-04 14:26:05 +01:00
Johnny Salazar
af323c04ef
fix: Specify encoding when writing output file ( #214 )
...
Specify the encoding when writing the output file to avoid errors when the default target encoding cannot represent all characters. UTF-8 seems like the most universal and widely supported encoding. Otherwise, the CLI fails with encoding errors when the input file contains Unicode text (which is most files nowadays) and the target system's default encoding is a single-byte charset such as cp1252.
Signed-off-by: Johnny Salazar <cepera.ang@gmail.com >
2024-11-04 14:24:13 +01:00
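The failure mode described in the commit above can be reproduced with a short sketch. The helper names `write_output` and `encodable` are hypothetical and are not taken from the docling CLI; the point is only that passing `encoding="utf-8"` makes the write independent of the system default.

```python
import tempfile
from pathlib import Path


def write_output(path: Path, text: str) -> None:
    """Write text with an explicit encoding instead of the locale default."""
    with path.open("w", encoding="utf-8") as fp:
        fp.write(text)


def encodable(text: str, encoding: str) -> bool:
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    sample = "文書 → ✓"
    # A single-byte charset such as cp1252 cannot represent this text.
    print("fits in cp1252:", encodable(sample, "cp1252"))
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "out.md"
        write_output(out, sample)
        print("round trip ok:", out.read_text(encoding="utf-8") == sample)
```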
Michele Dolfi
904d24d600
fix: allow to explicitly initialize the pipeline ( #189 )
...
* feat: allow to explicitly initialize the pipeline
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* clean examples
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-10-30 17:54:53 +01:00
Christoph Auer
2a2c65bf4f
feat: Add pipeline timings and toggle visualization, establish debug settings ( #183 )
...
* Add settings to turn visualization on or off
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add profiling code to all models
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Refactor and fix profiling codes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Visualization codes output PNG to debug dir
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fixes for time logging
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Optimize imports
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add start_timestamps to ProfilingItem
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-30 15:04:19 +01:00
Peter W. J. Staar
f542460af3
fix: fix duplicate title and heading + add e2e tests for html and docx ( #186 )
...
* add real e2e tests for html and docx
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the output of itxt
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the text
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the tests (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the examples (1)
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the output of the test
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the tests, moved the ground-truth
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* moved the ground-truth data
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the html tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* restructure title fix (#187 )
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-10-30 13:14:56 +01:00
Panos Vagenas
b9f5c74a7d
fix: fix header levels for DOCX & HTML ( #184 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-10-28 17:02:52 +01:00
Maxim Lysak
94d0729c50
fix: handling of long sequence of unescaped underscore chars in markdown ( #173 )
...
* Fix for md hanging when encountering long sequence of unescaped underscore chars
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added comment explaining reason for fix
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fixed trailing inline text handling (at the end of a file), and corrected underscore sequence shortening
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* making fix more rare
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-10-28 16:34:48 +01:00
Maxim Lysak
7d19418b77
fix: HTML backend, fixes for Lists and nested texts ( #180 )
...
* Fixes for HTML backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* removed prints
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* cleaning up
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-10-25 20:14:04 +02:00
Maxim Lysak
88c1673057
fix: MD Backend, fixes to properly handle trailing inline text and emphasis in headers ( #178 )
...
* Small fix to properly handle trailing inline text in the md backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added proper handling of headers with bold, italic or emphasis
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* removed print
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Made smarter processing of headers, with arbitrary styling
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated docling-core to 2.2.1
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated tests because of the change in Markdown export in docling-core
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-10-25 18:02:20 +02:00
Peter W. J. Staar
4116819b51
feat: Update to docling-parse v2 without history ( #170 )
...
* updated the pyproject (still need to run poetry lock after docling-parse is accepted)
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* Update imports for docling_parse.pdf_parser_v1
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Lock docling-parse 2.0.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* repin poetry.lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2024-10-23 17:20:11 +02:00
Christoph Auer
3023f18ba0
feat: Support AsciiDoc and Markdown input format ( #168 )
...
* updated the base-model and added the asciidoc_backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the asciidoc backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* Ensure all models work only on valid pages (#158 )
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* ci: run ci also on forks (#160 )
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
* fix: fix legacy doc ref (#162 )
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
* docs: typo fix (#155 )
* Docs: Typo fix
- Corrected spelling of invidual to automatic
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com >
* add synchronize event for forks
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
* feat: add coverage_threshold to skip OCR for small images (#161 )
* feat: add coverage_threshold to skip OCR for small images
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* filter individual boxes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* rename option
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* chore: bump version to 2.1.0 [skip ci]
* adding tests for asciidocs
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* first working asciidoc parser
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* adding test_02.asciidoc
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* Drafting Markdown backend via Marko library
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* work in progress on MD backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* md_backend produces docling document with headers, paragraphs, lists
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Improvements in md parsing
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Detecting and assembling tables in markdown in temporary buffers
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added initial docling table support to md_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Cleaned code, improved logging for MD
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fixes MyPy requirements, and rest of pre-commit
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fixed example run_md, added origin info to md_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* working on asciidocs, struggling with ImageRef
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* able to parse the captions and image URIs
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* Update all backends with proper filename in DocumentOrigin
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update to docling-core v2.1.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fixes for MD Backend, to avoid duplicated text inserts into docling doc
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fix styling
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Added support for code blocks and fenced code in MD
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* cleaned prints
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added proper processing of in-line textual elements for MD backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fixed issues with duplicated paragraphs and incorrect lists in pptx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fixed issue with group ordering in pptx backend, added debug log into run with formats
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Peter Staar <taa@zurich.ibm.com >
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-10-23 16:14:26 +02:00
Michele Dolfi
3496b4838f
fix: set valid=false for invalid backends ( #171 )
...
* fix: set valid=false for invalid backends
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* Add test case for InputDocument
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-23 15:52:30 +02:00
Michele Dolfi
b346faf622
feat: add coverage_threshold to skip OCR for small images ( #161 )
...
* feat: add coverage_threshold to skip OCR for small images
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* filter individual boxes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* rename option
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-10-18 13:58:23 +02:00
Panos Vagenas
63bef59d9e
fix: fix legacy doc ref ( #162 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-10-18 13:11:20 +02:00
Christoph Auer
a00c937e19
Ensure all models work only on valid pages ( #158 )
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-18 08:54:06 +02:00
Christoph Auer
7d3be0edeb
feat!: Docling v2 ( #117 )
...
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-10-16 21:02:03 +02:00
Michele Dolfi
2b1e72d327
refactor: fix type of tesseractocr options ( #140 )
...
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
2024-10-14 08:40:22 +02:00
Christoph Auer
5e4944f15f
feat: new experimental docling-parse v2 backend ( #131 )
...
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2024-10-11 15:12:49 +02:00
Nikos Livathinos
dae2a3b667
fix: remove stderr from tesseract cli and introduce fuzziness in the text validation of OCR tests ( #138 )
...
* feat(OCR tests): Introduce fuzziness in the text validation of OCR tests
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
* fix(TesseractOcrCliModel): Send the stderr to devnull to avoid polluting the console with messages from tesseract cmd
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
2024-10-11 10:21:19 +02:00
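Sending a child process's stderr to devnull, as the commit above describes for the tesseract command, can be sketched with the standard library. This is a generic illustration rather than the code in `TesseractOcrCliModel`; `run_quietly` is a hypothetical helper, and the demo uses the Python interpreter as a stand-in for the tesseract binary.

```python
import subprocess
import sys
from typing import List


def run_quietly(cmd: List[str]) -> str:
    """Run cmd, capture stdout, and discard anything written to stderr."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # diagnostics never reach the console
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in child process that writes to both streams.
    child = [
        sys.executable,
        "-c",
        "import sys; print('text'); print('noise', file=sys.stderr)",
    ]
    print(run_quietly(child).strip())
```

The trade-off is that genuine error output is discarded as well, so the return code (enforced here through `check=True`) becomes the main failure signal.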
Michele Dolfi
f96ea86a00
feat: add options for choosing OCR engines ( #118 )
...
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com >
Co-authored-by: Peter Staar <taa@zurich.ibm.com >
2024-10-08 19:07:08 +02:00
Fasal Shah
d412c363d7
fixed unload pdf backend resources ( #129 )
...
Signed-off-by: faisal shah <fashah@redhat.com >
Co-authored-by: faisal shah <fashah@redhat.com >
2024-10-08 10:46:43 +02:00
Maxim Lysak
2422f706a1
feat: new torch-based docling models ( #120 )
...
---------
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com >
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com >
2024-10-03 18:42:33 +02:00
Christoph Auer
d6df76f90b
feat: Support tableformer model choice ( #90 )
...
* Support tableformer model choice
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update datamodel structure
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update docs
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Cleanup
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add test unit for table options
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Ensure import backwards-compatibility for PipelineOptions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update README
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Adjust parameters on custom_convert
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
* Update Dockerfile
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
2024-09-26 21:37:08 +02:00
Michele Dolfi
6a03c208ec
feat: add figure in markdown ( #98 )
...
* feat: add figures in markdown
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update to new docling-core and update test results with figures
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update with improved docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-24 17:28:23 +02:00
Panos Vagenas
d96b96c848
fix: fix OCR setting for pypdfium, minor refactor ( #102 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-09-24 14:36:00 +02:00
Panos Vagenas
3c46e4266c
feat: add URL support to CLI ( #99 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-09-24 08:47:53 +02:00
Peter W. J. Staar
4794ce460a
fix: updated the render_as_doctags with the new arguments from docling-core ( #93 )
...
* updated the render_as_doctags with the new arguments from docling-core
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* ensuring that docling-core is >1.5.0 to accommodate the latest export-to-doctags parameters
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added the doctags tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the README
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fix poetry lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* Fix formatting problems
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* fixed the doctag export in docling/utils/export.py
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* propagate xsize and ysize
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2024-09-23 20:12:18 +02:00
Michele Dolfi
f19bd43798
feat: add table exports ( #86 )
...
* feat: expose docling-core table exporters and add examples
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* remove temp internal implementation of html export
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* pin latest docling-core 1.4.0 with table exports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-18 08:44:13 +02:00
Michele Dolfi
2870fdc857
fix: CLI compatibility with python 3.10 and 3.11 ( #79 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-16 12:32:45 +02:00
Peter W. J. Staar
98990784df
feat: add docling cli ( #75 )
...
* chore: add simple convert script
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted all
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added default arg
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* use typer for the docling CLI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* describe output when saving
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add tests for CLI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add export options
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-13 14:03:09 +02:00
Michele Dolfi
8aa476ccd3
test: improve typing definitions (part 1) ( #72 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-12 15:56:29 +02:00