Christoph Auer
a66c4ee8eb
Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction
2024-10-15 14:58:10 +02:00
Christoph Auer
27f4ed3620
Enable mypy and fix many reported errors
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-15 14:58:00 +02:00
Maxim Lysak
115435a835
Fixes for lists handling in docx
...
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com >
2024-10-15 14:33:37 +02:00
Christoph Auer
fa5d972291
Merge remaining changes from main
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-15 10:52:16 +02:00
Christoph Auer
dac82ca7f2
Import statement updates from docling-core
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-15 10:11:10 +02:00
Christoph Auer
8710506072
Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction
2024-10-15 09:50:18 +02:00
Christoph Auer
afafb97b87
Update CLI
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-15 09:50:06 +02:00
Maxim Lysak
aa22fd31db
small corrections to pptx
2024-10-15 09:43:06 +02:00
Christoph Auer
a50ba57a1f
Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction
2024-10-14 16:36:20 +02:00
Christoph Auer
497ddb34a8
Big refactoring for legacy_document support
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-14 16:36:11 +02:00
Maxim Lysak
e87bf9ae06
Updated pptx backend, fixes issues with lists, also added more different list cases to example
...
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com >
2024-10-14 16:20:17 +02:00
Christoph Auer
6efcf0a5a5
Add image format support to PdfBackend
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-11 16:47:15 +02:00
Christoph Auer
d0fccb9342
Merge from simplify-conv-api
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-11 15:57:08 +02:00
Christoph Auer
95c1f80087
Change code to use unordered/ordered list, robustifications
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-11 14:53:38 +02:00
Panos Vagenas
136f16e85a
feat!: simplify conversion API ( #139 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-10-11 14:52:37 +02:00
Michele Dolfi
786b89efd9
Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction
2024-10-11 12:59:11 +02:00
Michele Dolfi
c6e1471e02
use options objects
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-10-11 12:58:59 +02:00
Christoph Auer
3ee97c42b2
Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction
2024-10-11 12:57:56 +02:00
Christoph Auer
52713f0cf5
Optionally produce legacy_doc
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-11 12:57:47 +02:00
Michele Dolfi
cc9bcc424d
fix generation enabled
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-10-11 11:49:38 +02:00
Michele Dolfi
331ab36f04
Merge remote-tracking branch 'origin/main' into cau/input-format-abstraction
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-10-11 11:23:04 +02:00
Christoph Auer
304d16029a
More renaming, design enrichment interface
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-11 10:21:31 +02:00
Nikos Livathinos
dae2a3b667
fix: remove stderr from tesseract cli and introduce fuzziness in the text validation of OCR tests ( #138 )
...
* feat(OCR tests): Introduce fuzziness in the text validation of OCR tests
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
* fix(TesseractOcrCliModel): Send the stderr to devnull to avoid polluting the console with messages from tesseract cmd
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
2024-10-11 10:21:19 +02:00
Christoph Auer
7aad3dc946
Update test cases for v2
2024-10-10 18:51:19 +02:00
Christoph Auer
cd72ea2412
Added verify_conversion_result_v2, Regenerate v1 and v2 test data
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-10 18:30:54 +02:00
Michele Dolfi
3794f8245e
add example PNG
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-10-10 18:29:26 +02:00
Christoph Auer
99cfea38d6
Added verify_conversion_result_v2, Regenerate v1 and v2 test data
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-10 15:37:59 +02:00
Christoph Auer
7cad290ceb
Refactor test data, legacy usage and more
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-10 13:54:44 +02:00
Maxim Lysak
da0700f959
Fixes for docx backend
...
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com >
2024-10-09 16:52:44 +02:00
Christoph Auer
b5a27386c1
Merge from main, update OCR model and test cases
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-09 16:04:19 +02:00
Christoph Auer
0dfbd0b6fc
Update examples and test cases
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-09 15:20:27 +02:00
Michele Dolfi
f96ea86a00
feat: add options for choosing OCR engines ( #118 )
...
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com >
Co-authored-by: Peter Staar <taa@zurich.ibm.com >
2024-10-08 19:07:08 +02:00
Christoph Auer
c0447206af
Merge from main
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-08 14:42:33 +02:00
Maxim Lysak
1346843301
Improved docx parsing
...
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com >
2024-10-07 13:00:50 +02:00
Maxim Lysak
cefc34e8d8
Working on a first version of DOCX native backend
...
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com >
2024-10-04 18:19:40 +02:00
Maxim Lysak
2422f706a1
feat: new torch-based docling models ( #120 )
...
---------
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com >
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com >
2024-10-03 18:42:33 +02:00
Christoph Auer
1fa7cd9855
Fundamental refactoring for multi-format support
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-01 16:54:09 +02:00
Christoph Auer
2461b56b84
Import rewrites, adapt to changes in docling-core
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-09-27 09:21:15 +02:00
Christoph Auer
d6df76f90b
feat: Support tableformer model choice ( #90 )
...
* Support tableformer model choice
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update datamodel structure
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update docs
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Cleanup
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add test unit for table options
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Ensure import backwards-compatibility for PipelineOptions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update README
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Adjust parameters on custom_convert
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
* Update Dockerfile
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
2024-09-26 21:37:08 +02:00
Christoph Auer
ad2bd714d4
Update GT test files for pages
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-09-25 15:54:55 +02:00
Christoph Auer
48d8b7bf70
Sync test data from main
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-09-25 12:26:12 +02:00
Michele Dolfi
6a03c208ec
feat: add figure in markdown ( #98 )
...
* feat: add figures in markdown
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update to new docling-core and update test results with figures
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update with improved docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-24 17:28:23 +02:00
Christoph Auer
867e06f9f2
Merge from main
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-09-24 12:05:17 +02:00
Peter W. J. Staar
4794ce460a
fix: updated the render_as_doctags with the new arguments from docling-core ( #93 )
...
* updated the render_as_doctags with the new arguments from docling-core
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* ensuring that docling-core is >1.5.0 to accommodate the latest export-to-doctags parameters
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added the doctags tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the README
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fix poetry lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* Fix formatting problems
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* fixed the doctag export in docling/utils/export.py
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* propagate xsize and ysize
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2024-09-23 20:12:18 +02:00
Christoph Auer
abb6dddea8
Reorganize imports from docling-core
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-09-20 10:53:52 +02:00
Peter W. J. Staar
442443a102
fix: bumped the glm version and adjusted the tests ( #83 )
...
* bumped the glm version and adjusted the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the poetry lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fix hooks
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* fixed the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added the tests for tables
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-18 07:43:49 +02:00
Nikos Livathinos
fa9699fa3c
fix(tests): Adjust the test data to match the new version of LayoutPredictor ( #82 )
...
* fix(tests): Adjust the test data to match the new version of LayoutPredictor from docling-ibm-models
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
* chore: Update poetry to use `docling-ibm-models` at version `v1.2.0`
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
2024-09-17 15:50:35 +02:00
Peter W. J. Staar
98990784df
feat: add docling cli ( #75 )
...
* chore: add simple convert script
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted all
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted all
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added default arg
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* use typer for the docling CLI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* describe output when saving
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add tests for CLI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add export options
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-13 14:03:09 +02:00
Michele Dolfi
8aa476ccd3
test: improve typing definitions (part 1) ( #72 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-12 15:56:29 +02:00
Michele Dolfi
79932b7d69
test: check for stable obj_type ( #70 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-11 12:53:59 +02:00