Commit Graph

11 Commits

Author SHA1 Message Date
Christoph Auer
1fa7cd9855 Fundamental refactoring for multi-format support
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-01 16:54:09 +02:00
Christoph Auer
cd06d89c2a Merge branch 'cau/experimental-format' of github.com:DS4SD/docling into cau/input-format-abstraction
2024-09-30 13:47:57 +02:00
Christoph Auer
2461b56b84 Import rewrites, adapt to changes in docling-core
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-27 09:21:15 +02:00
Christoph Auer
95c539579d [WIP] introducing extra backend abstraction and input formats
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-25 11:17:49 +02:00
Christoph Auer
abb6dddea8 Reorganize imports from docling-core
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-20 10:53:52 +02:00
Michele Dolfi
8aa476ccd3 test: improve typing definitions (part 1) (#72)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-12 15:56:29 +02:00
Christoph Auer
a294b7e64a feat: Page-level error reporting from PDF backend, introduce PARTIAL_SUCCESS status (#47)
* Put safety-checks for failed parse of pages

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Introduce page-level error checks

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Bump to docling-parse 1.1.1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Introduce page-level error checks

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-23 16:18:41 +02:00
Christoph Auer
a8c6b29a67 feat: Upgrade docling-parse PDF backend and interface to use page-by-page parsing (#44)
* Use docling-parse page-by-page

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Propagate document_hash to PDF backends, use docling-parse 1.0.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Upgrade lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* repin after more packages on pypi

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-22 13:49:37 +02:00
Christoph Auer
e94d317c02 feat: Add adaptive OCR, factor out treatment of OCR areas and cell filtering (#38)
* Introduce adaptive OCR

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Factor out BaseOcrModel, add docling-parse backend tests, fixes

* Make easyocr default dep

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-20 15:28:03 +02:00
Michele Dolfi
794b20a50a fix: type of path_or_stream in PdfDocumentBackend (#28)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-07 17:20:44 +02:00
Christoph Auer
e2d996753b Initial commit
2024-07-15 09:42:42 +02:00