feat: Upgrade docling-parse PDF backend and interface to use page-by-page parsing (#44)

* Use docling-parse page-by-page

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Propagate document_hash to PDF backends, use docling-parse 1.0.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Upgrade lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* repin after more packages on pypi

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Christoph Auer
2024-08-22 13:49:37 +02:00
committed by GitHub
parent f7c50c8b0e
commit a8c6b29a67
8 changed files with 73 additions and 51 deletions

@@ -141,6 +141,8 @@ class DocumentConverter:
        start_doc_time = time.time()
        converted_doc = ConvertedDocument(input=in_doc)
        _log.info(f"Processing document {in_doc.file.name}")
        if not in_doc.valid:
            converted_doc.status = ConversionStatus.FAILURE
            return converted_doc
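
The hunk above returns early with a FAILURE status when the input document is not valid, before any parsing happens. A minimal, self-contained sketch of that guard pattern follows. The `InputDocument`, `ConvertedDocument`, and `ConversionStatus` definitions here are simplified stand-ins written for illustration, not the real docling classes, and the `PENDING` and `SUCCESS` members and the `process_document` function name are assumptions.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ConversionStatus(Enum):
    # Hypothetical status values; only FAILURE appears in the hunk above.
    PENDING = auto()
    FAILURE = auto()
    SUCCESS = auto()


@dataclass
class InputDocument:
    # Simplified stand-in: the real input object carries far more state.
    name: str
    valid: bool


@dataclass
class ConvertedDocument:
    input: InputDocument
    status: ConversionStatus = ConversionStatus.PENDING


def process_document(in_doc: InputDocument) -> ConvertedDocument:
    converted_doc = ConvertedDocument(input=in_doc)

    # Guard: an invalid input is marked as failed and returned immediately,
    # so no parsing work is attempted on it.
    if not in_doc.valid:
        converted_doc.status = ConversionStatus.FAILURE
        return converted_doc

    # Placeholder for the actual page-by-page processing.
    converted_doc.status = ConversionStatus.SUCCESS
    return converted_doc


if __name__ == "__main__":
    print(process_document(InputDocument("broken.pdf", valid=False)).status)
    print(process_document(InputDocument("ok.pdf", valid=True)).status)
```

Returning the partially filled result object, rather than raising, lets a caller that iterates over many documents collect per-document statuses without wrapping each call in exception handling.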