feat: export document pages as multimodal output (#54)

* feat: export document pages as multimodal output

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* create a single parquet output

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add loading into HF datasets library

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* renaming

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* cleanup

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
This commit is contained in:
Michele Dolfi
2024-09-03 15:05:35 +02:00
committed by GitHub
parent 69e5d951a3
commit 1de2e4f924
5 changed files with 1025 additions and 7 deletions

View File

@@ -71,6 +71,15 @@ class BoundingBox(BaseModel):
return out_bbox
def normalized(self, page_size: PageSize) -> "BoundingBox":
out_bbox = copy.deepcopy(self)
out_bbox.l /= page_size.width
out_bbox.r /= page_size.width
out_bbox.t /= page_size.height
out_bbox.b /= page_size.height
return out_bbox
def as_tuple(self):
if self.coord_origin == CoordOrigin.TOPLEFT:
return (self.l, self.t, self.r, self.b)