feat: Optimize table extraction quality, add configuration options (#11)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Christoph Auer
2024-07-17 16:13:21 +02:00
committed by GitHub
parent 3e2ede8107
commit e9526bb11e
5 changed files with 87 additions and 27 deletions

@@ -47,7 +47,9 @@ python examples/convert.py
```
The output of the above command will be written to `./scratch`.
### Enable or disable pipeline features
### Adjust pipeline features
**Control pipeline options**
You can control whether table structure recognition or OCR is performed through arguments passed to `DocumentConverter`:
```python
@@ -60,6 +62,23 @@ doc_converter = DocumentConverter(
)
```
**Control table extraction options**
You can control whether table structure recognition maps the recognized structure back to PDF cells (the default) or uses the text cells from the structure prediction itself.
Disabling the mapping can improve output quality if multiple columns in extracted tables are erroneously merged into one.
```python
pipeline_options = PipelineOptions(do_table_structure=True)
pipeline_options.table_structure_options.do_cell_matching = False  # Uses text cells predicted from table structure model
doc_converter = DocumentConverter(
    artifacts_path=artifacts_path,
    pipeline_options=pipeline_options,
)
```
### Impose limits on the document size
You can limit the file size and the number of pages that may be processed per document: