mirror of
https://github.com/DS4SD/docling.git
synced 2025-12-08 12:48:28 +00:00
feat: Optimize table extraction quality, add configuration options (#11)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
This commit is contained in:
21
README.md
21
README.md
@@ -47,7 +47,9 @@ python examples/convert.py
|
||||
```
|
||||
The output of the above command will be written to `./scratch`.
|
||||
|
||||
### Enable or disable pipeline features
|
||||
### Adjust pipeline features
|
||||
|
||||
**Control pipeline options**
|
||||
|
||||
You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
|
||||
```python
|
||||
@@ -60,6 +62,23 @@ doc_converter = DocumentConverter(
|
||||
)
|
||||
```
|
||||
|
||||
**Control table extraction options**
|
||||
|
||||
You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself.
|
||||
This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
|
||||
|
||||
|
||||
```python
|
||||
|
||||
pipeline_options = PipelineOptions(do_table_structure=True)
|
||||
pipeline_options.table_structure_options.do_cell_matching = False # Uses text cells predicted from table structure model
|
||||
|
||||
doc_converter = DocumentConverter(
|
||||
artifacts_path=artifacts_path,
|
||||
pipeline_options=pipeline_options,
|
||||
)
|
||||
```
|
||||
|
||||
### Impose limits on the document size
|
||||
|
||||
You can limit the file size and number of pages which should be allowed to process per document:
|
||||
|
||||
Reference in New Issue
Block a user