feat: Optimize table extraction quality, add configuration options (#11)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2025-12-08 12:48:28 +00:00 · 2024-07-17 16:13:21 +02:00
parent 3e2ede8107
commit e9526bb11e
5 changed files with 87 additions and 27 deletions
--- a/README.md
+++ b/README.md
@@ -47,7 +47,9 @@ python examples/convert.py
 ```
 The output of the above command will be written to `./scratch`.

-### Enable or disable pipeline features
+### Adjust pipeline features
+
+**Control pipeline options**

 You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
 ```python
@@ -60,6 +62,23 @@ doc_converter = DocumentConverter(
 )
 ```

+**Control table extraction options**
+
+You can control whether table structure recognition maps the recognized structure back to PDF cells (the default) or uses the text cells from the structure prediction itself.
+Disabling cell matching can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
+
+
+```python
+
+pipeline_options = PipelineOptions(do_table_structure=True)
+pipeline_options.table_structure_options.do_cell_matching = False  # Use the text cells predicted by the table structure model
+
+doc_converter = DocumentConverter(
+    artifacts_path=artifacts_path,
+    pipeline_options=pipeline_options,
+)
+```
+
 ### Impose limits on the document size

 You can limit the file size and number of pages which should be allowed to process per document: