feat: pdf backend, table mode as options and artifacts path (#203)

* feat: add more options in the CLI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update CLI docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* expose artifacts-path as argument

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Michele Dolfi
2024-11-04 14:26:05 +01:00
committed by GitHub
parent af323c04ef
commit 40ad987303
3 changed files with 63 additions and 26 deletions

@@ -32,30 +32,37 @@ Here are the available options as of this writing (for an up-to-date listing, ru
```console
$ docling --help
Usage: docling [OPTIONS] source
╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * input_sources source PDF files to convert. Can be local file / directory paths or URL. [default: None] │
│ [required] │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --from [docx|pptx|html|image|pdf] Specify input formats to convert from.
│ Defaults to all formats.
│ [default: None]
│ --to [md|json|text|doctags] Specify output formats. Defaults to
│ Markdown.
│ [default: None]
│ --ocr --no-ocr If enabled, the bitmap content will be
│ processed using OCR.
│ [default: ocr]
│ --ocr-engine [easyocr|tesseract_cli|tesseract] The OCR engine to use. [default: easyocr]
--abort-on-error --no-abort-on-error If enabled, the bitmap content will be
processed using OCR.
│ [default: no-abort-on-error]
│ --output PATH Output directory where results are saved.
[default: .]
--version Show version information.
│ --help Show this message and exit.
│ --from [docx|pptx|html|image|pdf|asciidoc|md] Specify input formats to convert from. │
Defaults to all formats. │
[default: None] │
│ --to [md|json|text|doctags] Specify output formats. Defaults to │
Markdown. │
[default: None] │
│ --ocr --no-ocr If enabled, the bitmap content will be │
processed using OCR. │
[default: ocr] │
│ --ocr-engine [easyocr|tesseract_cli|tesseract] The OCR engine to use.
[default: easyocr]
--pdf-backend [pypdfium2|dlparse_v1|dlparse_v2] The PDF backend to use.
[default: dlparse_v1]
│ --table-mode [fast|accurate] The mode to use in the table structure
model.
[default: fast]
│ --abort-on-error --no-abort-on-error If enabled, the bitmap content will be
│ processed using OCR. │
│ [default: no-abort-on-error] │
│ --output PATH Output directory where results are │
│ saved. │
│ [default: .] │
│ --version Show version information. │
│ --help Show this message and exit. │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
</details>
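
To illustrate how the two new options combine, here is a minimal sketch that assembles a `docling` command line from Python. The helper `build_docling_command` and the file name `report.pdf` are hypothetical and not part of docling; only the flag names, their allowed values, and their defaults are taken from the help listing above.

```python
import shlex

# Allowed values copied from the help listing above.
PDF_BACKENDS = ("pypdfium2", "dlparse_v1", "dlparse_v2")
TABLE_MODES = ("fast", "accurate")


def build_docling_command(source, pdf_backend="dlparse_v1", table_mode="fast", output="."):
    """Return an argv list for the docling CLI.

    Hypothetical helper for illustration only; it is not part of docling.
    Defaults mirror the ones shown in the help listing.
    """
    if pdf_backend not in PDF_BACKENDS:
        raise ValueError(f"unknown PDF backend: {pdf_backend!r}")
    if table_mode not in TABLE_MODES:
        raise ValueError(f"unknown table mode: {table_mode!r}")
    return [
        "docling",
        "--pdf-backend", pdf_backend,
        "--table-mode", table_mode,
        "--output", output,
        source,
    ]


if __name__ == "__main__":
    # Assumed input file name; replace with a real path or URL.
    cmd = build_docling_command("report.pdf", pdf_backend="dlparse_v2", table_mode="accurate")
    # Print the command as a shell-quoted string instead of running it.
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` once docling is installed; validating the values up front simply surfaces a typo before the CLI is invoked.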