diff --git a/docs/examples/gpu_standard_pipeline.py b/docs/examples/gpu_standard_pipeline.py
index 12188e25..5806408a 100644
--- a/docs/examples/gpu_standard_pipeline.py
+++ b/docs/examples/gpu_standard_pipeline.py
@@ -1,3 +1,20 @@
+# %% [markdown]
+#
+# What this example does
+# - Runs a conversion with the recommended GPU setup for the standard pipeline
+#
+# Requirements
+# - Python 3.9+
+# - Install Docling: `pip install docling`
+#
+# How to run
+# - `python docs/examples/gpu_standard_pipeline.py`
+#
+# This example is part of a set of GPU optimization strategies. Read more about them in [GPU support](../../usage/gpu/).
+#
+# ## Example code
+# %%
+
 import datetime
 import logging
 import time
diff --git a/docs/examples/gpu_vlm_pipeline.py b/docs/examples/gpu_vlm_pipeline.py
index 254fb571..ce657065 100644
--- a/docs/examples/gpu_vlm_pipeline.py
+++ b/docs/examples/gpu_vlm_pipeline.py
@@ -1,3 +1,32 @@
+# %% [markdown]
+#
+# What this example does
+# - Runs a conversion with the recommended GPU setup for VLM models
+#
+# Requirements
+# - Python 3.10+
+# - Install Docling: `pip install docling`
+# - Install vLLM: `pip install vllm`
+#
+# How to run
+# - `python docs/examples/gpu_vlm_pipeline.py`
+#
+# This example is part of a set of GPU optimization strategies. Read more about them in [GPU support](../../usage/gpu/).
+#
+# ### Start the model with vLLM
+#
+# ```console
+# vllm serve ibm-granite/granite-docling-258M \
+#   --host 127.0.0.1 --port 8000 \
+#   --max-num-seqs 512 \
+#   --max-num-batched-tokens 8192 \
+#   --enable-chunked-prefill \
+#   --gpu-memory-utilization 0.9
+# ```
+#
+# ## Example code
+# %%
+
 import datetime
 import logging
 import time
diff --git a/docs/usage/gpu.md b/docs/usage/gpu.md
index 012f6568..eda52772 100644
--- a/docs/usage/gpu.md
+++ b/docs/usage/gpu.md
@@ -126,18 +126,38 @@ TBA.
 ## Performance results
 
-Test data:
-- Number of pages: 192
-- Number of tables: 95
+### Test data
 
-Test infrastructure:
-- Instance type: `g6e.2xlarge`
-- CPU: 8 vCPUs, AMD EPYC 7R13
-- RAM: 64GB
-- GPU: NVIDIA L40S 48GB
-- CUDA Version: 13.0, Driver Version: 580.95.05
+| | PDF doc | [ViDoRe V3 HR](https://huggingface.co/datasets/vidore/vidore_v3_hr) |
+| - | - | - |
+| Num docs | 1 | 14 |
+| Num pages | 192 | 1110 |
+| Num tables | 95 | 258 |
+| Format type | PDF | Parquet of images |
 
-| Pipeline | Page efficiency |
-| - | - |
-| Standard - Inline | 3.1 pages/second |
-| VLM - Inference server (GraniteDocling) | 2.4 pages/second |
+
+### Test infrastructure
+
+| | g6e.2xlarge | RTX 5090 | RTX 5070 |
+| - | - | - | - |
+| Description | AWS instance `g6e.2xlarge` | Linux bare metal machine | Windows 11 bare metal machine |
+| CPU | 8 vCPUs, AMD EPYC 7R13 | 16 vCPUs, AMD Ryzen 7 9800 | 16 vCPUs, AMD Ryzen 7 9800 |
+| RAM | 64GB | 128GB | 64GB |
+| GPU | NVIDIA L40S 48GB | NVIDIA GeForce RTX 5090 | NVIDIA GeForce RTX 5070 |
+| CUDA Version | 13.0, driver 580.95.05 | 13.0, driver 580.105.08 | 13.0, driver 581.57 |
+
+
+### Results
+
+<table>
+  <thead>
+    <tr><th rowspan="2">Pipeline</th><th colspan="2">g6e.2xlarge</th><th colspan="2">RTX 5090</th><th colspan="2">RTX 5070</th></tr>
+    <tr><th>PDF doc</th><th>ViDoRe V3 HR</th><th>PDF doc</th><th>ViDoRe V3 HR</th><th>PDF doc</th><th>ViDoRe V3 HR</th></tr>
+  </thead>
+  <tbody>
+    <tr><td>Standard - Inline (no OCR)</td><td>3.1 pages/second</td><td>-</td><td>7.9 pages/second<br/>[cpu-only]* 1.5 pages/second</td><td>-</td><td>4.2 pages/second<br/>[cpu-only]* 1.2 pages/second</td><td>-</td></tr>
+    <tr><td>VLM - Inference server (GraniteDocling)</td><td>2.4 pages/second</td><td>-</td><td>3.8 pages/second</td><td>3.6-4.5 pages/second</td><td>-</td><td>-</td></tr>
+  </tbody>
+</table>
+
+_* cpu-only timing computed with 16 PyTorch threads._
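The pages/second figures above are wall-clock throughput over a known page count. For readers reproducing such numbers, here is a minimal measurement sketch; the `convert` callable and the page counts are placeholders, not part of the Docling API:

```python
import time


def pages_per_second(convert, num_pages):
    """Return wall-clock throughput for a zero-argument conversion callable.

    `convert` runs the whole conversion once; `num_pages` is the total
    number of pages it processed.
    """
    start = time.perf_counter()
    convert()  # run the conversion being benchmarked
    elapsed = time.perf_counter() - start
    return num_pages / elapsed


# Dummy workload standing in for a real conversion: 192 pages in ~0.1 s.
rate = pages_per_second(lambda: time.sleep(0.1), 192)
```

For the cpu-only rows, the 16-thread setting mentioned in the footnote is the kind of limit typically applied with `torch.set_num_threads(16)` before the pipeline runs.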