draft for model and stages redesign

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
This commit is contained in:
Michele Dolfi
2025-08-21 10:13:17 +02:00
parent 8996d612aa
commit cc66773890

docs/adr/adr-model-and-stages.md (new file)

@@ -0,0 +1,254 @@
# Stages and model runtimes
The current architecture mixes the model runtime (framework, inline/remote execution, etc.) with the stage definition.
The actual choice is made via `kind` fields in the respective `..Options` objects.
This is now leading to duplicated runtime logic: for example, we have two implementations for running vision models with transformers, two implementations for running via API inference servers, etc.
Requirements for the proposed changes:
1. make more code reusable
2. provide easy presets
3. allow running custom (model) choices without code changes, e.g. a different model that can do the same task
4. make presets discoverable as plugins, which can also be third-party contributions (see the discovery sketch after this list)
5. plugins should allow easy selection in clients (CLI, APIs, etc.)
TODO:
- [ ] table processing example
## Proposal
### Generic model runtimes
Model runtimes are as generic as possible (although some duplication will likely remain).
1. they operate only on basic objects like PIL images and expose only an API for batch predictions
2. the prompt is left out of the model runtime, so that runtimes can be reused
3. model runtimes are preferably not bound to a specific model (but they could be, if very specific)
4. model runtimes may still do some internal pre-/post-processing, but it should be limited to model internals, e.g. normalizing images to RGB.
Open questions:
a. should `__init__` load the model, or do we prefer lazy loading? (see the lazy-loading sketch after the code block below)
```py
from typing import Iterable, List, Literal, Optional, Type

from PIL.Image import Image as PILImage
from pydantic import BaseModel

# BaseVlmOptions, TransformersModelType, TransformersPromptStyle and
# AcceleratorDevice refer to the existing docling types.


class BaseModelOptions(BaseModel):
    kind: str


#####
class VisionOpenAILikeApi:
    def __init__(self, options):
        ...

    def predict_batch(self, images: Iterable[PILImage], prompt: str) -> Iterable[...]:
        ...

    @classmethod
    def get_options_type(cls) -> Type[BaseModelOptions]:
        return VisionOpenAILikeApiOptions


#####
class VisionHfTransformersOptions(BaseVlmOptions):
    kind: Literal["vision_hf_transformers"] = "vision_hf_transformers"
    repo_id: str
    trust_remote_code: bool = False
    load_in_8bit: bool = True
    llm_int8_threshold: float = 6.0
    quantized: bool = False
    transformers_model_type: TransformersModelType = TransformersModelType.AUTOMODEL
    transformers_prompt_style: TransformersPromptStyle = TransformersPromptStyle.CHAT
    torch_dtype: Optional[str] = None
    supported_devices: List[AcceleratorDevice] = [
        AcceleratorDevice.CPU,
        AcceleratorDevice.CUDA,
        AcceleratorDevice.MPS,
    ]
    use_kv_cache: bool = True
    max_new_tokens: int = 4096


class VisionHfTransformers:
    def predict_batch(self, images: Iterable[PILImage], prompt: str) -> Iterable[...]:
        ...


#####
class VisionMlx:
    def predict_batch(self, images: Iterable[PILImage], prompt: str) -> Iterable[...]:
        ...


#####
class VisionVllm:
    def predict_batch(self, images: Iterable[PILImage], prompt: str) -> Iterable[...]:
        ...
```
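For open question (a), a minimal sketch of lazy loading via `functools.cached_property`; the loader body is a hypothetical example, not the actual implementation:
```py
from functools import cached_property


class VisionHfTransformers:
    def __init__(self, options: VisionHfTransformersOptions):
        self.options = options  # cheap: no weights are loaded here

    @cached_property
    def _model(self):
        # Hypothetical loader: the heavy weights are materialized only on
        # first access, i.e. on the first predict_batch() call.
        from transformers import AutoModelForVision2Seq

        return AutoModelForVision2Seq.from_pretrained(self.options.repo_id)

    def predict_batch(self, images, prompt):
        model = self._model  # triggers loading on the first call
        ...
```
With eager loading in `__init__` instead, instantiating a stage would immediately pay the full model load cost, even if the stage is never used.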
### Model options and instances
```py
class BaseModelOptions(BaseModel):
    kind: str  # needed for the options-type-to-model factory
    name: str  # needed to resolve a name (e.g. from the CLI) to an options instance


class VisionOpenAILikeApiOptions(BaseModelOptions):
    kind: Literal["vision_openailike_api"] = "vision_openailike_api"
    api_url: str
    model_name: str


# Instances
QWEN_VL_OLLAMA = VisionOpenAILikeApiOptions(
    name="qwen_vl_ollama",
    api_url="...",
    model_name="qwen_vl..",
)
SMOLDOCLING_LMSTUDIO = VisionOpenAILikeApiOptions(
    name="smoldocling_lms",
    api_url="...",
    model_name="smoldocling..",
)
SMOLDOCLING_MLX = VisionHfTransformersOptions(
    name="smoldocling_mlx",
    repo_id="ds4sd/smoldocling...",
)
SMOLDOCLING_VLLM = ...
```
### Model factories
Level 1: class names
- maps a `Type[BaseModelOptions]` to a model runtime class
- no enum of kinds/names, because these options have mandatory arguments (`api_url`, `repo_id`, etc.)

Level 2: instance names
- maps the `name` of a preset instance to its options
- exposes an enum of all names, to be used in the CLI, etc.
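A minimal sketch of the two factory levels; the class and method names are illustrative, not an existing docling API:
```py
class ModelFactory:
    def __init__(self):
        # Level 1: options type -> model runtime class
        self._runtimes: dict = {}
        # Level 2: preset name -> fully instantiated options
        self._presets: dict = {}

    def register_runtime(self, runtime_cls) -> None:
        self._runtimes[runtime_cls.get_options_type()] = runtime_cls

    def register_preset(self, options: BaseModelOptions) -> None:
        self._presets[options.name] = options

    def create(self, options: BaseModelOptions):
        # Level 1 lookup: pick the runtime registered for this options type
        return self._runtimes[type(options)](options)

    def create_from_name(self, name: str):
        # Level 2 lookup: resolve the preset, then fall through to level 1
        return self.create(self._presets[name])

    def preset_names(self) -> list:
        # Basis for the enum of names exposed to the CLI, etc.
        return sorted(self._presets)
```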
### Stage definition
Stages are responsible for
1. **Pre-processing** the input (DoclingDocument, Page batch, etc.) into the more generic format that the models can consume
2. **Post-processing** the model output into the format in which it should be saved back

The stage options link together:
1. the stage itself and its own settings, e.g. the model prompt to use
2. `model_options`, used to get the model from the factory
3. `model_interpreter_options`, used to interpret the raw model response; this depends on the use case, so it is independent of the model runtime (see the interpreter sketch after the code block below)
- each stage (or only the ones needing it) could define its own factory, but a shared one should also be enough.
```py
## Base classes (options, etc.)
class BaseStageOptions(BaseModel):
    kind: str
    model_options: BaseModelOptions
    model_interpreter_options: ...  # in the base class; exact type still open


## Helper base classes
class BaseDocItemImageEnrichment:
    labels: list[DocItemLabel]  # ..or with a simple filter callable (like now)
    image_scale: float
    expansion_factor: float
    ...


## Actual stages
class PictureDescriptionOptions(BaseStageOptions):
    kind: Literal["picture_description"] = "picture_description"
    model_options: BaseModelOptions = ...  # default choice, fully instantiated
    ...  # other options


class PictureDescription(BaseDocItemImageEnrichment):
    labels = [DocItemLabel.PICTURE]
    ...

    def __init__(self, options, ...):
        ...


class CodeUnderstanding(BaseDocItemImageEnrichment):
    labels = [DocItemLabel.CODE]
    ...

    def __init__(self, options, ...):
        ...


class VisionConvertOptions(BaseStageOptions):
    kind: Literal["vision_converter"] = "vision_converter"
    model_options: BaseModelOptions = ...  # default choice, fully instantiated


class VisionConvert:
    """Equivalent to the VlmModel now, for DocTags or Markdown output."""
    ...

    def __init__(self, options, ...):
        ...
```
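Point 3 above could look roughly as follows; the interpreter classes and the `interpret` signature are assumptions for illustration:
```py
class BaseModelInterpreterOptions(BaseModel):
    kind: str


class PlainTextInterpreterOptions(BaseModelInterpreterOptions):
    kind: Literal["plain_text"] = "plain_text"
    strip_whitespace: bool = True


class PlainTextInterpreter:
    """Turns the raw model response into the value the stage saves back.

    The interpretation is use-case specific (e.g. plain description text
    vs. DocTags parsing), hence decoupled from the model runtime.
    """

    def __init__(self, options: PlainTextInterpreterOptions):
        self.options = options

    def interpret(self, raw_response: str) -> str:
        if self.options.strip_whitespace:
            return raw_response.strip()
        return raw_response
```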
### Usage
#### SDK
```py
# Raw inputs
pipeline_options.STAGE_options = PictureDescriptionOptions(
    model_options=VisionOpenAILikeApiOptions(
        api_url="my fancy url",
        model_name="qwen_vl",
    ),
    prompt="Write a few sentences which describe this image in detail. If it is a diagram, also provide some key numeric highlights.",
)

# Using presets
pipeline_options.STAGE_options = PictureDescriptionOptions(
    model_options=model_specs.GRANITE_VISION_LMSTUDIO,
    # there will be a default prompt (but not specific to the model!)
)
```
#### CLI
We could expose the options as `--stage-NAME-X` or directly as `--NAME-X`.
```sh
# Default options
docling --enrich-picture-description
# Change model (only from preset)
docling --enrich-picture-description \
--stage-picture-description-model=qwen_vl \
--stage-picture-description-prompt="..."
```
### Open points
Some minor open questions:
1. Should we move the accelerator options into the `model_options`?
2. Where should the `batch_size` live?
### Weaknesses
Should we consider storing presets of the full stage options? Would this quickly become too complex?
## Status
Proposed