# MultiModal / Vision Language Models (BETA)

## Supported Models

- Mllama
- Llama4
- Pixtral
- Llava-1.5
- Mistral-Small-3.1
- Gemma-3
- Gemma-3n
- Qwen2-VL
- Qwen2.5-VL

## Usage

Multimodal support is currently limited and does not have full feature parity with text-only training.

Here are the hyperparameters you'll need to set to finetune a multimodal model:
```yaml
processor_type: AutoProcessor
skip_prepare_dataset: true

remove_unused_columns: false  # leave columns in place, as they are needed to handle image embeddings during training
sample_packing: false  # not yet supported with multimodal

chat_template: # see the per-model sections below

# example dataset
datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft
    type: chat_template
    split: train[:1%]
    field_messages: messages

# (optional) if doing lora, only finetune the Language model,
# leave the vision model and vision tower frozen
# load_in_8bit: true
adapter: lora
lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'

# (optional) if you want to resize images to a set size
image_size: 512
image_resize_algorithm: bilinear
```
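The `lora_target_modules` regex above restricts LoRA to the language model's projection layers, leaving the vision tower frozen. As a quick sanity check, here is a minimal sketch (the module names are illustrative examples and vary by architecture) of which parameter names that pattern selects:

```python
import re

# The same regex as in the config above.
pattern = r"model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj"

# Hypothetical module names, following common HF naming for vision-language models.
candidates = [
    "model.language_model.layers.0.self_attn.q_proj",  # language model -> gets LoRA
    "model.language_model.layers.12.mlp.up_proj",      # language model -> gets LoRA
    "model.vision_tower.layers.0.self_attn.q_proj",    # vision tower   -> stays frozen
]

for name in candidates:
    status = "LoRA" if re.search(pattern, name) else "frozen"
    print(f"{name}: {status}")
```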
Please see the `examples/` folder for full configs.

Some of our chat templates have been extended to support broader dataset types. This should not break any existing configs.
### Mllama

```yaml
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
chat_template: llama3_2_vision
```

### Llama4

```yaml
base_model: meta-llama/Llama-4-Scout-17B-16E-Instruct
chat_template: llama4
```

### Pixtral

```yaml
base_model: mistralai/Pixtral-12B-2409
chat_template: pixtral
```

### Llava-1.5

```yaml
base_model: llava-hf/llava-1.5-7b-hf
chat_template: llava
```

### Mistral-Small-3.1

```yaml
base_model: mistralai/Mistral-Small-3.1-24B-Instruct-2503
chat_template: mistral_v7_tekken
```
### Gemma-3

The Gemma3-1B model is text-only, so please train it as a regular text model. For the multimodal 4B/12B/27B models, use the following config:

```yaml
base_model: google/gemma-3-4b-it
chat_template: gemma3
```
### Gemma-3n

The model's initial loss and grad norm will be very high. We suspect this to be due to the Conv in the vision layers.

Please make sure to install `timm` via `pip3 install timm==1.0.17`.

```yaml
base_model: google/gemma-3n-E2B-it
chat_template: gemma3n
```
### Qwen2-VL

```yaml
base_model: Qwen/Qwen2-VL-7B-Instruct
chat_template: qwen2_vl
```

### Qwen2.5-VL

```yaml
base_model: Qwen/Qwen2.5-VL-7B-Instruct
chat_template: qwen2_vl  # same as qwen2-vl
```
## Dataset Format

For multi-modal datasets, we adopt an extended `chat_template` format similar to OpenAI's Message format.

- A message is a list of `role` and `content`.
- `role` can be `system`, `user`, `assistant`, etc.
- `content` is a list of `type` and (`text`, `image`, `path`, `url`, `base64`, or `audio`).
### Image

For backwards compatibility:

- If the dataset has an `images` or `image` column of `list[Image]`, it will be appended to the first `content` list as `{"type": "image", "image": ...}`. However, if the content already has a `{"type": "image"}` entry but no `image` key, the `image` key will be set on that entry.
- If `content` is a string, it will be converted to a list with `type` as `text`.
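For illustration only (this is a sketch, not Axolotl's actual implementation), the backwards-compatibility handling described above behaves roughly like this:

```python
# Illustrative sketch -- not Axolotl's actual code -- of the rules above.
def normalize_example(example):
    messages = example["messages"]

    # String content is wrapped as a single text entry.
    for msg in messages:
        if isinstance(msg["content"], str):
            msg["content"] = [{"type": "text", "text": msg["content"]}]

    # An `images`/`image` column is merged into the first message's content:
    # existing {"type": "image"} entries without an `image` key are filled first,
    # and any remaining images are appended.
    images = example.get("images") or example.get("image") or []
    first_content = messages[0]["content"]
    placeholders = [c for c in first_content
                    if c.get("type") == "image" and "image" not in c]
    for entry, img in zip(placeholders, images):
        entry["image"] = img
    for img in images[len(placeholders):]:
        first_content.append({"type": "image", "image": img})
    return example
```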
For image loading, you can use the following keys within `content` alongside `"type": "image"`:

- `"path": "/path/to/image.jpg"`
- `"url": "https://example.com/image.jpg"`
- `"base64": "..."`
- `"image": PIL.Image`
### Audio

For audio loading, you can use the following keys within `content` alongside `"type": "audio"`:

- `"path": "/path/to/audio.mp3"`
- `"url": "https://example.com/audio.mp3"`
- `"audio": np.ndarray`

You may need to install `librosa` via `pip3 install librosa==0.11.0`.
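As a sketch of the in-memory option (the path is a placeholder), a waveform loaded with `librosa` can be passed directly:

```python
import librosa

# librosa.load returns (waveform: np.ndarray, sample_rate); sr=None keeps the file's native rate.
waveform, sample_rate = librosa.load("/path/to/audio.mp3", sr=None)
audio_entry = {"type": "audio", "audio": waveform}
```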
### Example
Here is an example of a multi-modal dataset:
```json
[
  {
    "messages": [
      {
        "role": "system",
        "content": [
          {"type": "text", "text": "You are a helpful assistant."}
        ]
      },
      {
        "role": "user",
        "content": [
          {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
          {"type": "text", "text": "Describe this image in detail."}
        ]
      },
      {
        "role": "assistant",
        "content": [
          {"type": "text", "text": "The image is a bee."}
        ]
      }
    ]
  }
]
```
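If you keep such a dataset in a local JSON file, one way to load it with the Hugging Face `datasets` library is shown below (a sketch; the filename is a placeholder):

```python
from datasets import load_dataset

# "multimodal.json" is a placeholder for a local file in the format above.
ds = load_dataset("json", data_files="multimodal.json", split="train")
print(ds[0]["messages"][1]["content"][0]["type"])  # -> "image"
```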
## FAQ

### `PIL.UnidentifiedImageError: cannot identify image file ...`

`PIL` could not load the file retrieved from the `url` via `requests`. Please check the URL for typos. Another possible reason is that the request was blocked by the server.
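To narrow down the cause, you can fetch the URL yourself and check that the response decodes as an image; a minimal sketch (the URL is a placeholder):

```python
import io

import requests
from PIL import Image

url = "https://example.com/image.jpg"  # placeholder
resp = requests.get(url, timeout=10)
resp.raise_for_status()  # fails loudly if the server blocked or rejected the request
img = Image.open(io.BytesIO(resp.content))  # raises UnidentifiedImageError if not an image
img.verify()  # extra integrity check on the downloaded file
```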