MultiModal / Vision Language Models (BETA)

Supported Models

  • Mllama
  • Llama4
  • Pixtral
  • Llava-1.5
  • Mistral-Small-3.1
  • Gemma-3
  • Gemma-3n
  • Qwen2-VL
  • Qwen2.5-VL

Usage

Multimodal support is limited and does not yet have full feature parity with text-only training.

Here are the hyperparameters you'll need to set to finetune a multimodal model.

processor_type: AutoProcessor

skip_prepare_dataset: true
remove_unused_columns: false  # leave columns in place as they are needed to handle image embeddings during training
sample_packing: false  # not yet supported with multimodal

chat_template:  # see the model-specific sections below

# example dataset
datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft
    type: chat_template
    split: train[:1%]
    field_messages: messages

# (optional) if doing LoRA, finetune only the language model,
# leaving the vision model and vision tower frozen
# load_in_8bit: true
adapter: lora
lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'

# (optional) if you want to resize images to a set size
image_size: 512
image_resize_algorithm: bilinear

Please see the examples folder for full configs.

Warning

Some of our chat_templates have been extended to support broader dataset types. This should not break any existing configs.

Mllama

base_model: meta-llama/Llama-3.2-11B-Vision-Instruct

chat_template: llama3_2_vision

Llama4

base_model: meta-llama/Llama-4-Scout-17B-16E-Instruct

chat_template: llama4

Pixtral

base_model: mistralai/Pixtral-12B-2409

chat_template: pixtral

Llava-1.5

base_model: llava-hf/llava-1.5-7b-hf

chat_template: llava

Mistral-Small-3.1

base_model: mistralai/Mistral-Small-3.1-24B-Instruct-2503

chat_template: mistral_v7_tekken

Gemma-3

Tip

The Gemma3-1B model is text-only, so please train it as a regular text model.

For multi-modal 4B/12B/27B models, use the following config:

base_model: google/gemma-3-4b-it

chat_template: gemma3

Gemma-3n

Warning

The model’s initial loss and grad norm will be very high. We suspect this is due to the Conv in the vision layers.

Tip

Please make sure to install timm via pip3 install timm==1.0.17

base_model: google/gemma-3n-E2B-it

chat_template: gemma3n

Qwen2-VL

base_model: Qwen/Qwen2-VL-7B-Instruct

chat_template: qwen2_vl

Qwen2.5-VL

base_model: Qwen/Qwen2.5-VL-7B-Instruct

chat_template: qwen2_vl  # same as qwen2-vl

Dataset Format

For multi-modal datasets, we adopt an extended chat_template format similar to OpenAI’s Message format.

  • A message is an object with a role and a content field.
  • role can be system, user, assistant, etc.
  • content is a list of parts, each with a type and a matching payload key (text, image, path, url, base64, or audio).

Image

Note

For backwards compatibility:

  • If the dataset has an images or image column of list[Image], each image will be appended to the first message's content list as {"type": "image", "image": ...}. However, if that content already contains a {"type": "image"} entry without an image key, the image will be set on that entry instead.
  • If content is a string, it will be converted to a list containing a single {"type": "text"} entry. A sketch of this normalization follows.
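
For instance, under these rules, a hypothetical legacy row such as:

{"images": [PIL.Image], "messages": [{"role": "user", "content": "Describe this image in detail."}]}

would be treated as if it were:

{"messages": [{"role": "user", "content": [{"type": "text", "text": "Describe this image in detail."}, {"type": "image", "image": PIL.Image}]}]}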

For image loading, you can use the following keys within content alongside "type": "image":

  • "path": "/path/to/image.jpg"
  • "url": "https://example.com/image.jpg"
  • "base64": "..."
  • "image": PIL.Image

Audio

For audio loading, you can use the following keys within content alongside "type": "audio":

  • "path": "/path/to/audio.mp3"
  • "url": "https://example.com/audio.mp3"
  • "audio": np.ndarray
Tip

You may need to install librosa via pip3 install librosa==0.11.0.
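
For example, a user turn that mixes audio and text might look like the following (the path and prompt are placeholders):

{
    "role": "user",
    "content": [
        {"type": "audio", "path": "/path/to/audio.mp3"},
        {"type": "text", "text": "Transcribe this audio clip."}
    ]
}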

Example

Here is an example of a multi-modal dataset:

[
  {
    "messages": [
      {
        "role": "system",
        "content": [
          {"type": "text", "text": "You are a helpful assistant."}
        ]
      },
      {
        "role": "user",
        "content": [
          {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
          {"type": "text", "text": "Describe this image in detail."}
        ]
      },
      {
        "role": "assistant",
        "content": [
          {"type": "text", "text": "The image is a bee."}
        ]
      }
    ]
  }
]

FAQ

  1. PIL.UnidentifiedImageError: cannot identify image file ...

PIL could not identify the file downloaded from the url via requests. Check the URL for typos. Alternatively, the request may have been blocked by the server.