1. Your model must have a base `MyMultimodalModel` class that handles multimodal fusion without a language modeling head, and a separate generative class that adds the head on top. The base model needs to implement a `get_image_features()` method that takes image pixel values and returns encoded outputs. These are later merged with the language embeddings, so they should not require any further postprocessing. The first dimension of the returned features must match the number of input images. If the vision encoder returns variable-length outputs (e.g., patch-based), you can return a list of 2D tensors of shape `(image_seq_len, image_dim)`, one per image. A minimal sketch of this split is shown below.
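   To make the base/generative split concrete, here is a minimal sketch of the two classes. Only the `get_image_features()` contract comes from the requirement above; the class names, the projector, and the constructor arguments are illustrative assumptions, not the actual implementation of any particular model.

   ```python
   import torch
   from torch import nn


   class MyMultimodalModel(nn.Module):
       """Base model: fuses vision and text, no language modeling head."""

       def __init__(self, vision_encoder: nn.Module, text_model: nn.Module,
                    image_dim: int, hidden_size: int):
           super().__init__()
           self.vision_encoder = vision_encoder
           self.text_model = text_model
           # Projects vision features into the language embedding space, so the
           # output of `get_image_features()` needs no further postprocessing.
           self.multimodal_projector = nn.Linear(image_dim, hidden_size)

       def get_image_features(self, pixel_values: torch.Tensor) -> torch.Tensor:
           """Encode images into features ready to merge with text embeddings.

           `pixel_values` has shape `(num_images, channels, height, width)`;
           the returned tensor has shape `(num_images, image_seq_len,
           hidden_size)`, so its first dimension matches the number of input
           images. A variable-length (e.g. patch-based) encoder would instead
           return a list of `(image_seq_len, image_dim)` tensors, one per image.
           """
           image_outputs = self.vision_encoder(pixel_values)
           return self.multimodal_projector(image_outputs)


   class MyMultimodalModelForConditionalGeneration(nn.Module):
       """Generative model: wraps the base model and adds an LM head on top."""

       def __init__(self, model: MyMultimodalModel, hidden_size: int, vocab_size: int):
           super().__init__()
           self.model = model
           self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

       def get_image_features(self, pixel_values: torch.Tensor) -> torch.Tensor:
           # Delegate to the base model so both classes expose the same API.
           return self.model.get_image_features(pixel_values)
   ```

   The point of the split is that the base class can be reused for tasks that do not need generation (e.g., feature extraction), while the generative class only adds the head and delegates everything else.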