We introduce Prismer, a data- and parameter-efficient vision-language model that leverages an ensemble of diverse, pre-trained task-specific experts. Prismer achieves fine-tuned and few-shot vision-language reasoning performance competitive with the current state of the art, whilst requiring up to two orders of magnitude less training data.
TMLR 2024
Large pre-trained models have demonstrated exceptional generalisation capabilities across a wide range of tasks. However, these capabilities come at a hefty cost in terms of computational resources required for training and inference, as well as the need for large amounts of training data. The problems in vision-language learning are arguably more challenging. This domain is a strict super-set of language processing, whilst also requiring extra skills unique to visual and multi-modal reasoning. A typical solution is to use a massive amount of image-text data to train one giant, monolithic model that learns to develop these task-specific skills from scratch, simultaneously, and within the same generic architecture.
Instead, we investigate an alternative approach to learn these skills and domain knowledge via distinct and separate sub-networks, referred to as "experts". As such, each expert can be optimised independently for a specific task, allowing for the use of domain-specific data and architectures that would not be feasible with a single large network. This leads to improved training efficiency, as the model can focus on integrating specialised skills and domain knowledge, rather than trying to learn everything at once, making it an effective way to scale down multi-modal learning.
To achieve this, we propose Prismer, a visually conditioned autoregressive text generation model, trained to make better use of diverse pre-trained domain experts for open-ended vision-language reasoning tasks. Prismer's key design elements include: i) powerful vision-only and language-only models for web-scale knowledge, and ii) multi-task vision experts encoding multiple types of visual information, including low-level vision signals such as depth, and high-level vision signals such as instance and semantic labels, as a form of auxiliary knowledge. All expert models are individually pre-trained and frozen, and are connected through lightweight trainable components that comprise only roughly 20% of the total network parameters.
Prismer is an encoder-decoder transformer model that leverages a library of existing pre-trained experts. It consists of a vision encoder and an auto-regressive language decoder. The vision encoder takes an RGB image and its corresponding multi-task labels as input (e.g., depth, surface normal, segmentation labels, predicted from the frozen pre-trained experts), and outputs a sequence of RGB and multi-task features. The language decoder is then conditioned on these multi-task features via cross attention, and produces a sequence of text tokens.
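To make this data flow concrete, below is a minimal PyTorch-style sketch of the forward pass. All module and argument names (`vision_encoder`, `resampler`, `language_decoder`, `encoder_hidden_states`) are illustrative placeholders rather than the actual Prismer implementation.

```python
import torch

def prismer_forward(rgb, experts, vision_encoder, resampler, language_decoder, text_ids):
    """Sketch only. rgb: (B, 3, H, W) images; experts: dict of frozen task-expert callables."""
    with torch.no_grad():
        # Frozen experts predict image-like multi-task labels (e.g. depth, surface normals).
        task_labels = [expert(rgb) for expert in experts.values()]

    # The vision encoder turns the RGB image and each expert label into token sequences.
    rgb_tokens = vision_encoder(rgb)                                        # (B, N, D)
    task_tokens = torch.cat([vision_encoder(lbl) for lbl in task_labels], dim=1)

    # The resampler compresses all multi-task tokens into a small, fixed set of tokens.
    expert_tokens = resampler(task_tokens)                                  # (B, K, D)

    # The language decoder cross-attends to the visual features and predicts
    # the next text token autoregressively.
    visual_context = torch.cat([rgb_tokens, expert_tokens], dim=1)
    return language_decoder(text_ids, encoder_hidden_states=visual_context)
```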
Prismer is designed to fully leverage pre-trained experts whilst keeping the number of trainable parameters to a minimum. To do this, the majority of the network weights of the pre-trained experts are frozen to maintain the integrity of their learned knowledge and prevent catastrophic forgetting. To link the multi-task labels as well as the vision and language parts of Prismer, we insert two types of parameter-efficient trainable components:
Experts Resampler: The Experts Resampler learns a pre-defined number of latent input queries that cross-attend to a flattened embedding concatenated from all multi-task features, inspired by the Perceiver and the Flamingo model. The Resampler thereby compresses the multi-task features into a much smaller number of tokens, equal to the number of learned latent queries, as a form of auxiliary knowledge distillation.
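A minimal sketch of such a Perceiver-style cross-attention block in PyTorch is given below; the layer count, latent count, and width are illustrative defaults rather than the exact configuration used in Prismer.

```python
import torch
import torch.nn as nn

class ExpertsResampler(nn.Module):
    """Sketch of a Perceiver-style resampler: a fixed set of learned latent queries
    cross-attends to the concatenated multi-task features and returns exactly
    `num_latents` tokens, regardless of how many experts are used."""

    def __init__(self, dim=768, num_latents=64, num_layers=4, num_heads=12):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "norm": nn.LayerNorm(dim),
                "attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "ffn": nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                     nn.GELU(), nn.Linear(4 * dim, dim)),
            }) for _ in range(num_layers)
        ])

    def forward(self, expert_features):                       # (B, N_total, dim)
        x = self.latents.unsqueeze(0).expand(expert_features.size(0), -1, -1)
        for layer in self.layers:
            # Latent queries attend to the flattened multi-task features.
            attn_out, _ = layer["attn"](layer["norm"](x), expert_features, expert_features)
            x = x + attn_out
            x = x + layer["ffn"](x)
        return x                                               # (B, num_latents, dim)
```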
Adaptor: The Adaptor has an encoder-decoder design, which first down-projects the input features into a smaller dimension, applies a non-linearity, and then up-projects the features back to the original input dimension. Together with a residual connection, we initialise all adaptors with near-zero weights so that they approximate the identity function. Combined with a standard cross-attention block in the language decoder, the model is able to smoothly transition from the domain-specific vision-only and language-only backbones to a vision-language model during pre-training on paired image-text data.
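The adaptor itself is only a few lines. Below is a sketch assuming a simple linear bottleneck with a GELU non-linearity; the bottleneck dimension and initialisation scale are assumptions for illustration, not the paper's exact values.

```python
import torch.nn as nn

class Adaptor(nn.Module):
    """Sketch of a bottleneck adaptor: down-project, non-linearity, up-project,
    plus a residual connection. Near-zero initialisation of the up-projection
    makes the block start out as (approximately) the identity function."""

    def __init__(self, dim=768, bottleneck=96):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        # Near-zero weights so the frozen backbone is initially undisturbed.
        nn.init.normal_(self.up.weight, std=1e-4)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))
```

Because the up-projection starts near zero, each adapted layer initially behaves like the frozen backbone layer it wraps, which is what enables the smooth transition described above.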
Prismer is a generative model, trained with a single objective: to predict the next text token autoregressively. As such, we re-formulate all vision-language reasoning tasks as a language modelling or prefix language modelling problem. For example, given the input image with its multi-task tokens and a question as the prefix, the model generates the answer for the visual question answering task; given the input image with its multi-task tokens, the model generates a caption for the image captioning task. Given a prefix prompt, we may either sample the output text autoregressively, as in an open-ended generative setting, or rank the log-likelihoods of a fixed set of completions, as in a closed-ended generative setting.
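As an illustration of the closed-ended setting, the sketch below ranks a fixed set of candidate completions by the log-likelihood the decoder assigns to them. `model` and `tokenizer` are hypothetical stand-ins for Prismer's language decoder and its tokenizer, not real API calls from the released code.

```python
import torch
import torch.nn.functional as F

def rank_completions(model, tokenizer, prefix, candidates, visual_context):
    """Closed-ended inference sketch: score each candidate completion by the
    log-likelihood of its tokens given the prefix, and return the best one."""
    scores = []
    prefix_len = tokenizer(prefix, return_tensors="pt").input_ids.size(1)
    for cand in candidates:
        ids = tokenizer(prefix + " " + cand, return_tensors="pt").input_ids
        logits = model(ids, encoder_hidden_states=visual_context).logits
        # Log-probability of each token, conditioned on everything before it.
        log_probs = F.log_softmax(logits[:, :-1], dim=-1)
        targets = ids[:, 1:]
        token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        # Sum only over the completion tokens, not the shared prefix.
        scores.append(token_lp[:, prefix_len - 1:].sum().item())
    return candidates[scores.index(max(scores))]
```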
In Prismer, we include two types of pre-trained experts:
Backbone Experts: The vision-only and language-only pre-trained models, which are responsible for encoding images and texts into a meaningful sequence of tokens. Both models are required to be based on the transformer architecture, so we can easily connect them with a few trainable components of similar designs. To preserve their rich domain-specific knowledge encoded in the network parameters, the majority of the weights are frozen during pre-training.
Task Experts: The models that produce task-specific labels depending on their training datasets. In Prismer, we include up to 6 task experts, all from the vision domain, encoding three low-level vision signals (depth, surface normals, and edges) and three high-level vision signals (object labels, segmentation labels, and text labels). These task experts are treated as black-box predictors, and their predicted labels are used as input to the Prismer model. As a result, all network weights of the task experts are frozen, and they can have any design.
In addition to Prismer, we also introduce a model variant named PrismerZ, which relies solely on the power of strong backbone experts and is trained with zero task experts. PrismerZ has the same architectural design as the original Prismer but without the Experts Resampler. PrismerZ simplifies inference as it only requires RGB images, making it more efficient and applicable to a wider range of applications. Prismer is less efficient at inference due to the additional processing of expert labels, but achieves better performance.
Both Prismer and PrismerZ utilise a Vision Transformer pre-trained by CLIP as the frozen vision encoder, and RoBERTa as the frozen language decoder. We experiment with two model sizes, BASE and LARGE. The BASE model is built on top of ViT-B/16 and RoBERTaBASE, and the LARGE model is built on top of ViT-L/14 and RoBERTaLARGE. In Prismer, we apply an Experts Resampler of the same design, with roughly 50M parameters, in both model sizes. The architecture details are summarised in the table below.
| Model | Resampler Layers | Resampler Width | Vision Backbone | Vision Layers | Vision Width | Language Backbone | Language Layers | Language Width | Trainable Param. | Total Param. |
|---|---|---|---|---|---|---|---|---|---|---|
| PrismerBASE | 4 | 768 | ViT-B/16 | 12 | 768 | RoBERTaBASE | 12 | 768 | 160M | 980M |
| PrismerLARGE | 4 | 1024 | ViT-L/14 | 24 | 768 | RoBERTaLARGE | 24 | 1024 | 360M | 1.6B |
| PrismerZBASE | - | - | ViT-B/16 | 12 | 768 | RoBERTaBASE | 12 | 768 | 105M | 275M |
| PrismerZLARGE | - | - | ViT-L/14 | 24 | 768 | RoBERTaLARGE | 24 | 1024 | 270M | 870M |
We show that both Prismer and PrismerZ achieve superior performance considering their model sizes, which suggests that the strong backbone experts are primarily responsible for good generalisation. However, the task experts provide an additional boost in performance, particularly on image captioning tasks and in the LARGE model variant. Prismer achieves image captioning performance comparable to BLIP and LEMON, despite being trained on 10 and 20 times less data, respectively. Additionally, Prismer achieves VQAv2 accuracy comparable to GIT, despite being trained on 60 times less data. Whilst we acknowledge a noticeable performance gap between Prismer and the current state-of-the-art VLMs (such as CoCa, GIT-2 and PaLI), these models require substantially higher training costs and access to large-scale private training data.
Our generative pre-training approach allows for zero-shot generalisation, where the models can be directly applied to image captioning tasks without additional fine-tuning. In the following tables, we show that Prismer achieves significantly better performance than SimVLM on the NoCaps dataset, whilst using 140 times less training data. Additionally, we note that the zero-shot performance of Prismer even surpasses the fine-tuned performance of certain VLMs such as OSCAR and VinVL, as shown in the previous section.
| COCO Caption | B@4 | M | C | S |
|---|---|---|---|---|
| ZeroCap | 2.6 | 11.5 | 14.6 | 5.5 |
| MetaLM | 24.5 | 22.5 | 82.2 | 15.7 |
| VLKD | 25.8 | 23.1 | 85.1 | 16.9 |
| Flamingo | - | - | 84.3 | - |
| CapDec | 26.4 | 25.1 | 91.8 | - |
| Prismer | 39.5 | 30.4 | 129.7 | 23.8 |
| NoCaps | C | S |
|---|---|---|
| FewVLM | 47.7 | 9.1 |
| MetaLM | 58.7 | 8.6 |
| VLKD | 63.6 | 12.8 |
| SimVLMLARGE | 96.6 | - |
| SimVLMHUGE | 101.4 | - |
| Prismer | 107.9 | 14.8 |
We present a list of example captions generated by Prismer, along with their corresponding RGB images and task expert predictions. The results show that both PrismerBASE and PrismerLARGE are capable of generating captions that are semantically coherent and aligned with the visual content of the images. Notably, PrismerLARGE generates captions of higher quality compared to PrismerBASE, exhibiting a deep understanding of fine-grained object semantics such as brand recognition (e.g. Mercedes, CK One) and cultural concepts (e.g. vintage drawing, tango), nearly indistinguishable from human-written captions.
Interestingly, we can easily notice that some of the expert predictions are either incorrect or not useful for image captioning. This observation motivates us to design Prismer not to overly rely on expert labels, and only consider them as auxiliary signals.
Finally, we fine-tune and evaluate Prismer on the ImageNet dataset in a few-shot setting. Following the approach outlined in CLIP, we convert the classification task into a language modelling problem by mapping each unique category to a template caption: "A photo of a [CLASS NAME]". Unlike Flamingo, which performs few-shot classification via in-context examples without gradient updates, we perform few-shot classification via lightweight fine-tuning, following GIT. This is closer to the standard linear probe setting, treating the entire language decoder as an image classifier. Accordingly, we also compare with the few-shot linear probe performance of Prismer's original vision backbones, ViT-B/16 and ViT-L/14.
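The sketch below illustrates this reformulation, re-using the closed-ended ranking sketch from earlier: each class name is mapped to its template caption, and the predicted class is the caption the (lightly fine-tuned) decoder scores highest. The class names and helper signature are illustrative only.

```python
# Sketch: cast ImageNet classification as caption scoring (illustrative names only).
class_names = ["goldfish", "tabby cat", "golden retriever"]   # small illustrative subset

def classify(model, tokenizer, image_features, class_names):
    # Each class becomes the caption "A photo of a [CLASS NAME]."; the class whose
    # caption receives the highest log-likelihood is the prediction.
    return rank_completions(model, tokenizer,
                            prefix="A photo of a",
                            candidates=[f"{name}." for name in class_names],
                            visual_context=image_features)
```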
From the results shown in the accompanying figure, we observe that Prismer underperforms GIT and Flamingo, which both have stronger vision backbones and are pre-trained on significantly more data. However, Prismer still outperforms its original vision backbones ViT-B and ViT-L by a large margin, especially in the very-few-shot setting. This suggests that Prismer's generalisation abilities are enhanced by the multi-modal training data and expert labels, and its performance can likely be improved further by using an even stronger vision backbone.
We conduct experiments to probe Prismer carefully and discover some interesting abilities. All experiments are evaluated on the VQAv2 test-dev split, with a reduced training setting.
Observation #1: More Experts, Better Performance. We observe that the performance of Prismer improves with the addition of more task experts. This is because more experts provide a greater diversity of domain knowledge to the model. However, we also note that the performance of the model eventually plateaus, which suggests that additional task experts do not provide any extra gains beyond a certain number.
Observation #2: Better Experts, Better Performance. To evaluate the impact of expert quality on Prismer's performance, we construct a corrupted depth expert by replacing a certain number of predicted depth labels with random noise sampled from a uniform distribution. We find that Prismer's performance improves as the quality of the depth expert improves. This is intuitive, as better experts provide more accurate domain knowledge, allowing the model to perceive the scene more accurately.
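One plausible reading of this corruption is sketched below: a chosen fraction of the depth maps in a batch is replaced with uniform noise. The per-sample granularity is an assumption for illustration; the text above only states that predicted depth labels are replaced with uniform noise.

```python
import torch

def corrupt_depth(depth_labels, corruption_ratio):
    """Sketch of the corrupted-depth ablation: replace a fraction of the
    predicted depth maps in a batch with noise drawn from U(0, 1)."""
    batch_size = depth_labels.size(0)
    num_corrupt = int(batch_size * corruption_ratio)
    corrupted = depth_labels.clone()
    # Pick a random subset of samples and overwrite their depth maps with noise.
    idx = torch.randperm(batch_size)[:num_corrupt]
    corrupted[idx] = torch.rand_like(corrupted[idx])
    return corrupted
```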
Observation #3: Robustness to Noisy Experts. Our results also demonstrate that Prismer maintains performance even when including experts that predict noise. Interestingly, adding noise can even result in a non-trivial improvement compared to training on RGB images alone, which can be considered as a form of implicit regularisation. This property allows the model to safely include many experts without degrading the performance, even when the expert is not necessarily informative. Therefore, Prismer presents a more effective learning strategy than the standard multi-task or auxiliary learning methods, which either require exploring task relationships or designing more advanced optimisation procedures.
In this paper, we have introduced Prismer, a vision-language generative model designed for reasoning tasks. Prismer is parameter-efficient and utilises a small number of trainable components to connect an ensemble of diverse, pre-trained experts. By leveraging these experts, Prismer achieves competitive performance in image captioning, VQA, and image classification benchmarks, comparable to models trained on up to two orders of magnitude more data.
For full transparency, we now discuss some limitations of Prismer during our implementation and explore potential future directions for this work.
Multi-modal In-context Learning: Zero-shot in-context generalisation is an emergent property that only exists in very large language models. In this work, we build Prismer on top of a small-scale language model with the main focus on parameter-efficient learning. Therefore, it does not have the ability to perform few-shot in-context prompting by design.
Zero-shot Adaptation to New Experts: We experiment with running inference on a pre-trained Prismer using a segmentation expert trained on a different dataset. Although we apply the same language model to encode semantic labels, Prismer shows limited adaptability to an expert producing a different set of semantic information, which leads to a notable performance drop.
Free-form Inference on Partial Experts: Similarly, we discover that Prismer entangles the multi-task features from all experts included during pre-training. Therefore, providing only a subset of experts during inference leads to a notable performance drop. We attempted to use a different training objective, such as masked auto-encoding, to encourage Prismer to reason over an arbitrary number of experts, but this eventually led to degraded fine-tuned performance.
Representation of Expert Knowledge: In our current design of Prismer, we convert all expert labels into an image-like 3-dimensional tensor via task-specific post-processing for simplicity. There are other efficient methods to represent expert knowledge, such as converting object detection into a sequence of text tokens. This may lead to stronger reasoning performance and a more stable training landscape in future works.
If you find this work useful in your own research, please consider citing the following.
@article{liu2024prismer,
title={Prismer: A Vision-Language Model with Multi-Task Experts},
author={Liu, Shikun and Fan, Linxi and Johns, Edward and Yu, Zhiding and Xiao, Chaowei and Anandkumar, Anima},
journal={Transactions on Machine Learning Research},
year={2024}
}
Prismer is a data-efficient vision-language model that leverages diverse pre-trained experts through their predicted multi-task signals. It can perform vision-language reasoning tasks such as image captioning and VQA.
The model name "Prismer" draws on the analogy of an optical prism, which splits white light into a spectrum of colours; here, we break down a single reasoning task into diverse domain-specific reasoning.