Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning | Amit Peleg*, Naman Singh*, Matthias Hein | NeurIPS, 2025
```bash
conda create --name clic python=3.12
conda activate clic
conda install pytorch==2.2.2 torchvision==0.17.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r pip_reqs.txt
python -m spacy download en_core_web_sm
```

To run the evaluation:

```bash
bash eval.sh
```
- Choose the `architecture` in the bash file.
- Choose the `modelName` in the bash file (see the sketch after the table below):
  - For pre-trained non-CLIC models, use the `Pre-train key` from the table below.
  - For CLIC models, use the `CLIC FT-key` from the table below.
  - For evaluating your own checkpoints, use the `Pre-train key` from training and add the argument `--load_pretrained_clip path/to/ckpt/folder` to the eval file.
- Evaluation datasets (ImageNet, COCO, SugarCrepe, SugarCrepe++, etc.) need to be downloaded by the user.
- Make sure the evaluation dataset paths in `local_settings` are correct.
| Model name | Pre-train key | CLIC FT-key | CLIC-model HF-link |
|---|---|---|---|
| ViT-B-32-CogVLM | ViT-B-32 | HF-CLIC-ViT-B-32-224-CogVLM | HF-Link |
| ViT-B-32-PixPr-RedCaps | ViT-B-32 | HF-CLIC-ViT-B-32-224-PixPr-RedCaps | HF-Link |
| ViT-B-16-CogVLM | ViT-B-16 | HF-CLIC-ViT-B-16-224-CogVLM | HF-Link |
| ViT-L-14-CogVLM | ViT-L-14 | HF-CLIC-ViT-L-14-224-CogVLM | HF-Link |
| ViT-L-14-PixPr-RedCaps | ViT-L-14 | HF-CLIC-ViT-L-14-224-PixPr-RedCaps | HF-Link |
| CLIPA-CogVLM | CLIPA | HF-CLIC-CLIPA-ViT-L-14-224-CogVLM | HF-Link |
| CLIPA-PixPr-RedCaps | CLIPA | HF-CLIC-CLIPA-ViT-L-14-224-PixPr-RedCaps | HF-Link |
| CLIPS-CogVLM | CLIPS | HF-CLIC-CLIPS-ViT-L-14-224-CogVLM | HF-Link |
| CLIPS-PixPr-RedCaps | CLIPS | HF-CLIC-CLIPS-ViT-L-14-224-PixPr-RedCaps | HF-Link |
Note: with the correct key in the `modelName` variable in `eval.sh`, the models will be downloaded automatically.
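
For concreteness, here is a minimal sketch of the variables to set in `eval.sh`. The variable names follow the instructions above, but the exact syntax inside the script may differ; the key shown is the ViT-B-32 CogVLM entry from the table.

```bash
# Hypothetical excerpt of eval.sh -- check the script for the exact syntax.
architecture="ViT-B-32"
modelName="HF-CLIC-ViT-B-32-224-CogVLM"   # CLIC FT-key from the table above

# For your own checkpoint, use the Pre-train key instead and additionally pass:
#   --load_pretrained_clip path/to/ckpt/folder
```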
We fine-tune different models with CLIC using:
- CogVLM-relabelled 1M Laion samples
- the RedCaps subset of the PixelProse dataset

The default location for the datasets is the `data` folder. You can change the location of each dataset in the `local_settings` file.
CogVLM-relabelled Laion:

```bash
mkdir data
# Download the csv file with the image URLs
wget -O data/CLIC-CogVLM-relabelled-Laion.csv https://huggingface.co/datasets/nmndeep/CLIC-CogVLM-relabelled-Laion
# Download the 1M Laion subset and create a csv with the image locations
python -m assets.download_cogvlm
```

RedCaps (PixelProse):

```bash
# Download the RedCaps images as described in https://huggingface.co/datasets/tomg-group-umd/pixelprose
python -m assets.download_redcaps
# Process the captions and create the csv file
# If you changed the default location, make sure to change the output path argument as well
python -m assets.create_dataset --input_file data/path/to/downloaded/csv/file.csv --output_file data/redcaps_pixelprose/redcaps_pixelprose.csv
```
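
Optionally, a quick sanity check that the csv files exist and are non-empty; the paths assume the default `data` locations used above and should be adjusted if you changed them.

```bash
# Peek at the downloaded/generated csv files (paths assume the defaults above)
head -n 3 data/CLIC-CogVLM-relabelled-Laion.csv
wc -l data/CLIC-CogVLM-relabelled-Laion.csv data/redcaps_pixelprose/redcaps_pixelprose.csv
```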
To fine-tune with CLIC:

- Change the `dataset` variable in `trigger_train.sh` to `laion_cogvlm`/`redcaps_pixelprose`.
- Change the `modelName` and `architecture` variables as desired in `trigger_train.sh` (see the sketch below this list).
  - For the `modelName`, use the `Pre-train key` from the table above.
- Make sure the csv file paths in `local_settings` are correct.
- You can run training without evaluation by adding the `--no_eval` argument to the training script.
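
For illustration, a minimal sketch of the variables to set in `trigger_train.sh`; the names follow the list above, but the exact syntax in the script may differ.

```bash
# Hypothetical excerpt of trigger_train.sh -- check the script for the exact syntax.
dataset="laion_cogvlm"      # or "redcaps_pixelprose"
architecture="ViT-B-32"
modelName="ViT-B-32"        # Pre-train key from the table above
```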
Then launch training:

```bash
bash trigger_train.sh
```

To train with NegCLIP:

```bash
bash trigger_train_negclip.sh
```

To train the baseline:

```bash
bash trigger_train_baseline.sh
```

This work uses code/models from:
If you find this repository useful, please consider citing our paper:
```bibtex
@inproceedings{peleg2025advancing,
  title={Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning},
  author={Peleg, Amit and Singh, Naman Deep and Hein, Matthias},
  booktitle={NeurIPS},
  year={2025}
}
```