When Less is Enough: Adaptive Token Reduction for Efficient Image Representation

This repository contains the code for the experiments in our paper.

In the ./clip_features directory, you will find experiments conducted for the openai/clip-vit-large-patch14-336 model, while the ./siglip_features directory contains experiments for the google/siglip-so400m-patch14-384 model.

To understand the core idea of the work, consider the following observation: transformer-based visual encoders such as CLIP, DINO, SigLIP, and EVA start by dividing the image into patches and then perform several rounds of information exchange between them. The first attention layer performs the first round, the second layer the second, and so on. After several such steps, each patch embedding can be assumed to contain information about some of its neighbors, including patches that consist entirely or almost entirely of the same information. Thus, some of the patches (and their corresponding vectors) can be discarded without losing useful information.
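
As a concrete illustration, here is a minimal sketch that extracts the per-patch features of an image with the openai/clip-vit-large-patch14-336 encoder and keeps only half of the patch tokens. The feature-norm score is purely a stand-in for the learned selector in this repository, and the image path is illustrative.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

model_name = "openai/clip-vit-large-patch14-336"
processor = CLIPImageProcessor.from_pretrained(model_name)
encoder = CLIPVisionModel.from_pretrained(model_name).eval()

image = Image.open("example.jpg").convert("RGB")  # illustrative path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state   # (1, 577, 1024)

patch_features = hidden[:, 1:, :]                  # drop [CLS]: (1, 576, 1024)

# Stand-in redundancy score; the real selector is learned (see feature_selection).
scores = patch_features.norm(dim=-1)               # (1, 576)
keep = scores.topk(patch_features.shape[1] // 2, dim=1).indices
selected = torch.gather(
    patch_features, 1,
    keep.unsqueeze(-1).expand(-1, -1, patch_features.shape[-1]),
)                                                  # (1, 288, 1024)
```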

Objectives of the Work:

  1. Learn to identify these redundant features.
  2. Investigate how their removal affects the performance of vision-language models (VLMs).

Results:

  1. We demonstrated that our method allows half of the features to be removed in OCR-like tasks without any loss of answer quality on the benchmarks we tested.
  2. We also compared our pruning of redundant features with randomly dropping the same percentage of features. This experiment showed that our method is significantly superior to random pruning; in other words, we have indeed learned to discard the features that carry redundant information. The results for the LLaVA-NeXT model can be seen in the graph below or in more detail in the paper.

  3. Additionally, this approach turns out to be highly interpretable, as can be seen in the preview image on this page, in the graph below, or in other examples in the paper.

GitHub Guide:

In the ./clip_features directory, you will find the following (the structure of ./siglip_features is similar):

  • feature_selection - Contains all the code necessary for training the feature selector.
  • dataset_generation - The selector is trained on a dataset consisting of image features only. In other words, you take many images, run an encoder (for example, CLIP) on each of them, obtain the features for every image, and store all of the features in one folder; that folder is our dataset. This directory contains a notebook where this is done (a sketch of the feature-extraction loop is given after this list).
  • LLaVA-NeXT-with-feature-selection - To test the effectiveness of the selector, we integrated it into the model's code. After the vision tower processes the image and produces the features, our selector keeps only the necessary ones, and LLaVA works with the reduced set (a schematic sketch of this integration is also given after this list). Important: we do not perform any fine-tuning; we simply take the LLaVA weights from Hugging Face and slightly modify the code so that it works with the features chosen by our selector. The language model is not tuned at all. And it works!
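
A minimal sketch of the dataset-generation step described above, assuming a local folder of images. The input and output paths and the file naming are illustrative; the actual notebook in ./clip_features/dataset_generation may differ in details.

```python
import glob
import os

import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

model_name = "openai/clip-vit-large-patch14-336"
processor = CLIPImageProcessor.from_pretrained(model_name)
encoder = CLIPVisionModel.from_pretrained(model_name).eval()

os.makedirs("clip_feature_dataset", exist_ok=True)  # illustrative output folder

for i, path in enumerate(sorted(glob.glob("images/*.jpg"))):  # illustrative input folder
    image = Image.open(path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt")["pixel_values"]
    with torch.no_grad():
        # Keep patch tokens only (drop the leading [CLS] token).
        features = encoder(pixel_values).last_hidden_state[:, 1:, :]
    torch.save(features, os.path.join("clip_feature_dataset", f"{i:06d}.pt"))
```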
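
And a schematic sketch of where the selector sits in the LLaVA-NeXT pipeline. The FeatureSelector below is a hypothetical stand-in (a single linear scoring layer with top-k selection); the real, trained selector and the actual modified model code live in feature_selection and LLaVA-NeXT-with-feature-selection, respectively.

```python
import torch
import torch.nn as nn


class FeatureSelector(nn.Module):
    """Hypothetical stand-in: score each patch token and keep the top fraction."""

    def __init__(self, hidden_dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)
        self.keep_ratio = keep_ratio

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, hidden_dim)
        scores = self.scorer(patch_features).squeeze(-1)       # (B, N)
        k = max(1, int(patch_features.shape[1] * self.keep_ratio))
        keep = scores.topk(k, dim=1).indices                   # (B, k)
        return torch.gather(
            patch_features, 1,
            keep.unsqueeze(-1).expand(-1, -1, patch_features.shape[-1]),
        )


# Schematic placement inside the VLM (no fine-tuning of the language model):
#   image_features = vision_tower(pixel_values)   # e.g. (B, 576, 1024) for CLIP ViT-L/14 at 336px
#   image_features = selector(image_features)     # e.g. (B, 288, 1024) after a 50% reduction
#   ...the projector and the language model then run on the reduced token set unchanged.
```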

About

The work is devoted to selecting the most information-rich subset of the full set of visual features.
