When Less is Enough: Adaptive Token Reduction for Efficient Image Representation

This repository contains the code for the experiments in our paper.

In the ./clip_features directory, you will find experiments conducted for the openai/clip-vit-large-patch14-336 model, while the ./siglip_features directory contains experiments for the google/siglip-so400m-patch14-384 model.

To understand the core idea of the work, consider the following observation: transformer-based visual encoders such as CLIP, DINO, SigLIP, and EVA start by dividing the image into patches and then perform several rounds of information exchange between them. The first attention layer performs the first round, the second layer the second, and so on. After several such steps, each patch embedding can be assumed to contain information about some of its neighbors, including patches that consist entirely or almost entirely of the same information. Thus, some of the patches (and their corresponding vectors) can be discarded without losing useful information.
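
As a concrete illustration, here is a minimal sketch that extracts the per-patch features of an image with the openai/clip-vit-large-patch14-336 encoder and keeps only half of the patch tokens. The feature-norm score is purely a stand-in for the learned selector in this repository, and the image path is illustrative.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

model_name = "openai/clip-vit-large-patch14-336"
processor = CLIPImageProcessor.from_pretrained(model_name)
encoder = CLIPVisionModel.from_pretrained(model_name).eval()

image = Image.open("example.jpg").convert("RGB")  # illustrative path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state   # (1, 577, 1024)

patch_features = hidden[:, 1:, :]                  # drop [CLS]: (1, 576, 1024)

# Stand-in redundancy score; the real selector is learned (see feature_selection).
scores = patch_features.norm(dim=-1)               # (1, 576)
keep = scores.topk(patch_features.shape[1] // 2, dim=1).indices
selected = torch.gather(
    patch_features, 1,
    keep.unsqueeze(-1).expand(-1, -1, patch_features.shape[-1]),
)                                                  # (1, 288, 1024)
```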

Objectives of the Work:

  1. Learn to identify these redundant features.
  2. Investigate how their removal affects the performance of vision-language models (VLMs).

Results:

  1. We demonstrated that our method allows half of the features to be removed in OCR-like tasks without any loss of answer quality on the benchmarks we tested.
  2. We also compared our pruning of redundant features with randomly dropping the same percentage of features. This experiment showed that our method is significantly superior to random pruning; in other words, we have indeed learned to discard the features that carry redundant information. The results for the LLaVA-NeXT model can be seen in the graph below or in more detail in the paper.

  3. Additionally, this approach turns out to be highly interpretable, as can be seen in the preview image on this page, in the graph below, or in other examples in the paper.

GitHub Guide:

In the ./clip_features directory, you will find the following (the structure of ./siglip_features is similar):

  • feature_selection - Contains all the code necessary for training the feature selector.
  • dataset_generation - The selector is trained on a dataset consisting of image features only. In other words, you take many images, run an encoder (for example, CLIP) on each of them, obtain the features for every image, and store all of the features in one folder; that folder is our dataset. This directory contains a notebook where this is done (a sketch of the feature-extraction loop is given after this list).
  • LLaVA-NeXT-with-feature-selection - To test the effectiveness of the selector, we integrated it into the model's code. After the vision tower processes the image and produces the features, our selector keeps only the necessary ones, and LLaVA works with the reduced set (a schematic sketch of this integration is also given after this list). Important: we do not perform any fine-tuning; we simply take the LLaVA weights from Hugging Face and slightly modify the code so that it works with the features chosen by our selector. The language model is not tuned at all. And it works!
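
A minimal sketch of the dataset-generation step described above, assuming a local folder of images. The input and output paths and the file naming are illustrative; the actual notebook in ./clip_features/dataset_generation may differ in details.

```python
import glob
import os

import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

model_name = "openai/clip-vit-large-patch14-336"
processor = CLIPImageProcessor.from_pretrained(model_name)
encoder = CLIPVisionModel.from_pretrained(model_name).eval()

os.makedirs("clip_feature_dataset", exist_ok=True)  # illustrative output folder

for i, path in enumerate(sorted(glob.glob("images/*.jpg"))):  # illustrative input folder
    image = Image.open(path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt")["pixel_values"]
    with torch.no_grad():
        # Keep patch tokens only (drop the leading [CLS] token).
        features = encoder(pixel_values).last_hidden_state[:, 1:, :]
    torch.save(features, os.path.join("clip_feature_dataset", f"{i:06d}.pt"))
```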
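
And a schematic sketch of where the selector sits in the LLaVA-NeXT pipeline. The FeatureSelector below is a hypothetical stand-in (a single linear scoring layer with top-k selection); the real, trained selector and the actual modified model code live in feature_selection and LLaVA-NeXT-with-feature-selection, respectively.

```python
import torch
import torch.nn as nn


class FeatureSelector(nn.Module):
    """Hypothetical stand-in: score each patch token and keep the top fraction."""

    def __init__(self, hidden_dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)
        self.keep_ratio = keep_ratio

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, hidden_dim)
        scores = self.scorer(patch_features).squeeze(-1)       # (B, N)
        k = max(1, int(patch_features.shape[1] * self.keep_ratio))
        keep = scores.topk(k, dim=1).indices                   # (B, k)
        return torch.gather(
            patch_features, 1,
            keep.unsqueeze(-1).expand(-1, -1, patch_features.shape[-1]),
        )


# Schematic placement inside the VLM (no fine-tuning of the language model):
#   image_features = vision_tower(pixel_values)   # e.g. (B, 576, 1024) for CLIP ViT-L/14 at 336px
#   image_features = selector(image_features)     # e.g. (B, 288, 1024) after a 50% reduction
#   ...the projector and the language model then run on the reduced token set unchanged.
```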

About

The work is devoted to selecting the most information-rich subset of the full set of visual features.
