Efficient inference wrapper for the SubCell subcellular protein localization foundation model
SubCellPortable provides a streamlined interface for running the SubCell model on immunofluorescence (IF) microscopy images. It generates single-cell embeddings that encode cell morphology or protein localization, and predicts protein subcellular localization from multi-channel fluorescence microscopy images.

Preprint: *SubCell: Subcellular protein localization foundation model* (Gupta et al., 2024)
```bash
# Clone repository
git clone https://github.com/yourusername/SubCellPortable.git
cd SubCellPortable

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

- Prepare your input CSV (`path_list.csv`):

```csv
r_image,y_image,b_image,g_image,output_prefix
images/cell_1_mt.png,,images/cell_1_nuc.png,images/cell_1_prot.png,cell1_
images/cell_2_mt.png,,images/cell_2_nuc.png,images/cell_2_prot.png,cell2_
```
Channel mapping:

- `r` = microtubules (red)
- `y` = ER (yellow)
- `b` = nuclei (blue/DAPI)
- `g` = protein of interest (green)

Leave channels empty if they are not available (e.g., use `rbg` for 3-channel images). A short sketch for generating this CSV programmatically is shown below.
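As an illustration, the input CSV can be generated programmatically. The snippet below is a minimal sketch, assuming your images follow a `{sample}_mt.png` / `{sample}_nuc.png` / `{sample}_prot.png` naming scheme (these file names are placeholders, not a requirement of SubCellPortable):

```python
import csv
from pathlib import Path

# Hypothetical naming scheme: <sample>_mt.png, <sample>_nuc.png, <sample>_prot.png
image_dir = Path("images")
samples = sorted({p.name.rsplit("_", 1)[0] for p in image_dir.glob("*_prot.png")})

with open("path_list.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["r_image", "y_image", "b_image", "g_image", "output_prefix"])
    for sample in samples:
        writer.writerow([
            str(image_dir / f"{sample}_mt.png"),    # r: microtubules
            "",                                     # y: ER channel left empty here
            str(image_dir / f"{sample}_nuc.png"),   # b: nuclei (DAPI)
            str(image_dir / f"{sample}_prot.png"),  # g: protein of interest
            f"{sample}_",                           # output prefix
        ])
```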
- Configure settings (`config.yaml`):

```yaml
model_channels: "rybg"      # Channel configuration
output_dir: "./results"     # Output directory
batch_size: 128             # Batch size (adjust for GPU memory)
gpu: 0                      # GPU device ID (-1 for CPU)
output_format: "combined"   # "combined" (h5ad) or "individual" (npy)
```

- Run inference:

```bash
python process.py
```

```bash
# Basic run with config file
python process.py

# Specify parameters via CLI
python process.py --output_dir ./results --batch_size 256 --gpu 0

# Custom config and input files
python process.py --config experiment_config.yaml --path_list experiment_data.csv -o ./results

# Embeddings only (faster, no classification)
python process.py -o ./results --embeddings_only

# Get help
python process.py --help
```

Recommended Format:
```csv
r_image,y_image,b_image,g_image,output_prefix
path/to/image1_mt.png,,path/to/image1_nuc.png,path/to/image1_prot.png,sample_1
path/to/image2_mt.png,,path/to/image2_nuc.png,path/to/image2_prot.png,batch_A/sample_2
```
- Skip rows by prefixing them with `#`
- Create subfolders in the output folder by adding them to `output_prefix`, e.g. `/subfolder_1/sample_1`
Legacy Format (deprecated but still supported):

```csv
r_image,y_image,b_image,g_image,output_folder,output_prefix
...
```
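Before launching a long run, it can help to verify that every path referenced in the input CSV actually exists. The following is a minimal sketch (not part of SubCellPortable itself) that checks the recommended-format columns:

```python
import csv
from pathlib import Path

missing = []
with open("path_list.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Skip commented-out rows (here assumed to start with '#' in the first column)
        if (row.get("r_image") or "").startswith("#"):
            continue
        for channel in ("r_image", "y_image", "b_image", "g_image"):
            path = (row.get(channel) or "").strip()
            if path and not Path(path).exists():  # empty = channel not available
                missing.append(path)

if missing:
    print(f"{len(missing)} missing image file(s):")
    for path in missing:
        print(" ", path)
else:
    print("All referenced image files exist.")
```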
| Parameter | Description | Default | Example |
|---|---|---|---|
| `--config` | Path to configuration YAML file | `config.yaml` | `experiment.yaml` |
| `--path_list` | Path to input CSV file | `path_list.csv` | `data.csv` |
| `--output_dir`, `-o` | Output directory for all results | - | `./results` |
| `--model_channels`, `-c` | Channel configuration | `rybg` | `rbg`, `ybg`, `bg` |
| `--model_type`, `-m` | Model architecture | `mae_contrast_supcon_model` | `vit_supcon_model` |
| `--output_format` | Output format | `combined` | `individual` |
| `--num_workers`, `-w` | Data loading workers | `4` | `8` |
| `--gpu`, `-g` | GPU device ID (`-1` = CPU) | `-1` | `0` |
| `--batch_size`, `-b` | Batch size | `128` | `256` |
| `--embeddings_only` | Skip classification | `False` | - |
| Parameter | Description | Default |
|---|---|---|
| `--update_model`, `-u` | Download/update models | `False` |
| `--prefetch_factor`, `-p` | Prefetch batches | `2` |
| `--create_csv` | Generate combined CSV | `False` |
| `--save_attention_maps` | Save attention visualizations | `False` |
| `--async_saving` | Async file saving (individual output only) | `False` |
| `--quiet`, `-q` | Suppress verbose logging | `False` |
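For scripted or batch runs it can be convenient to drive the CLI from Python. This is only a convenience sketch combining flags from the two tables above; the flag names come from this README, while the surrounding script (and the assumption that the boolean flags are plain switches) is illustrative:

```python
import subprocess

# Combine main and advanced flags from the tables above (values are examples)
cmd = [
    "python", "process.py",
    "--path_list", "path_list.csv",
    "--output_dir", "./results",
    "--batch_size", "256",
    "--gpu", "0",
    "--update_model",   # download/update model weights on first run
    "--create_csv",     # also write the combined result.csv
    "--quiet",          # suppress verbose logging
]
subprocess.run(cmd, check=True)
```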
File: `embeddings.h5ad` (AnnData-compatible)

```python
import anndata as ad

# Load results
adata = ad.read_h5ad("results/embeddings.h5ad")

# Access data
embeddings = adata.X                          # (n_samples, 1536)
probabilities = adata.obsm['probabilities']   # (n_samples, 31)
sample_ids = adata.obs_names                  # Image identifiers
```

Compatible with scanpy and other single-cell tools.
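As a downstream example, the embeddings can be fed directly into a standard scanpy workflow. The snippet below is a minimal sketch, assuming scanpy is installed; the `top_class` label is derived here from the stored probabilities rather than read from the file:

```python
import numpy as np
import scanpy as sc
import anndata as ad

adata = ad.read_h5ad("results/embeddings.h5ad")

# Label each cell with its most probable class index (class names live in inference.py)
adata.obs["top_class"] = np.argmax(adata.obsm["probabilities"], axis=1).astype(str)

# Neighborhood graph + UMAP on the 1536-D SubCell embeddings
sc.pp.neighbors(adata, use_rep="X")
sc.tl.umap(adata)
sc.pl.umap(adata, color="top_class")
```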
Files per image:

- `{output_prefix}_embedding.npy` - 1536-D embedding vector
- `{output_prefix}_probabilities.npy` - 31-class probability distribution
- `{output_prefix}_attention_map.png` - Attention visualization (optional)
```python
import numpy as np

embedding = np.load("results/cell1_embedding.npy")   # Shape: (1536,)
probs = np.load("results/cell1_probabilities.npy")   # Shape: (31,)
```
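To turn a probability vector into readable predictions, the class indices can be ranked with NumPy; mapping indices to names uses the `CLASS2NAME` dictionary in `inference.py` (see the class list further below), assuming that module is importable from the repository root and keyed by integer class index:

```python
import numpy as np
from inference import CLASS2NAME  # assumed importable; assumed to map class index -> name

probs = np.load("results/cell1_probabilities.npy")
top3 = np.argsort(probs)[::-1][:3]  # indices of the 3 highest probabilities
for idx in top3:
    print(f"{CLASS2NAME[int(idx)]}: {probs[idx]:.3f}")
```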
File: `result.csv`

| Column | Description |
|---|---|
| `id` | Sample identifier |
| `top_class_name` | Top predicted location |
| `top_class` | Top class index |
| `top_3_classes_names` | Top 3 predictions (comma-separated) |
| `top_3_classes` | Top 3 indices |
| `prob00` - `prob30` | Full probability distribution |
| `feat0000` - `feat1535` | Full embedding vector |
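The combined CSV can be inspected with pandas. A minimal sketch, assuming the column layout in the table above and the default `./results` output directory:

```python
import pandas as pd

df = pd.read_csv("results/result.csv")

# Summary of predicted localizations
print(df["top_class_name"].value_counts())

# Pull out the embedding matrix (feat0000 ... feat1535) as a NumPy array
feature_cols = [c for c in df.columns if c.startswith("feat")]
embeddings = df[feature_cols].to_numpy()  # shape: (n_samples, 1536)
```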
The model predicts 31 subcellular locations:

<details>
<summary>View all 31 classes</summary>

- Actin filaments
- Aggresome
- Cell Junctions
- Centriolar satellite
- Centrosome
- Cytokinetic bridge
- Cytoplasmic bodies
- Cytosol
- Endoplasmic reticulum
- Endosomes
- Focal adhesion sites
- Golgi apparatus
- Intermediate filaments
- Lipid droplets
- Lysosomes
- Microtubules
- Midbody
- Mitochondria
- Mitotic chromosome
- Mitotic spindle
- Nuclear bodies
- Nuclear membrane
- Nuclear speckles
- Nucleoli
- Nucleoli fibrillar center
- Nucleoli rim
- Nucleoplasm
- Peroxisomes
- Plasma membrane
- Vesicles
- Negative
</details>

Class names and visualization colors are available in `inference.py` (`CLASS2NAME`, `CLASS2COLOR` dictionaries).
Models are automatically downloaded on the first run with `-u`/`--update_model`:

```bash
python process.py -u --output_dir ./results
```

Edit `models_urls.yaml` to specify custom model URLs:
```yaml
rybg:                            # 4-channel configuration
  mae_contrast_supcon_model:
    encoder: "s3://bucket/path/to/encoder.pth"
    classifier_s0: "https://url/to/classifier.pth"
```

If you use SubCellPortable in your research, please cite:
```bibtex
@article{gupta2024subcell,
  title={SubCell: Subcellular protein localization foundation model},
  author={Gupta, Ankit and others},
  journal={bioRxiv},
  year={2024},
  doi={10.1101/2024.12.06.627299}
}
```

This project is licensed under the MIT License. See the LICENSE file for details.
- Issues: GitHub Issues
The SubCellPortable wrapper is maintained with ❤️ by the Lundberg Lab for the computational biology community.