This repository includes the R package CellREST
(Cellular
Relationships in Expression-state- based Single-cell
Trees) as introduced in
Julia Naas, Arndt von Haeseler, Christiane Elgert. "Rethinking scRNA-seq Trajectories in Phylogenetic Paradigms: Overcoming Challenges of Missing Ancestral Information“ bioRxiv 2025.07.22.664676; https://doi.org/10.1101/2025.07.22.664676.
It is compatible with a typical Seurat
(Hao et al.
2024) workflow to analyse single-cell RNA-sequencing
(scRNA-seq) data.
You can install the development version of CellREST
R Package from
GitHub via R package
remotes
# if (!require("remotes", quietly = TRUE))
# install.packages("remotes")
remotes::install_github("jn-goe/CellREST", dependencies = T)
CellREST depends on R Packages ggplot2, ggtree, igraph, Seurat
and
imports
pheatmap, ape, castor, dplyr, ggimage, magrittr, Rogue, scales, seqinr, treespace, uwot, viridis,pals
,
which are installed together with CellREST if dependencies = T
.
We recommend to install Rogue
and treespace
prior to installing
CellREST. Dependency R Package
ggtree
has to be installed via
BiocManager
.
For the below described CellREST workflow, we apply Adaptively
Thresholded Low-Rank Approximation (ALRA) (Linderman et al.
2022) imputation to the scRNA-seq data. ALRA is
implemented in the R Package SeuratWrappers
and can be installed
following installation guidelines in the respective SeuratWrappers GitHub
Repository.
For Maximum Likelihood tree reconstruction, IQ-TREE (Minh et al.
2020) can be installed
locally or called via web servers following instructions on the IQ-TREE
website. We provide the bash script
run_iqtree.sh
that wraps the IQ-TREE call in a function run_iqtree
and enables parallel tree reconstructions using multiple
threads e.g. via xargs
(see example of use below). In all our analyses, we used IQ-TREE version
2 and hence called IQ-TREE via the iqtree2
command. If you use another
version or command, please adapt it in run_iqtree.sh
, accordingly. For
the below outlined exemplary workflow, IQ-TREE output files are readily
provided in the directory toy_data
.
We recommend to use R
versions >= 4.3.0. The CellREST package has
been tested for R (v4.4.0)
and package versions
ggplot2 (v3.5.1), ggtree (v3.12.0), igraph (v2.0.3), Seurat (v5.1.0), pheatmap (v1.0.12), ape (v5.8), castor (v1.8.2), dplyr (v1.1.4), ggimage (v0.3.3), magrittr (v2.0.3), Rogue (v2.1.6), scales (v1.3.0), seqinr (v4.2-36), treespace (v1.1.4.3), uwot (v0.2.3), viridis (v0.6.5), pals (v1.9), SeuratWrappers (v0.3.5)
on macOS 15.5.
The runtime for the CellREST installation on a standard computer (16 GB RAM) is around 25min and the workflow starting from IQ-TREE output files as demonstrated below (314 cell scRNA-seq toy data) around 4min.
If you are have any issue during the installation or application of CellREST please raise a repository issue.
Here we show the workflow of CellREST on a simulated scRNA-seq dataset,
in which cells are simulated to follow a simply cycle topology (via
dyngen
(Cannoodt et al. 2021)). To simulate missing transient states, we
deleted cells in two disjunct simulation time intervals. First we
perform a typical downstream analysis, based on the R package Seurat
:
library(CellREST)
data_dir <- "toy_data/cycle_simple_sub_ALRA_alra_300hvg_ACGTbinning_rateG4_pers01_fast"
obj <- readRDS("toy_data/cycle_simple_seurat_1SIM_subset_obj.RDS")
# cell-wise log-normalization
obj <- NormalizeData(obj)
# select highly variable genes (hvgs)
obj <- FindVariableFeatures(obj, nfeatures = 300)
# gene-wise scaling (hvgs)
obj <- ScaleData(obj)
# Principal Component Analysis (on hvgs)
obj <- RunPCA(obj)
# UMAP embedding for visualization
obj <- RunUMAP(obj, reduction = "pca", dims = 1:50)
# plot UMAP, cells color-coded by scaled simulation time
FeaturePlot(obj, feature = "sim_time_scaled") +
NoAxes() + scale_color_viridis_c() + ggtitle("")
Subsequently, we use ALRA provided in R Package SeuratWrappers
to
impute technical zeros:
library(SeuratWrappers)
# ALRA imputation of technical zeros
set.seed(123)
obj <- RunALRA(obj)
#> Rank k = 3
#> Identifying non-zero values
#> Computing Randomized SVD
#> Find the 0.001000 quantile of each gene
#> Thresholding by the most negative value of each gene
#> Scaling all except for 92 columns
#> 0.00% of the values became negative in the scaling process and were set to zero
#> The matrix went from 12.10% nonzero to 78.85% nonzero
#> Setting default assay as alra
As a first CellREST workflow step, we create the pseudo-alignment from log-normalized scRNA-seq expression profiles:
# create and write pseudo-alignment
seurat_to_fasta(obj,
varf = VariableFeatures(obj, assay = "RNA"),
pre = "cycle_simple_sub_RNA",
dir = data_dir)
The FASTA file can now be input to IQ-TREE to reconstruct multiple single-cell-labeled
Maximum Likelihood trees (scML-trees). For consistency, we suggest to
fix the model via -m GTR+F+G4
. For a parallized IQ-TREE call across
multiple seeds, the following outlines an examplary use of the provided
run_iqtree.sh
bash script that wraps the IQ-TREE call in a function run_iqtree
and enables parallel tree reconstructions using multiple
threads via xargs
. Please adjust directories OUTPUT_DIR
and FASTA_FILE
as well as variables MAX_PARALLEL_JOBS
and N_CORES
according to your available computing capacities.
source run_iqtree.sh
export -f run_iqtree
OUTPUT_DIR="output_directory"
FASTA_FILE="directory_of_FASTA_file"
N_REPLICATES=100 # number of tree inferences
SEEDS=$(seq 1 $N_REPLICATES) # seeds for tree inferences
MAX_PARALLEL_JOBS=$(($(nproc) - 10)) # maximal number of parallel jobs
N_CORES=1 # number of cores per job
MODEL="GTR+F+G4" # phylogenetic model
FAST_STR="true" # fast tree inference
PERS_VAL=0.1 # pertubation strength
LOG_FILE="$OUTPUT_DIR/iqtree_parallel.log" # combined LOG file directory
printf "%s\n" "${SEEDS[@]}" | xargs -n 1 -P "$MAX_PARALLEL_JOBS" bash -c 'run_iqtree "$@"'
_ "$OUTPUT_DIR" "$FASTA_FILE" "$MODEL" "$N_CORES" "$FAST_STR" "$PERS_VAL" "$LOG_FILE"
Another, slower option is to specify the number of successive tree reconstructions in the IQ-TREE
command via --runs 100
. In this case, only one seed can be chosen and
reproducibility is only ensured if a number of cores N_CORES
is
fixed:
iqtree2 -s FASTA_FILE -m GTR+F+G4 -nt N_CORES -seed SEED -fast --runs 100
After all IQ-TREE runs are completed, trees and likelihoods can be imported in R from the IQ-TREE output folder, which contains either 100 TREEFILES or one RUNTREES file:
res <- import_trees_lh(data_dir)
tree_list <- res$tree_list
R Package ggtree
can help you to get a first brief visualization of an
exemplary scML-tree:
library(ggtree)
# choose one tree in tree list as example
tree <- tree_list[[100]]
# tree plot with simulation time based branch coloring
d <- data.frame(node=1:(tree$Nnode+length(tree$tip.label)),
color = c(obj$sim_time_scaled[tree$tip.label],
rep(NA, tree$Nnode)))
(ggtree(tree, layout="equal_angle") %<+% d + aes(color=color)) +
geom_tippoint() +
scale_color_viridis_c(name = "simulated time - scaled")
For further CellREST analyses, we however want to visualize all inferred
scML-trees and store them as PNGs, which will be the basis of
tree-embedding visualizations. The function
tree_list_pngs(..., check_for_existing_PNGs = T)
checks, whether tree
PNGs are available in directory dir_plots
. In this example, we provide
corresponding PNGs. However, new PNGs are created for
check_for_existing_PNGs = F
or in case no PNGs are found in directory
dir_plots
.
# make PNGS for all trees in tree list
png_list <- tree_list_pngs(obj = obj,
tree_list = tree_list,
col_by = "sim_time_scaled",
dir_plots = data_dir,
check_for_existing_PNGs = T,
im_res = 50)
Subsequently, tree-PCA or tree-UMAP plots can be created to visualize
variability between trees with respect to distance measures indicated
via distance_measures
, here the patristic-correlation. If this
function exceeds available R cache resources, try to create new tree
PNGs with lower resolution (see above
tree_list_pngs(..., im_res = 50)
).
tree_plots <- plot_tree_UMAP(obj = obj,
tree_list = tree_list,
png_list = png_list,
distance_measures = "patristic_correlation",
show_plot = TRUE,
title = "patristic correlation tree distances")
#> [1] "patristic_correlation finished."
Finally, trees can be combined into a single-cell network (sc-network) by transforming trees into nearest neighbor tree-graphs (nn-tree-graphs) and computing their union.
obj <- make_knn_networks(obj = obj,
edge_length_representations = "min",
tree_list = tree_list,
k_dist = 4) # max_branch
network_all <- obj@misc$cellREST_igraph_obj
network_all
#> IGRAPH 888fab2 UNW- 314 1404 --
#> + attr: name (v/c), weight (e/n), min_length (e/n), min_branch_count
#> | (e/n)
#> + edges from 888fab2 (vertex names):
#> [1] cell100 --cell1103 cell100 --cell1559 cell100 --cell169 cell100 --cell191
#> [5] cell1000--cell1630 cell1000--cell1765 cell1000--cell1864 cell1000--cell240
#> [9] cell1000--cell308 cell1000--cell486 cell1000--cell505 cell1000--cell515
#> [13] cell1000--cell622 cell1000--cell667 cell1011--cell1219 cell1011--cell1336
#> [17] cell1011--cell1460 cell1011--cell1643 cell1011--cell1969 cell1011--cell217
#> [21] cell1011--cell398 cell1011--cell544 cell1012--cell1018 cell1012--cell1854
#> [25] cell1012--cell260 cell1012--cell494 cell1012--cell571 cell1012--cell636
#> + ... omitted several edges
For visualization of the network, we can either use already present
dimensionality reductions (see parameter umap_key
). Here, however, we use
CellREST distances to compute a sc-network informed UMAP:
obj <- embed_network(obj = obj,
network = network_all,
umap_key = "network_umap",
edge_length_attr = "min_length") # CellREST distance
distmat_graph <- obj@misc$cellREST_distmat
plot_network(obj = obj,
network = network_all,
umap_key = "network_umap",
edge_width_factor = 3,
col_by = "sim_time_scaled")
In this visualization we can observe network edges with longer lengths. For a more detailed analysis, we visualize the distribution of logarithmized edge lengths and edge weights. For the former, we can fit a normal distribution to determine a threshold to detect potential outliers:
network <- prune_network(network = network_all,
quantile = 0.9999,
pt.size = 2,
pt.alpha = .5)
Four edge lengths are larger than the theoretical 99.99% quantile of the fitted normal distribution. Let’s visualize, where those outlier edges are located in the UMAP:
edge.col_red <- (paste0(as_edgelist(network_all)[,1], as_edgelist(network_all)[,2]) %in%
paste0(as_edgelist(network)[,1], as_edgelist(network)[,2]))
edge.col_red <- c("darkred", "grey")[as.numeric(edge.col_red)+1]
obj <- embed_network(obj = obj,
network = network_all,
edge_length_attr = "min_length",
umap_key = "network_umap_min_length")
plot_network(obj = obj,
network = network_all,
umap_key = "network_umap_min_length",
col_by = "sim_time_scaled",
edge_width_factor = NULL,
edge_width_attr = 3,
edge.color = edge.col_red)
Now, the pruned network after deleting the four outlier edges can be plotted:
obj <- embed_network(obj = obj,
network = network,
edge_length_attr = "min_length",
umap_key = "network_umap_min_length")
plot_network(obj = obj,
network = network,
umap_key = "network_umap_min_length",
col_by = "sim_time_scaled",
edge_width_attr = "weight",
edge_width_factor = 3)
The CellREST distance informed UMAP embedding does not change, since removing the outlier edges had no effect on the CellREST distances.
We can also use other edge attributes to compute a shortest path distance, e.g., the minimal number of branches connecting two cells. This distance can be used for a UMAP embedding as following:
obj <- embed_network(obj = obj,
network = network,
umap_key = "network_umap_branch_count",
edge_length_attr = "min_branch_count")
plot_network(obj = obj,
network = network,
umap_key = "network_umap_branch_count",
col_by = "sim_time_scaled",
edge_width_attr = "weight",
edge_width_factor = 3)
Cannoodt, Robrecht, Wouter Saelens, Louise Deconinck, and Yvan Saeys. 2021. “Spearheading Future Omics Analyses Using Dyngen, a Multi-Modal Simulator of Single Cells.” Nature Communications 12 (1): 3942. https://doi.org/10.1038/s41467-021-24152-2.
Hao, Yuhan, Tim Stuart, Madeline H. Kowalski, Saket Choudhary, Paul Hoffman, Austin Hartman, Avi Srivastava, et al. 2024. “Dictionary Learning for Integrative, Multimodal and Scalable Single-Cell Analysis.” Nature Biotechnology 42 (2): 293–304. https://doi.org/10.1038/s41587-023-01767-y.
Linderman, George C., Jun Zhao, Manolis Roulis, Piotr Bielecki, Richard A. Flavell, Boaz Nadler, and Yuval Kluger. 2022. “Zero-Preserving Imputation of Single-Cell RNA-Seq Data.” Nature Communications 13 (1): 192. https://doi.org/10.1038/s41467-021-27729-z.
Minh, Bui Quang, Heiko A Schmidt, Olga Chernomor, Dominik Schrempf, Michael D Woodhams, Arndt von Haeseler, and Robert Lanfear. 2020. “IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era.” Molecular Biology and Evolution 37 (5): 1530–34. https://doi.org/10.1093/molbev/msaa015.