A Compi RNA-Seq pipeline to perform differential expression using DElite and enrichment analysis using RCPA and pathfindR.
A Docker image is available for this pipeline in this Docker Hub repository. To run the pipeline locally, see the required dependencies.
- Using the Compi RNA-Seq pipeline image in Linux
 - Running the pipeline with sample data
 - Pipeline configuration
 - Team
 - Publications
 
To perform an analysis, users must first:
- Initialize a working directory with the files required by the pipeline.
 - Add the input data to be analyzed (FASTQ reads, genomes, configuration files, etc.).
 - Configure the pipeline parameters.
 
This section provides a comprehensive guide on how to perform these steps and describes the tools and scripts included in the pipeline image to do so easily.
To start a new analysis, the first step is to build the directory tree in your local file system. This directory tree will be referred to as the working directory, and its structure is recognized and used by the pipeline during the analysis.
To build the working directory, adapt the first line of the following code and run it:
WORKING_DIRECTORY=/path/to/the/working-directory
mkdir -p ${WORKING_DIRECTORY}
docker run --rm \
    -v ${WORKING_DIRECTORY}:${WORKING_DIRECTORY} \
    -u "$(id -u)":"$(id -g)" \
    --entrypoint=/bin/bash \
        singgroup/compi-rnaseq \
            init_working_dir.sh ${WORKING_DIRECTORY}
After running this command, the selected working directory should have the following structure:
├── compi.parameters
├── config
│   ├── contrasts.tsv
│   ├── pathfindR.csv
│   └── rcpa.txt
├── genes
├── genome
├── pipeline.png
├── README.txt
├── run.sh
├── run-trimmomatic-1.sh
├── run-trimmomatic-2.sh
└── samples
    └── metadata.tsv
Where:
- README.txt contains the next steps you need to follow to run the analysis.
- pipeline.png contains the pipeline graph.
- compi.parameters contains the paths and parameters needed for the analysis.
- run.sh is the script to run the analysis.
- samples is the folder where the input FASTQ files must be placed.
 - It must also contain a metadata.tsv file with the sample metadata (names and groups).
- genome is the folder where the input genome must be placed.
- genes is the folder where the input GTF annotation file must be placed.
- config is the folder where the input configuration files must be placed. It may contain:
 - An optional file called contrasts.tsv with the DEA contrasts to be performed (if not provided, the pipeline generates all combinations based on the information in the metadata.tsv file).
 - A file called pathfindR.csv indicating the gene sets for enrichment (KEGG, Reactome, BioCarta, GO-All, GO-BP, GO-CC, or GO-MF; all for Homo sapiens) and the protein-protein interaction network (Biogrid, STRING, GeneMania, IntAct, KEGG, or mmu_STRING) for the pathfindR analysis. It is a two-column CSV file where the first column is the gene set and the second is the protein-protein interaction network. Lines starting with # are skipped, and one pathfindR analysis is executed for each remaining line.
 - An optional file called rcpa.txt with additional parameters for RCPA.
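Before adding input data, it can help to confirm that the working directory matches the layout described above. The following is a minimal sketch of such a check; check_working_dir is a hypothetical helper that is not shipped with the pipeline image, and it only tests for the files and folders named in the tree.

```shell
# Sketch: verify that a working directory has the layout shown above.
# check_working_dir is a hypothetical helper, not part of the pipeline image.
# Usage: check_working_dir /path/to/the/working-directory
check_working_dir() (
    wd=$1
    status=0
    # Files listed in the directory tree created by init_working_dir.sh.
    for f in compi.parameters README.txt run.sh samples/metadata.tsv; do
        if [ ! -f "$wd/$f" ]; then
            echo "missing file: $f" >&2
            status=1
        fi
    done
    # Folders that receive the input data and the configuration files.
    for d in config genes genome samples; do
        if [ ! -d "$wd/$d" ]; then
            echo "missing directory: $d" >&2
            status=1
        fi
    done
    # The function body runs in a subshell, so exit sets its return status.
    exit "$status"
)
```

The function returns a non-zero status and reports each missing entry on standard error, so it can be chained with other commands using &&.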
 
It is possible to test the pipeline using our sample data available here or here.
Download any of the ZIP files and decompress it in your local file system. Edit the compi.parameters file to update the working_dir parameter so that it points to the path where you have the decompressed data.
Then, to execute the pipeline using Docker, run the following command, replacing /path/to/rna-seq-docker/data/ with the path where you have the decompressed data:
./run.sh /path/to/rna-seq-docker/data/compi.parameters
Pipeline results will be created in a directory called compi inside the main data directory.
The pipeline execution can be customized (e.g. setting the maximum number of parallel tasks, partial executions, and so on) by providing an additional parameter to the run.sh script. Below are some examples:
./run.sh /path/to/rna-seq-docker/data/compi.parameters "--single-task samtools --num-tasks 2"
./run.sh /path/to/rna-seq-docker/data/compi.parameters "--from prepare-deas --until add-mappings"
Analyses can be customized by changing the pipeline configuration in the Compi parameters file (i.e. compi.parameters).
Some tasks (DElite, pathfindR, and RCPA) require or accept additional configuration files placed at the config directory of the working directory.
This subsection explains the main configuration parameters to customize the pipeline execution and include optional tasks.
By default, the software used for counting reads is featureCounts. This is because featurecounts is the default value of the counts_method parameter. To change it, include counts_method=htseq in your Compi parameters file.
Qualimap may require more RAM than the default setting allows. To increase it, include the qualimap_additional_args parameter in the Compi parameters file with the value --java-mem-size=2G.
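Settings such as counts_method and qualimap_additional_args can be added to the Compi parameters file by hand. For scripted setups, the sketch below shows one possible way to do it, assuming the file holds one key=value assignment per line; set_param is a hypothetical helper, not a tool provided by the pipeline.

```shell
# Sketch: set a key=value entry in a Compi parameters file.
# set_param is a hypothetical helper; it assumes one key=value line per parameter.
# Usage: set_param /path/to/compi.parameters counts_method htseq
set_param() (
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp)
    # Keep every line except a previous assignment of the same key.
    if [ -f "$file" ]; then
        grep -v "^${key}=" "$file" > "$tmp" || true
    fi
    # Append the new assignment at the end of the file.
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Write back through a redirect so the original file permissions are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
)
```

Calling the helper twice with the same key leaves a single line for that key, so repeated runs do not accumulate duplicate entries.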
By default, the delite task runs DElite on each possible group/condition combination based on the information in the metadata.tsv file.
Optionally, users may provide a file called config/contrasts.tsv with the DEA contrasts that must be performed. Example:
reference	comparison
A	B
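When several contrasts are needed, the file can be generated rather than typed. The sketch below writes the tab-separated layout shown in the example; write_contrasts and its reference:comparison argument syntax are hypothetical conveniences, not part of the pipeline.

```shell
# Sketch: write a contrasts file from "reference:comparison" pairs.
# write_contrasts and the pair syntax are hypothetical; only the tab-separated
# layout with the reference/comparison header comes from the example above.
# Usage: write_contrasts config/contrasts.tsv A:B A:C
write_contrasts() (
    out=$1
    shift
    printf 'reference\tcomparison\n' > "$out"
    for pair in "$@"; do
        # Split "A:B" into the reference (A) and the comparison (B) group.
        printf '%s\t%s\n' "${pair%%:*}" "${pair#*:}" >> "$out"
    done
)
```

The group names passed to the helper are expected to match the groups declared in metadata.tsv.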
Additional DElite parameters may be specified in the delite_additional_args Compi parameter. Users are encouraged to consult the official DElite documentation for further information about its parameters and default values. To set additional parameters, include delite_additional_args in the Compi parameters file as follows:
delite_additional_args=lowcounts=var var=0.3 combine=bonferroni
Both the pathfindr and rcpa tasks use the DEA results produced by DElite as inputs. Each DElite execution creates a directory at working_dir/dea/A_B/DElite_<timestamp> with a specific timestamp. This guarantees that, if DElite is run several times, previous results are not overwritten.
By default, the pathfindr and rcpa tasks will take the most recent directory. To force the pipeline to run on a specific DElite results directory, include pathway_delite_folder=<DElite_folder> in the Compi parameters file to set the specific directory that must be used.
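To see which DElite results directory would be considered the most recent for a contrast, the directories can be listed and compared. The sketch below is only an illustration: latest_delite_dir is a hypothetical helper, and it assumes that the timestamp in the directory name sorts chronologically when compared as text. Whether pathway_delite_folder expects the directory name or a full path is not stated here, so the sketch prints only the name.

```shell
# Sketch: print the name of the most recent DElite_<timestamp> directory.
# latest_delite_dir is a hypothetical helper. Assumption: timestamps in the
# directory names sort chronologically when compared as plain text.
# Usage: latest_delite_dir "$WORKING_DIRECTORY/dea/A_B"
latest_delite_dir() (
    contrast_dir=$1
    latest=""
    # Glob expansion is sorted, so the last directory seen sorts highest.
    for d in "$contrast_dir"/DElite_*; do
        [ -d "$d" ] || continue
        latest=$d
    done
    # Fail when the contrast directory holds no DElite results.
    [ -n "$latest" ] || exit 1
    basename "$latest"
)
```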
In addition, both tasks require a single DEA results file as input, and default files are different in each case:
- pathfindr uses DEGs_filtered_DElite_lancaster. This means that pathfindR analyzes the filtered file resulting from integrating DEA results with the Lancaster method.
- rcpa uses DEGs_unfiltered_DElite_lancaster. This means that RCPA analyzes the unfiltered file (i.e. it contains all genes) resulting from integrating DEA results with the Lancaster method.
To change this default behaviour, include pathfindr_delite_file_prefix=<prefix> or rcpa_delite_file_prefix=<prefix> in the Compi parameters file. Example:
pathfindr_delite_file_prefix=DEGs_filtered_edgeR
rcpa_delite_file_prefix=DEGs_unfiltered_DESeq2
Note that unfiltered files are recommended for RCPA because of the way most of its methods work. Using filtered files with RCPA may cause some methods (e.g. ORA) to fail.
The pathfindr task requires that users provide a file called config/pathfindr.csv in the working directory. This file specifies which databases and protein interaction networks must be used by pathfindR. Lines starting with # are omitted. The pathfindr task runs one analysis for each DEA contrast and each remaining line in this file. Example:
KEGG,Biogrid
KEGG,STRING
#Reactome,Biogrid
#BioCarta,Biogrid
#GO-All,Biogrid
#GO-BP,Biogrid
#GO-CC,Biogrid
#GO-MF,Biogrid
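Since each active line triggers one analysis per contrast, it can be useful to preview which combinations a file enables before launching the pipeline. The following sketch does that; list_pathfindr_runs is a hypothetical helper, and treating empty lines as skipped is an assumption of this sketch rather than documented pipeline behaviour.

```shell
# Sketch: list the gene set / network pairs enabled in a pathfindR config file.
# list_pathfindr_runs is a hypothetical helper. Lines starting with '#' are
# skipped as described above; skipping empty lines is an assumption.
# Usage: list_pathfindr_runs /path/to/the/pathfindR/config/file
list_pathfindr_runs() {
    grep -v -e '^#' -e '^[[:space:]]*$' "$1" |
        while IFS=, read -r gene_set network; do
            printf 'gene set: %s, network: %s\n' "$gene_set" "$network"
        done
}
```

Multiplying the number of printed lines by the number of DEA contrasts gives the total number of pathfindR analyses to expect.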
The pipeline has three parameters for the rcpa task that can be provided in the Compi parameters file:
- rcpa_database (defaults to KEGG): The database to be used for the pathway analysis. Options: KEGG or GO.
- rcpa_geneset_analysis_methods (defaults to fgsea,gsa,ora,ks,wilcox): The methods to be used for the gene set analysis. Several methods can be specified as a comma-separated list. Options: fgsea, gsa, ora, ks, wilcox.
- rcpa_pathway_analysis_methods (defaults to spia,cepaORA,cepaGSA): The methods to be used for the pathway analysis. Several methods can be specified as a comma-separated list. Options: spia, cepaORA, cepaGSA.
Additional configuration parameters for each method may be provided by users in a file called config/rcpa.txt in the working directory. This file must include one line for each parameter in the following format: <method_name>.<parameter_name>=<parameter_value>. Example:
ora.pThreshold=0.01
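A malformed line in this file is easy to miss, so a quick format check may be worthwhile. The sketch below tests each line against the <method_name>.<parameter_name>=<parameter_value> pattern; check_rcpa_config is a hypothetical helper, the character classes in the pattern are a guess at what names look like, and ignoring empty lines is an assumption.

```shell
# Sketch: check that every line of an RCPA config file looks like
# <method_name>.<parameter_name>=<parameter_value>.
# check_rcpa_config is a hypothetical helper; the allowed characters and the
# tolerance for empty lines are assumptions of this sketch.
# Usage: check_rcpa_config config/rcpa.txt
check_rcpa_config() {
    [ -f "$1" ] || return 1
    # grep -v prints (with line numbers) the lines matching neither pattern,
    # and succeeds only when it finds at least one such line.
    if grep -n -v -E -e '^[A-Za-z0-9_]+\.[A-Za-z0-9_.]+=.+$' \
                     -e '^[[:space:]]*$' "$1" >&2; then
        return 1
    fi
    return 0
}
```

The check only looks at the shape of each line; it does not know which parameters a given RCPA method actually accepts.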
Users are encouraged to have a look at the official RCPA package documentation as well as at this comprehensive paper at Current Protocols about all its functionalities.
By default, Trimmomatic tasks (trimmomatic, trimmomatic-fastqc, and trimmomatic-move) are skipped. To enable them, include enable_trimmomatic in your Compi parameters file.
The Trimmomatic configuration is specified using the trimmomatic_parameters parameter, whose default value is ILLUMINACLIP:/Trimmomatic/adapters/TruSeq3-PE-2.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:36. This string must contain the parameters to pass to Trimmomatic as a space-separated string. Look at the Trimmomatic manual for more information about this.
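Because the whole configuration is one space-separated string, it can be hard to read at a glance. The sketch below splits such a string into its individual steps; list_trimming_steps is a hypothetical helper used only for inspection and does not run Trimmomatic.

```shell
# Sketch: split a trimmomatic_parameters value into its individual steps.
# list_trimming_steps is a hypothetical helper for inspection only.
# Usage: list_trimming_steps "SLIDINGWINDOW:4:15 MINLEN:36"
list_trimming_steps() (
    # Word splitting on the unquoted argument is intentional: each
    # space-separated token is one Trimmomatic step.
    for step in $1; do
        # The text before the first ':' is the step name, the rest its settings.
        printf '%s -> %s\n' "${step%%:*}" "${step#*:}"
    done
)
```

For the default value, this prints one line each for the ILLUMINACLIP, SLIDINGWINDOW, and MINLEN steps.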
Typically, users first run all tasks up to trimmomatic-fastqc, possibly several times, until a suitable trimming configuration is found. They can then continue the pipeline execution to run the remaining tasks. To facilitate this, the working directory created with init_working_dir.sh, as well as the sample datasets, includes two scripts: run-trimmomatic-1.sh and run-trimmomatic-2.sh.
By default, batch correction tasks (batch-correction-all, pca-batch-correction) are skipped. To enable them, include the batch_correction parameter in your Compi parameters file to specify the factor (column in metadata) for batch correction (or interaction to use a combination of all of them). Its default value is none, meaning that no batch correction is applied.
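Since batch_correction names a metadata column, a quick check that the column exists can prevent a failed run. The sketch below assumes that metadata.tsv starts with a tab-separated header row, which is not confirmed here; has_metadata_column is a hypothetical helper, and the special values none and interaction are not column names, so they should not be passed to it.

```shell
# Sketch: test whether a factor is a column of the metadata file.
# has_metadata_column is a hypothetical helper. Assumption: the first line of
# metadata.tsv is a tab-separated header row.
# Usage: has_metadata_column samples/metadata.tsv <factor>
has_metadata_column() (
    metadata=$1
    factor=$2
    # Put each header field on its own line and look for an exact match.
    head -n 1 "$metadata" | tr '\t' '\n' | grep -q -x -F -- "$factor"
)
```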
The Compi RNA-Seq pipeline is developed by the SING Research Group (Universidade de Vigo) and the Molecular Biology and Transcriptomics Unit (IRCCS Mondino Foundation).
- R. Di Gerlando; S. Gagliardi; H. López-Fernández (2025) A new Compi pipeline for RNA-Seq differential expression analysis. 19th International Conference on Practical Applications of Computational Biology & Bioinformatics: PACBB 2025. Lille, France. 25 - June
 
- H. López-Fernández; A. Blanco-Míguez; F. Fdez-Riverola; B. Sánchez; A. Lourenço (2019) DEWE: a novel tool for executing differential expression RNA-Seq workflows in biomedical research. Computers in Biology and Medicine. Volume 107, pp. 197-205. ISSN: 0010-4825
 
