Commit 7cd03b6

Merge branch 'feature-cell-ranger'
2 parents 4722e55 + 592ac4d commit 7cd03b6

136 files changed: +491094 -608 lines changed

.gitattributes

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+/tests/data/* linguist-generated=true

.github/workflows/build.yml

Lines changed: 7 additions & 11 deletions
@@ -7,20 +7,16 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       max-parallel: 5
+    defaults:
+      run:
+        shell: bash -el {0}
 
     steps:
     - uses: actions/checkout@v4
-    - name: Set up Python 3.10
-      uses: actions/setup-python@v5
+    - uses: conda-incubator/setup-miniconda@v3
       with:
-        python-version: '3.9'
-    - name: Add conda to system path
-      run: |
-        # $CONDA is an environment variable pointing to the root of the miniconda directory
-        echo $CONDA/bin >> $GITHUB_PATH
-    - name: Setup Conda environment
-      run: |
-        conda env update --file environment.yml --name base
+        activate-environment: rnaseq-pipeline
+        environment-file: environment.yml
     - name: Install package
       run: |
         pip install .[gsheet,webviewer]
@@ -29,7 +25,7 @@ jobs:
         make -C scripts
     - name: Configure the pipeline
       run: |
-        cp example.luigi.cfg luigi.cfg
+        cp tests/luigi.cfg ./
     - name: Test with pytest
       run: |
         conda install pytest

Makefile

Lines changed: 2 additions & 2 deletions
@@ -16,7 +16,7 @@ install: install-python install-systemd-units install-RSEM install-scripts insta
 
 install-fish-completion:
 	mkdir -p "${DESTDIR}/etc/fish/completions"
-	install data/luigi.fish "${DESTDIR}/etc/fish/completions/"
+	install -m644 data/luigi.fish "${DESTDIR}/etc/fish/completions/"
 
 install-scripts:
 	$(MAKE) -C scripts install
@@ -26,7 +26,7 @@ install-python:
 
 install-systemd-units:
 	mkdir -p "${DESTDIR}/etc/systemd/system/"
-	install data/systemd/*.{service,timer,target} "${DESTDIR}/etc/systemd/system/"
+	install -m644 data/systemd/*.{service,timer} "${DESTDIR}/etc/systemd/system/"
 	@echo "Remember to run 'systemctl override rnaseq-pipeline-viewer' and 'systemctl override rnaseq-pipeline-worker@' and set CONDA_BIN, CONDA_ENV, GEMMA_USERNAME and GEMMA_PASSWORD environment variables."
 
 install-RSEM:

README.md

Lines changed: 31 additions & 9 deletions
@@ -71,8 +71,7 @@ your tasks at http://localhost:8082/.
 luigid
 ```
 
-For convenience, we provide a `luigi-wrapper` script that sets the `--module`
-flag to `rnaseq_pipeline.tasks` for you.
+For convenience, we provide a `rnaseq-pipeline-cli` tool to run high-level tasks:
 
 ```bash
 luigi-wrapper <task> <task_args>
@@ -102,13 +101,16 @@ The output is organized as follow:
 
 ```
 pipeline-output/
-    genomes/<reference_id>/                 # Genomic references
-    references/<reference_id>/              # RSEM/STAR indexes
-    data/<source>                           # FASTQs (note that GEO source uses SRA)
-    data-qc/<experiment_id>/<sample_id>/    # FastQC reports
-    aligned/<reference_id>/<experiment_id>/ # alignments and quantification results
-    quantified/<reference_id>               # quantification matrices for isoforms and genes
-    report/<reference_id>/<experiment_id>/  # MultiQC reports for reads and alignments
+    genomes/<reference_id>/                       # Genomic references
+    references/<reference_id>/                    # RSEM/STAR indexes
+    references-single-cell/<reference_id>/        # Cell Ranger references
+    data/<source>/                                # FASTQs (organization is source-specific; note that GEO source uses SRA)
+    data-qc/<experiment_id>/<sample_id>/          # FastQC reports
+    data-single-cell/<experiment_id>/<sample_id>/ # Single-cell data (hard links to files from data/)
+    aligned/<reference_id>/<experiment_id>/       # alignments and quantification results
+    quantified/<reference_id>                     # quantification matrices for isoforms and genes
+    quantified-single-cell/<reference_id>         # quantified single-cell data (Cell Ranger outputs)
+    report/<reference_id>/<experiment_id>/        # MultiQC reports for reads and alignments
 ```
 
 You can adjust the pipeline output directory by setting `OUTPUT_DIR` under
@@ -146,6 +148,26 @@ pip install .[webviewer]
 gunicorn rnaseq_pipeline.viewer:app
 ```
 
+## Tools Wrappers
+
+A few wrappers are provided to make some tools run more efficiently. For this to work, you have to configure Bioluigi to
+use the wrappers instead of the actual tools.
+
+Examples of behaviors:
+
+- copy the reference directory to a local scratch directory (Cell Ranger & RSEM)
+- preload genome reference in shared memory and release unused ones (RSEM only)
+
+```ini
+[bioluigi]
+cellranger_bin=rnaseq-pipeline-cellranger
+rsem_calculate_expression_bin=rnaseq-pipeline-rsem-calculate-expression
+
+[rnaseq_pipeline.wrapped_tools]
+cellranger_bin=/absolute/path/to/cell/ranger/bin
+rsem_calculate_expression_bin=/absolute/path/to/rsem
+```
+
 ## Gemma integration
 
 The RNA-Seq pipeline is capable of communicating with Gemma using its [RESTful API](https://gemma.msl.ubc.ca/resources/restapidocs/).

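The first wrapper behavior listed in the README hunk above (copying the reference directory to a local scratch directory) could look roughly like the sketch below. This is not the code in `rnaseq_pipeline.wrapped_tools`, which is not shown in this diff; the `stage_reference` helper and the scratch layout are hypothetical names chosen for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def stage_reference(reference_dir, scratch_dir):
    """Copy reference_dir under scratch_dir once and return the local copy.

    Hypothetical helper: the real wrappers may stage references differently.
    """
    reference_dir = Path(reference_dir)
    scratch_dir = Path(scratch_dir)
    scratch_dir.mkdir(parents=True, exist_ok=True)
    local_copy = scratch_dir / reference_dir.name
    if not local_copy.exists():
        # Copy into a temporary sibling first, then rename it into place, so
        # an interrupted copy never leaves a half-populated reference behind.
        partial = Path(tempfile.mkdtemp(prefix=reference_dir.name + '.', dir=scratch_dir))
        shutil.copytree(reference_dir, partial, dirs_exist_ok=True)
        partial.rename(local_copy)
    return local_copy
```

A wrapper would then pass the returned path to the real tool in place of the shared reference path. A second call with the same arguments reuses the existing copy rather than copying again.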
environment.yml

Lines changed: 5 additions & 3 deletions
@@ -4,13 +4,15 @@ channels:
   - bioconda
   - nodefaults
 dependencies:
-  - python=3.10
+  - python=3.12
   - pip
-  - cutadapt==4.8
-  - multiqc==1.29
+  - cutadapt==4.9
+  - multiqc==1.32
   - polars-lts-cpu # for our older servers that lack support for AVX2
   - sra-tools
   - fastqc==0.12.1
   - star==2.7.3a
   - entrez-direct
   - perl # rsem expects this
+  - samtools
+  - curl

example.luigi.cfg

Lines changed: 22 additions & 2 deletions
@@ -27,9 +27,12 @@ submit_data_jobs=1
 submit_batch_info_jobs=2
 
 [bioluigi]
-scheduler=slurm
+scheduler=local
 scheduler_partition=
 scheduler_extra_args=[]
+# Default tools, override as needed
+#cutadapt_bin=cutadapt
+#cell_ranger_bin=cellranger
 
 #
 # This section contains the necessary variables for the pipeline execution
@@ -40,19 +43,33 @@ scheduler_extra_args=[]
 OUTPUT_DIR=pipeline-output
 GENOMES=genomes
 REFERENCES=references
+SINGLE_CELL_REFERENCES=references-single-cell
 METADATA=metadata
 DATA=data
 DATAQCDIR=data-qc
 ALIGNDIR=aligned
-ALIGNQCDIR=aligned-qc
 QUANTDIR=quantified
 BATCHINFODIR=batch-info
 
 # RSEM
 RSEM_DIR=contrib/RSEM
+rsem_calculate_expression_bin=contrib/RSEM/rsem-calculate-expression
 
 SLACK_WEBHOOK_URL=
 
+[rnaseq_pipeline.wrapped_tools]
+rsem_calculate_expression_bin=rsem-calculate-expression
+cellranger_bin=cellranger
+
+[rnaseq_pipeline.sources.sra]
+# location where tools like prefetch and fastq-dump will store downloaded SRA files
+# you can get this value with vdb-config -p
+ncbi_public_dir=/cosmos/scratch/ncbi/public
+samtools_bin=samtools
+bamtofastq_bin=bamtofastq
+# location where BAM headers downloaded from SRA will be cached
+bam_headers_cache_dir=bam_headers
+
 [rnaseq_pipeline.gemma]
 cli_bin=gemma-cli
 # values for $JAVA_HOME and $JAVA_OPTS environment variables
@@ -63,3 +80,6 @@ appdata_dir=/space/gemmaData
 human_reference_id=hg38_ncbi
 mouse_reference_id=mm10_ncbi
 rat_reference_id=rn7_ncbi
+human_single_cell_reference_id=refdata-gex-GRCh38-2024-A
+mouse_single_cell_reference_id=refdata-gex-GRCm39-2024-A
+rat_single_cell_reference_id=refdata-gex-mRatBN7-2-2024-A

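The new `[rnaseq_pipeline.wrapped_tools]` section names the real binaries that the wrappers delegate to. As a rough sketch of how such a lookup with a default could work, the snippet below reads the section with the standard library `configparser`. This is a stand-in, not how Luigi or Bioluigi actually load their configuration, and `wrapped_tool_path` is a hypothetical helper.

```python
import configparser

# Config text modelled on the README example from this commit.
EXAMPLE_CFG = """
[bioluigi]
cellranger_bin=rnaseq-pipeline-cellranger

[rnaseq_pipeline.wrapped_tools]
cellranger_bin=/absolute/path/to/cell/ranger/bin
rsem_calculate_expression_bin=/absolute/path/to/rsem
"""


def wrapped_tool_path(cfg_text, option, default):
    """Return the configured path of a wrapped tool, or default if unset."""
    # Interpolation is disabled so that '%' in paths is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return parser.get('rnaseq_pipeline.wrapped_tools', option, fallback=default)
```

With `EXAMPLE_CFG`, looking up `cellranger_bin` yields the absolute path from the `wrapped_tools` section; with an empty configuration the supplied default is returned instead.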
pyproject.toml

Lines changed: 33 additions & 0 deletions
@@ -1,3 +1,36 @@
 [build-system]
 requires = ["setuptools"]
 build-backend = "setuptools.build_meta"
+
+[project]
+name = "rnaseq-pipeline"
+version = "2.1.12"
+description = "RNA-Seq pipeline for the Pavlidis Lab"
+authors = [
+    {name = "Guillaume Poirier-Morency", email = "poirigui@msl.ubc.ca"}
+]
+readme = "README.md"
+license = "Unlicense"
+license-files = ["LICENSE"]
+requires-python = "==3.12.*"
+dependencies = ['luigi', 'python-daemon<3.0.0',
+    'bioluigi@git+https://github.com/PavlidisLab/bioluigi@master',
+    'requests', 'pandas']
+
+[project.optional-dependencies]
+gsheet = ['google-api-python-client', 'google-auth-httplib2', 'google-auth-oauthlib', 'pyxdg']
+webviewer = ['Flask', 'gunicorn']
+
+[dependency-groups]
+dev = ["pytest", "mypy"]
+
+[project.scripts]
+rnaseq-pipeline-cli = "rnaseq_pipeline.cli:main"
+rnaseq-pipeline-cellranger = "rnaseq_pipeline.wrapped_tools:cellranger_wrapper"
+rnaseq-pipeline-rsem-calculate-expression = "rnaseq_pipeline.wrapped_tools:rsem_calculate_expression_wrapper"
+
+[tool.setuptools]
+packages = ["rnaseq_pipeline", "rnaseq_pipeline.sources", "rnaseq_pipeline.webviewer"]
+
+[tool.mypy]
+plugins = ["luigi.mypy"]

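Each `[project.scripts]` entry maps a command name to a `module:function` reference that the installer turns into an executable. The sketch below shows, in simplified form, how such a reference resolves to a callable. `load_entry_point` is a hypothetical helper, not the installer's actual code, and the usage resolves a standard-library function because the package modules are not importable outside an installed environment.

```python
import importlib


def load_entry_point(reference):
    """Resolve a 'module:attribute' reference to the object it names."""
    module_name, _, attribute_path = reference.partition(':')
    target = importlib.import_module(module_name)
    # The attribute part may be dotted (e.g. 'Class.method'), so walk it.
    for name in attribute_path.split('.'):
        target = getattr(target, name)
    return target


# Stand-in usage: 'json:dumps' plays the role of 'rnaseq_pipeline.cli:main'.
dumps = load_entry_point('json:dumps')
```

In an environment where the package is installed, `rnaseq-pipeline-cli` would resolve `rnaseq_pipeline.cli:main` in a comparable way and call it.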
pytest.ini

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+[pytest]
+testpaths=tests
+log_cli=1
+log_cli_level=info

rnaseq_pipeline/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
 import luigi
 
 luigi.auto_namespace(scope=__name__)
+
+from rnaseq_pipeline.tasks import *
+from rnaseq_pipeline.sources.sra import *
+

rnaseq_pipeline/cli.py

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
+import argparse
+import sys
+import os
+from contextlib import contextmanager
+
+import luigi
+import luigi.cmdline
+
+from rnaseq_pipeline.tasks import SubmitExperimentToGemma, SubmitExperimentsFromGoogleSpreadsheetToGemma, \
+    SubmitExperimentBatchInfoToGemma
+
+@contextmanager
+def umask(umask):
+    print(f'Setting umask to 0x{umask:03o}')
+    prev_umask = os.umask(umask)
+    try:
+        yield None
+    finally:
+        print(f'Restoring umask to 0x{prev_umask:03o}')
+        os.umask(prev_umask)
+
+def parse_octal(s):
+    return int(s, 8)
+
+def run_luigi_task(task, args):
+    with umask(args.umask):
+        luigi.build([task], workers=args.workers, detailed_summary=True, local_scheduler=args.local_scheduler)
+
+def run(args):
+    with umask(0o002):
+        luigi.run(args)
+
+def submit_experiment(argv):
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--experiment-id', required=True, help='Experiment ID to submit to Gemma')
+    parser.add_argument('--rerun', action='store_true', default=False, help='Rerun the experiment')
+    parser.add_argument('--priority', type=int, default=100)
+    parser.add_argument('--umask', type=parse_octal, default='002',
+                        help='Set a umask (defaults to 002 to make created files group-writable)')
+    parser.add_argument('--workers', type=int, default=30, help='Number of workers to use (defaults to 30)')
+    parser.add_argument('--local-scheduler', action='store_true', default=False)
+    args = parser.parse_args(argv)
+    run_luigi_task(SubmitExperimentToGemma(experiment_id=args.experiment_id, rerun=args.rerun, priority=args.priority),
+                   args)
+
+def submit_experiment_batch_info(argv):
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--experiment-id', required=True, help='Experiment ID to submit to Gemma')
+    parser.add_argument('--ignored-samples', nargs='+', default=[])
+    parser.add_argument('--rerun', action='store_true', default=False, help='Rerun the experiment')
+    parser.add_argument('--umask', type=parse_octal, default='002',
+                        help='Set a umask (defaults to 002 to make created files group-writable)')
+    parser.add_argument('--workers', type=int, default=30, help='Number of workers to use (defaults to 30)')
+    parser.add_argument('--local-scheduler', action='store_true', default=False)
+    args = parser.parse_args(argv)
+    print(args.ignored_samples)
+    run_luigi_task(
+        SubmitExperimentBatchInfoToGemma(experiment_id=args.experiment_id, ignored_samples=args.ignored_samples,
+                                         rerun=args.rerun), args)
+
+def submit_experiments_from_gsheet(argv):
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--spreadsheet-id', required=True, help='Spreadsheet ID')
+    parser.add_argument('--sheet-name', required=True, help='Sheet name')
+    parser.add_argument('--umask', type=parse_octal, default='002',
+                        help='Set a umask (defaults to 002 to make created files group-writable)')
+    parser.add_argument('--workers', type=int, default=200, help='Number of workers to use (defaults to 200)')
+    parser.add_argument('--ignore-priority', action='store_true', help='Ignore the priority column in the spreadsheet')
+    parser.add_argument('--local-scheduler', action='store_true', default=False)
+    args = parser.parse_args(argv)
+    run_luigi_task(SubmitExperimentsFromGoogleSpreadsheetToGemma(args.spreadsheet_id, args.sheet_name,
+                                                                 ignore_priority=args.ignore_priority), args)
+
+def main():
+    if len(sys.argv) < 2:
+        print('Usage: rnaseq-pipeline-cli <command>')
+        return 1
+    command = sys.argv[1]
+    if command == 'run':
+        return run(sys.argv[2:])
+    elif command == 'submit-experiment':
+        return submit_experiment(sys.argv[2:])
+    elif command == 'submit-experiment-batch-info':
+        return submit_experiment_batch_info(sys.argv[2:])
+    elif command == 'submit-experiments-from-gsheet':
+        return submit_experiments_from_gsheet(sys.argv[2:])
+    else:
+        print(
+            f'Unknown command {command}. Possible values are: submit-experiment, submit-experiment-batch-info, submit-experiments-from-gsheet.')
+        return 1

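The `--umask` option in `cli.py` defaults to `002` so that files created by the pipeline are group-writable. The standalone sketch below shows that effect on a POSIX filesystem. `temporary_umask` and `create_file` are illustrative names, not part of the commit. Note that the committed `umask()` helper prints the octal value with a `0x` prefix; `0o` would match the formatting.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_umask(mask):
    """Apply a process umask for the duration of the block, then restore it."""
    previous = os.umask(mask)
    try:
        yield previous
    finally:
        os.umask(previous)


def parse_octal(text):
    """Parse a string such as '002' as an octal number."""
    return int(text, 8)


def create_file(path):
    """Create path requesting rw-rw-rw- and return the resulting mode bits."""
    # The kernel clears every permission bit that is set in the umask.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)
    os.close(fd)
    return os.stat(path).st_mode & 0o777


workdir = tempfile.mkdtemp()
with temporary_umask(parse_octal('002')):
    shared_mode = create_file(os.path.join(workdir, 'shared'))  # rw-rw-r--
with temporary_umask(0o077):
    private_mode = create_file(os.path.join(workdir, 'private'))  # rw-------
```

The `try`/`finally` matters because the umask is process-wide state: without it, an exception raised while a task runs would leave the altered mask in effect for the rest of the process.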