Draft
45 commits
8a285bc
Use efetch directory with -id instead of esearch
arteymix Apr 28, 2025
4a35d2b
Use conda-incubator/setup-miniconda
arteymix Apr 28, 2025
6b21de9
Fix missing GemmaTaskMixin import
arteymix Sep 16, 2025
8ad0e47
Skip checking GemmaDatasetHasBatch since it requires credentials
arteymix Sep 16, 2025
960c2a3
Add support for single-cell RNA-Seq datasets
arteymix Jun 24, 2025
749eea6
Ignore SRA runs that do not contain transcriptomic RNA-Seq data
arteymix Sep 23, 2025
1d0932d
Parse the --readTypes option
arteymix Sep 23, 2025
bb11982
Improve and fix logging for extracting SRA metadata
arteymix Sep 24, 2025
7861dcc
Validate SRA metadata by reading it prior to writing it to disk
arteymix Sep 24, 2025
0035f85
Do not open the browser in Google OAuth flow
arteymix Sep 25, 2025
3e88020
Add support for 10x BAM submissions to SRA
arteymix Oct 5, 2025
819368a
Update Python to 3.12
arteymix Oct 9, 2025
9310e18
Improvements for local source
arteymix Oct 9, 2025
f4d0588
fixup! Add support for single-cell RNA-Seq datasets
arteymix Oct 9, 2025
46cd60e
Mark test data as generated
arteymix Oct 9, 2025
77f595b
Add missing test data file
arteymix Oct 9, 2025
1396137
Fix Makefile
arteymix Oct 9, 2025
29a3cfb
Replace luigi-wrapper with a simple CLI tool
arteymix Oct 9, 2025
cb32904
Skip fac-sorted dataset test since it's not public
arteymix Oct 9, 2025
f8df3d2
sra: Cache BAM headers
arteymix Oct 9, 2025
0757134
Delete organized single-cell data implement remove() to DownloadRunTa…
arteymix Oct 9, 2025
f4877ef
Fix double-printing of the task summary
arteymix Oct 9, 2025
c31b580
Use the new RNASEQ_PIPELINE_REPORT file type
arteymix Oct 9, 2025
1e25051
Add wrapped tools
arteymix Oct 16, 2025
3428a5d
Add a task to reorganize a split experiment
arteymix Oct 21, 2025
32139a5
Remove unused ALIGNQCDIR
arteymix Oct 21, 2025
5fb0325
Rename output files of bamtofastq not ending in '_001.fastq.gz'
arteymix Oct 21, 2025
5e8006a
Check if read_types is provided when detecting layout
arteymix Oct 21, 2025
8e3fbaa
Rename wrapped tools config section
arteymix Oct 22, 2025
cfd4a30
More work
arteymix Oct 26, 2025
ba16abc
Add missing test data
arteymix Oct 26, 2025
31f87bd
Make it possible to delete an entire run directory instead of individ…
arteymix Oct 27, 2025
bd43c52
sra: Include the SRA run identifier when dumping FASTQ files from a BAM
arteymix Oct 27, 2025
f84c3b1
Reduce the amount of configuration needed for the pipeline
arteymix Oct 27, 2025
04af7e2
gemma: Add targets for specific QTs existing and use those as target …
arteymix Oct 27, 2025
a3a7806
Update cutadapt and MultiQC
arteymix Oct 28, 2025
d8fa72c
Remove redundant task definition and add keyword parameters
arteymix Oct 28, 2025
c9f2945
Migrate to pyproject.toml
arteymix Oct 28, 2025
dd711b6
Move gsheet and webviewer in optional dependencies
arteymix Oct 30, 2025
6ba8539
Remove unused IlluminaFastqHeader and CheckAfterCompleteMixin
arteymix Oct 30, 2025
2e9afee
Add a chemistry option to AlignSingleCellSample
arteymix Oct 30, 2025
916e168
Fix types and imports in tasks.py
arteymix Oct 30, 2025
251b7b6
fixup! Remove unused IlluminaFastqHeader and CheckAfterCompleteMixin
arteymix Oct 30, 2025
d5e46ae
Fix incorrect logger usage in sra.py
arteymix Oct 30, 2025
dcadc15
Downgrade warning for no fastq-load.py options to info
arteymix Oct 30, 2025
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
/tests/data/* linguist-generated=true
18 changes: 7 additions & 11 deletions .github/workflows/build.yml
@@ -7,20 +7,16 @@ jobs:
runs-on: ubuntu-latest
strategy:
max-parallel: 5
defaults:
run:
shell: bash -el {0}

steps:
- uses: actions/checkout@v4
- name: Set up Python 3.10
uses: actions/setup-python@v5
- uses: conda-incubator/setup-miniconda@v3
with:
python-version: '3.9'
- name: Add conda to system path
run: |
# $CONDA is an environment variable pointing to the root of the miniconda directory
echo $CONDA/bin >> $GITHUB_PATH
- name: Setup Conda environment
run: |
conda env update --file environment.yml --name base
activate-environment: rnaseq-pipeline
environment-file: environment.yml
Member Author
This should be included immediately in the trunk.

- name: Install package
run: |
pip install .[gsheet,webviewer]
@@ -29,7 +25,7 @@ jobs:
make -C scripts
- name: Configure the pipeline
run: |
cp example.luigi.cfg luigi.cfg
cp tests/luigi.cfg ./
- name: Test with pytest
run: |
conda install pytest
4 changes: 2 additions & 2 deletions Makefile
@@ -16,7 +16,7 @@ install: install-python install-systemd-units install-RSEM install-scripts insta

install-fish-completion:
mkdir -p "${DESTDIR}/etc/fish/completions"
install data/luigi.fish "${DESTDIR}/etc/fish/completions/"
install -m644 data/luigi.fish "${DESTDIR}/etc/fish/completions/"

install-scripts:
$(MAKE) -C scripts install
@@ -26,7 +26,7 @@ install-python:

install-systemd-units:
mkdir -p "${DESTDIR}/etc/systemd/system/"
install data/systemd/*.{service,timer,target} "${DESTDIR}/etc/systemd/system/"
install -m644 data/systemd/*.{service,timer} "${DESTDIR}/etc/systemd/system/"
@echo "Remember to run 'systemctl edit rnaseq-pipeline-viewer' and 'systemctl edit rnaseq-pipeline-worker@' and set CONDA_BIN, CONDA_ENV, GEMMA_USERNAME and GEMMA_PASSWORD environment variables."

install-RSEM:
40 changes: 31 additions & 9 deletions README.md
@@ -71,8 +71,7 @@ your tasks at http://localhost:8082/.
luigid
```

For convenience, we provide a `luigi-wrapper` script that sets the `--module`
flag to `rnaseq_pipeline.tasks` for you.
For convenience, we provide a `rnaseq-pipeline-cli` tool to run high-level tasks:

```bash
rnaseq-pipeline-cli <command> <command_args>
@@ -102,13 +101,16 @@ The output is organized as follow:

```
pipeline-output/
genomes/<reference_id>/ # Genomic references
references/<reference_id>/ # RSEM/STAR indexes
data/<source> # FASTQs (note that GEO source uses SRA)
data-qc/<experiment_id>/<sample_id>/ # FastQC reports
aligned/<reference_id>/<experiment_id>/ # alignments and quantification results
quantified/<reference_id> # quantification matrices for isoforms and genes
report/<reference_id>/<experiment_id>/ # MultiQC reports for reads and alignments
genomes/<reference_id>/ # Genomic references
references/<reference_id>/ # RSEM/STAR indexes
references-single-cell/<reference_id>/ # Cell Ranger references
data/<source>/ # FASTQs (organization is source-specific; note that GEO source uses SRA)
data-qc/<experiment_id>/<sample_id>/ # FastQC reports
data-single-cell/<experiment_id>/<sample_id>/ # Single-cell data (hard links to files from data/)
aligned/<reference_id>/<experiment_id>/ # alignments and quantification results
quantified/<reference_id> # quantification matrices for isoforms and genes
quantified-single-cell/<reference_id> # quantified single-cell data (Cell Ranger outputs)
report/<reference_id>/<experiment_id>/ # MultiQC reports for reads and alignments
```
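
As a rough illustration of the layout above, the following sketch builds a few of the listed paths with `pathlib`. The helper name `output_paths` and the experiment and sample identifiers are hypothetical placeholders and not part of the pipeline; only the directory names come from the listing.

```python
from pathlib import Path


def output_paths(output_dir, reference_id, experiment_id, sample_id):
    """Return a few locations from the layout above (hypothetical helper)."""
    root = Path(output_dir)
    return {
        'genome': root / 'genomes' / reference_id,
        'fastqc': root / 'data-qc' / experiment_id / sample_id,
        'aligned': root / 'aligned' / reference_id / experiment_id,
        'single_cell_quant': root / 'quantified-single-cell' / reference_id,
        'report': root / 'report' / reference_id / experiment_id,
    }


# Placeholder identifiers, chosen only for the demonstration.
paths = output_paths('pipeline-output', 'hg38_ncbi', 'GSE00000', 'GSM0000000')
for name, path in paths.items():
    print(f'{name}: {path}')
```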

You can adjust the pipeline output directory by setting `OUTPUT_DIR` under
@@ -146,6 +148,26 @@ pip install .[webviewer]
gunicorn rnaseq_pipeline.viewer:app
```

## Tool Wrappers

A few wrappers are provided to make some tools run more efficiently. To use them, configure Bioluigi to invoke the
wrappers instead of the actual tools, and tell the wrappers where the actual tools are located.

Examples of wrapper behaviors:

- copy the reference directory to a local scratch directory (Cell Ranger and RSEM)
- preload the genome reference in shared memory and release unused ones (RSEM only)

```ini
[bioluigi]
cellranger_bin=rnaseq-pipeline-cellranger
rsem_calculate_expression_bin=rnaseq-pipeline-rsem-calculate-expression

[rnaseq_pipeline.wrapped_tools]
cellranger_bin=/absolute/path/to/cell/ranger/bin
rsem_calculate_expression_bin=/absolute/path/to/rsem
```
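
To make the first behavior more concrete, here is a minimal sketch of staging a reference directory on local scratch before running a tool. It is not the code in `rnaseq_pipeline.wrapped_tools`; the function name `stage_reference` and the directory names are assumptions made for illustration.

```python
import os
import shutil
import tempfile


def stage_reference(reference_dir, scratch_root):
    """Copy reference_dir under scratch_root once and return the local copy."""
    name = os.path.basename(os.path.normpath(reference_dir))
    local_copy = os.path.join(scratch_root, name)
    if not os.path.isdir(local_copy):
        shutil.copytree(reference_dir, local_copy)
    return local_copy


# Throwaway directories stand in for shared storage and local scratch.
with tempfile.TemporaryDirectory() as shared, tempfile.TemporaryDirectory() as scratch:
    reference = os.path.join(shared, 'hg38_ncbi')
    os.makedirs(reference)
    with open(os.path.join(reference, 'index.txt'), 'w') as f:
        f.write('placeholder index\n')
    staged = stage_reference(reference, scratch)
    staged_name = os.path.basename(staged)
    staged_files = sorted(os.listdir(staged))
    print(staged_name, staged_files)
```

A real wrapper would presumably then run the actual tool with the reference argument pointing at the staged copy; that step is left out here because its details depend on each tool's command line.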

## Gemma integration

The RNA-Seq pipeline is capable of communicating with Gemma using its [RESTful API](https://gemma.msl.ubc.ca/resources/restapidocs/).
8 changes: 5 additions & 3 deletions environment.yml
@@ -4,13 +4,15 @@ channels:
- bioconda
- nodefaults
dependencies:
- python=3.10
- python=3.12
- pip
- cutadapt==4.8
- multiqc==1.29
- cutadapt==4.9
- multiqc==1.32
- polars-lts-cpu # for our older servers that lack support for AVX2
- sra-tools
- fastqc==0.12.1
- star==2.7.3a
- entrez-direct
- perl # rsem expects this
- samtools
- curl
24 changes: 22 additions & 2 deletions example.luigi.cfg
@@ -27,9 +27,12 @@ submit_data_jobs=1
submit_batch_info_jobs=2

[bioluigi]
scheduler=slurm
scheduler=local
scheduler_partition=
scheduler_extra_args=[]
# Default tools, override as needed
#cutadapt_bin=cutadapt
#cell_ranger_bin=cellranger

#
# This section contains the necessary variables for the pipeline execution
@@ -40,19 +43,33 @@ scheduler_extra_args=[]
OUTPUT_DIR=pipeline-output
GENOMES=genomes
REFERENCES=references
SINGLE_CELL_REFERENCES=references-single-cell
METADATA=metadata
DATA=data
DATAQCDIR=data-qc
ALIGNDIR=aligned
ALIGNQCDIR=aligned-qc
QUANTDIR=quantified
BATCHINFODIR=batch-info

# RSEM
RSEM_DIR=contrib/RSEM
rsem_calculate_expression_bin=contrib/RSEM/rsem-calculate-expression

SLACK_WEBHOOK_URL=

[rnaseq_pipeline.wrapped_tools]
rsem_calculate_expression_bin=rsem-calculate-expression
cellranger_bin=cellranger

[rnaseq_pipeline.sources.sra]
# location where tools like prefetch and fastq-dump will store downloaded SRA files
# you can get this value with vdb-config -p
ncbi_public_dir=/cosmos/scratch/ncbi/public
Member Author
I've encountered issues with parsing the output of vdb-config, so this is a more robust solution overall.

samtools_bin=samtools
bamtofastq_bin=bamtofastq
# location where BAM headers downloaded from SRA will be cached
bam_headers_cache_dir=bam_headers

[rnaseq_pipeline.gemma]
cli_bin=gemma-cli
# values for $JAVA_HOME and $JAVA_OPTS environment variables
@@ -63,3 +80,6 @@ appdata_dir=/space/gemmaData
human_reference_id=hg38_ncbi
mouse_reference_id=mm10_ncbi
rat_reference_id=rn7_ncbi
human_single_cell_reference_id=refdata-gex-GRCh38-2024-A
mouse_single_cell_reference_id=refdata-gex-GRCm39-2024-A
rat_single_cell_reference_id=refdata-gex-mRatBN7-2-2024-A
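
For readers unfamiliar with the file format, the sketch below reads a trimmed copy of the sections above using only the standard-library `configparser`, not Luigi's own loader. Decoding `scheduler_extra_args` as JSON is an assumption about how the list value is interpreted.

```python
import configparser
import json

# Trimmed copy of the example configuration shown above.
EXAMPLE = """
[bioluigi]
scheduler=local
scheduler_partition=
scheduler_extra_args=[]

[rnaseq_pipeline.sources.sra]
ncbi_public_dir=/cosmos/scratch/ncbi/public
samtools_bin=samtools
bam_headers_cache_dir=bam_headers
"""

parser = configparser.ConfigParser()
parser.read_string(EXAMPLE)

scheduler = parser['bioluigi']['scheduler']
partition = parser['bioluigi']['scheduler_partition']  # empty value reads as ''
extra_args = json.loads(parser['bioluigi']['scheduler_extra_args'])  # assumed JSON list
sra = dict(parser['rnaseq_pipeline.sources.sra'])

print(scheduler, repr(partition), extra_args, sra['samtools_bin'])
```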
36 changes: 36 additions & 0 deletions pyproject.toml
@@ -0,0 +1,36 @@
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = "rnaseq-pipeline"
version = "2.1.12"
description = "RNA-Seq pipeline for the Pavlidis Lab"
authors = [
{name = "Guillaume Poirier-Morency", email = "poirigui@msl.ubc.ca"}
]
readme = "README.md"
license = "Unlicense"
license-files = ["LICENSE"]
requires-python = "==3.12.*"
dependencies = ['luigi', 'python-daemon<3.0.0',
'bioluigi@git+https://github.com/PavlidisLab/bioluigi@master',
'requests', 'pandas']

[project.optional-dependencies]
gsheet = ['google-api-python-client', 'google-auth-httplib2', 'google-auth-oauthlib', 'pyxdg']
webviewer = ['Flask', 'gunicorn']

[dependency-groups]
dev = ["pytest", "mypy"]

[project.scripts]
rnaseq-pipeline-cli = "rnaseq_pipeline.cli:main"
rnaseq-pipeline-cellranger = "rnaseq_pipeline.wrapped_tools:cellranger_wrapper"
rnaseq-pipeline-rsem-calculate-expression = "rnaseq_pipeline.wrapped_tools:rsem_calculate_expression_wrapper"

[tool.setuptools]
packages = ["rnaseq_pipeline", "rnaseq_pipeline.sources", "rnaseq_pipeline.webviewer"]

[tool.mypy]
plugins = ["luigi.mypy"]
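
The `[project.scripts]` entries above map a command name to a `module:attribute` string. The sketch below shows one way such a string can be resolved to a callable. The helper `resolve_entry_point` is hypothetical, and a standard-library target is used so the example runs without the pipeline installed.

```python
import importlib


def resolve_entry_point(spec):
    """Resolve a 'package.module:attribute' string to the object it names."""
    module_name, _, attr_path = spec.partition(':')
    obj = importlib.import_module(module_name)
    for attr in attr_path.split('.'):
        obj = getattr(obj, attr)
    return obj


# 'os.path:join' plays the role of 'rnaseq_pipeline.cli:main' here.
join = resolve_entry_point('os.path:join')
print(join('pipeline-output', 'genomes'))
```

Installer-generated console scripts typically pass the return value of the function to `sys.exit()`, which is why `main()` in `rnaseq_pipeline/cli.py` can signal failure with `return 1`.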
4 changes: 4 additions & 0 deletions pytest.ini
@@ -0,0 +1,4 @@
[pytest]
testpaths=tests
log_cli=1
log_cli_level=info
4 changes: 4 additions & 0 deletions rnaseq_pipeline/__init__.py
@@ -1,3 +1,7 @@
import luigi

luigi.auto_namespace(scope=__name__)

from rnaseq_pipeline.tasks import *
from rnaseq_pipeline.sources.sra import *

90 changes: 90 additions & 0 deletions rnaseq_pipeline/cli.py
@@ -0,0 +1,90 @@
import argparse
import sys
import os
from contextlib import contextmanager

import luigi
import luigi.cmdline

from rnaseq_pipeline.tasks import SubmitExperimentToGemma, SubmitExperimentsFromGoogleSpreadsheetToGemma, \
    SubmitExperimentBatchInfoToGemma

@contextmanager
def umask(umask):
    # the mask is printed in octal, hence the 0o prefix
    print(f'Setting umask to 0o{umask:03o}')
    prev_umask = os.umask(umask)
    try:
        yield None
    finally:
        print(f'Restoring umask to 0o{prev_umask:03o}')
        os.umask(prev_umask)

def parse_octal(s):
    return int(s, 8)

def run_luigi_task(task, args):
    with umask(args.umask):
        luigi.build([task], workers=args.workers, detailed_summary=True, local_scheduler=args.local_scheduler)

def run(args):
    with umask(0o002):
        luigi.run(args)

def submit_experiment(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument('--experiment-id', required=True, help='Experiment ID to submit to Gemma')
    parser.add_argument('--rerun', action='store_true', default=False, help='Rerun the experiment')
    parser.add_argument('--priority', type=int, default=100)
    parser.add_argument('--umask', type=parse_octal, default='002',
                        help='Set a umask (defaults to 002 to make created files group-writable)')
    parser.add_argument('--workers', type=int, default=30, help='Number of workers to use (defaults to 30)')
    parser.add_argument('--local-scheduler', action='store_true', default=False)
    args = parser.parse_args(argv)
    run_luigi_task(SubmitExperimentToGemma(experiment_id=args.experiment_id, rerun=args.rerun, priority=args.priority),
                   args)

def submit_experiment_batch_info(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument('--experiment-id', required=True, help='Experiment ID to submit to Gemma')
    parser.add_argument('--ignored-samples', nargs='+', default=[])
    parser.add_argument('--rerun', action='store_true', default=False, help='Rerun the experiment')
    parser.add_argument('--umask', type=parse_octal, default='002',
                        help='Set a umask (defaults to 002 to make created files group-writable)')
    parser.add_argument('--workers', type=int, default=30, help='Number of workers to use (defaults to 30)')
    parser.add_argument('--local-scheduler', action='store_true', default=False)
    args = parser.parse_args(argv)
    print(args.ignored_samples)
    run_luigi_task(
        SubmitExperimentBatchInfoToGemma(experiment_id=args.experiment_id, ignored_samples=args.ignored_samples,
                                         rerun=args.rerun), args)

def submit_experiments_from_gsheet(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument('--spreadsheet-id', required=True, help='Spreadsheet ID')
    parser.add_argument('--sheet-name', required=True, help='Sheet name')
    parser.add_argument('--umask', type=parse_octal, default='002',
                        help='Set a umask (defaults to 002 to make created files group-writable)')
    parser.add_argument('--workers', type=int, default=200, help='Number of workers to use (defaults to 200)')
    parser.add_argument('--ignore-priority', action='store_true', help='Ignore the priority column in the spreadsheet')
    parser.add_argument('--local-scheduler', action='store_true', default=False)
    args = parser.parse_args(argv)
    run_luigi_task(SubmitExperimentsFromGoogleSpreadsheetToGemma(args.spreadsheet_id, args.sheet_name,
                                                                 ignore_priority=args.ignore_priority), args)

def main():
    if len(sys.argv) < 2:
        print('Usage: rnaseq-pipeline-cli <command>')
        return 1
    command = sys.argv[1]
    if command == 'run':
        return run(sys.argv[2:])
    elif command == 'submit-experiment':
        return submit_experiment(sys.argv[2:])
    elif command == 'submit-experiment-batch-info':
        return submit_experiment_batch_info(sys.argv[2:])
    elif command == 'submit-experiments-from-gsheet':
        return submit_experiments_from_gsheet(sys.argv[2:])
    else:
        print(
            f'Unknown command {command}. Possible values are: run, submit-experiment, submit-experiment-batch-info, submit-experiments-from-gsheet.')
        return 1
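
The `--umask` handling in this file relies on two behaviors that are easy to miss: argparse passes string defaults through `type`, and a umask only clears permission bits on newly created files. The standalone sketch below demonstrates both without Luigi. The name `temporary_umask` is chosen for the example, and the resulting file mode assumes a POSIX system.

```python
import argparse
import os
import stat
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_umask(mask):
    """Apply mask for the duration of the block, then restore the previous one."""
    previous = os.umask(mask)
    try:
        yield
    finally:
        os.umask(previous)


def parse_octal(s):
    return int(s, 8)


parser = argparse.ArgumentParser()
# argparse runs string defaults through `type`, so '002' becomes the integer 0o002.
parser.add_argument('--umask', type=parse_octal, default='002')
args = parser.parse_args([])

with tempfile.TemporaryDirectory() as tmp, temporary_umask(args.umask):
    path = os.path.join(tmp, 'example.txt')
    with open(path, 'w') as f:
        f.write('hello\n')
    # Keep only the permission bits of the new file.
    mode = stat.S_IMODE(os.stat(path).st_mode)

print(f'umask 0o{args.umask:03o} -> file mode 0o{mode:03o}')
```

With a umask of `002`, the write bit for "others" is cleared while the group write bit is left alone, which matches the help text about group-writable files.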
35 changes: 21 additions & 14 deletions rnaseq_pipeline/config.py
@@ -1,21 +1,28 @@
from typing import Optional

import luigi

# see luigi.cfg for details
class rnaseq_pipeline(luigi.Config):
task_namespace = ''
class Config(luigi.Config):
@classmethod
def get_task_family(cls):
return 'rnaseq_pipeline'

OUTPUT_DIR: str = luigi.Parameter(default='pipeline-output')

GENOMES = luigi.Parameter()
GENOMES: str = luigi.Parameter(default='genomes')
REFERENCES: str = luigi.Parameter(default='references')
SINGLE_CELL_REFERENCES: str = luigi.Parameter(default='references-single-cell')
METADATA: str = luigi.Parameter(default='metadata')
DATA: str = luigi.Parameter(default='data')
DATAQCDIR: str = luigi.Parameter(default='data-qc')
ALIGNDIR: str = luigi.Parameter(default='aligned')
QUANTDIR: str = luigi.Parameter(default='quantified')
QUANT_SINGLE_CELL_DIR: str = luigi.Parameter(default='quantified-single-cell')
BATCHINFODIR: str = luigi.Parameter(default='batch-info')

OUTPUT_DIR = luigi.Parameter()
REFERENCES = luigi.Parameter()
METADATA = luigi.Parameter()
DATA = luigi.Parameter()
DATAQCDIR = luigi.Parameter()
ALIGNDIR = luigi.Parameter()
ALIGNQCDIR = luigi.Parameter()
QUANTDIR = luigi.Parameter()
BATCHINFODIR = luigi.Parameter()
RSEM_DIR: str = luigi.Parameter(default='contrib/RSEM')

RSEM_DIR = luigi.Parameter()
rsem_calculate_expression_bin: str = luigi.Parameter(default='contrib/RSEM/rsem-calculate-expression')

SLACK_WEBHOOK_URL = luigi.OptionalParameter(default=None)
SLACK_WEBHOOK_URL: Optional[str] = luigi.OptionalParameter(default=None)
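
The effect of giving every parameter a default is that `luigi.cfg` only needs to list the values that differ. The sketch below imitates that behavior with a plain dataclass and `configparser`. `PipelineConfig` and `from_ini` are hypothetical stand-ins, and Luigi's actual parameter resolution is more involved than this.

```python
import configparser
from dataclasses import dataclass, fields


@dataclass
class PipelineConfig:
    """Plain-Python stand-in for the luigi.Config subclass (hypothetical)."""
    OUTPUT_DIR: str = 'pipeline-output'
    GENOMES: str = 'genomes'
    REFERENCES: str = 'references'
    QUANTDIR: str = 'quantified'

    @classmethod
    def from_ini(cls, text, section='rnaseq_pipeline'):
        parser = configparser.ConfigParser()
        parser.optionxform = str  # keep option names case-sensitive
        parser.read_string(text)
        overrides = {}
        if parser.has_section(section):
            known = {f.name for f in fields(cls)}
            # Only options matching a declared field override its default.
            overrides = {k: v for k, v in parser[section].items() if k in known}
        return cls(**overrides)


config = PipelineConfig.from_ini("""
[rnaseq_pipeline]
OUTPUT_DIR=/data/rnaseq
""")
print(config.OUTPUT_DIR, config.GENOMES)
```

The section name `rnaseq_pipeline` mirrors what `get_task_family()` returns in the diff, so existing configuration files would keep using the same section after the class is renamed to `Config`.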