RaptScore

This repository contains the code for "RaptScore: a large language model-based algorithm for versatile aptamer evaluation", including the DNABERT implementation as a submodule. Some other parts of the code also use and modify the DNABERT implementation.

Setup

Prerequisites

Python 3.10 or higher
CUDA-compatible GPU (recommended)
uv package manager

Quick Setup

Install uv (if not already installed):

curl -LsSf https://astral.sh/uv/install.sh | sh

Clone and setup:

git clone --recurse-submodules git@github.com:hmdlab/RaptScore.git
cd RaptScore
uv sync

Jupyter setup (optional):
```
make install-kernel
```
Verify:
```
make verify
```

Environment Details

Python: 3.10+
PyTorch: 2.8.0+cu126 with CUDA 12.6 support
DNABERT: Custom transformers 2.5.0 (from submodule)
Package Manager: uv with pyproject.toml configuration

Troubleshooting

DNABERT Tokenizers Build Error

If you encounter tokenizers build issues, install Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source $HOME/.cargo/env
uv sync

Usage

Data preparation
1. Aggregate the occurrence count of each sequence across all rounds in SELEX data.
  - e.g. uv run src/fastq_count_analyzer.py DatasetA
  - Before using, please download the FASTQ files and update the file_path field in config/fastq_to_csv_config.json.
  - A sample output file is provided at data/DatasetA/sample.csv
2. When using a new dataset, add the configuration to config/pretrain_config.json.
  - r_num_array: Array of SELEX round numbers to include in the analysis. Example: [1, 2, 3, 4, 5, 6].
  - csv_path: Path to the sequence CSV file. Example: "data/DatasetA/score.csv".
  - duplication_array: Array indicating duplication patterns to consider. Example: ["+", "log", "-"].
  - randomonly_array: Array specifying whether to use only the random region. Example: ["+", "-"].
  - fwd: Forward primer sequence used in SELEX. Example: "GGGAAGCTCCGTCGAGCT".
  - rev: Reverse primer sequence used in SELEX. Example: "TACGCCTGCGTAGCTCCT".
  - targetlength: Integer specifying the random region length. Example: 35.
3. Preprocess data for pretraining under various conditions.
  - e.g. uv run python src/pretrain_process_data.py DatasetA
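
As an illustration of step 2, the sketch below assembles one dataset entry from the example values listed above and prints it as JSON. The field names and values come from this README. Keying the entry by dataset name and the helper `check_entry` are assumptions made for illustration, so compare the result with the existing entries in config/pretrain_config.json before copying anything.

```python
import json

# Example values copied from the field descriptions above.
entry = {
    "r_num_array": [1, 2, 3, 4, 5, 6],
    "csv_path": "data/DatasetA/score.csv",
    "duplication_array": ["+", "log", "-"],
    "randomonly_array": ["+", "-"],
    "fwd": "GGGAAGCTCCGTCGAGCT",
    "rev": "TACGCCTGCGTAGCTCCT",
    "targetlength": 35,
}


def check_entry(entry):
    """Hypothetical sanity checks; not part of the RaptScore code base."""
    problems = []
    if not all(isinstance(r, int) and r > 0 for r in entry["r_num_array"]):
        problems.append("r_num_array must contain positive integers")
    for key in ("fwd", "rev"):
        # Assumption: primers are written with the DNA alphabet, as in the examples.
        if set(entry[key]) - set("ACGT"):
            problems.append(f"{key} contains non-ACGT characters")
    if not isinstance(entry["targetlength"], int) or entry["targetlength"] <= 0:
        problems.append("targetlength must be a positive integer")
    return problems


# Assumption: the config file maps dataset names to entries like this one.
config_fragment = {"DatasetA": entry}
problems = check_entry(entry)
print(json.dumps(config_fragment, indent=2))
```

An empty `problems` list only means the basic types look plausible; it does not confirm that the scripts accept the entry.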
Continual pretraining of DNABERT
1. Download the 3-new-12w-0 model (the DNABERT-3 model) as instructed in the DNABERT repository and put it in /models. Make sure the DNABERT-3 model directory is named 3-new-12w-0.
2. When using a new dataset, add the configuration to config/pretrain_config.json.
3. Run the pretraining command.
  - e.g. uv run python src/pretrain.py A
4. Models are saved in the models/ directory.
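
Before launching pretraining, it can help to confirm that the base model from step 1 is where the scripts are likely to look for it. The sketch below is a hypothetical pre-flight check, not part of the repository. It assumes that /models in step 1 refers to the models directory at the repository root and that the script is run from that root.

```python
from pathlib import Path


def find_base_model(models_root="models", name="3-new-12w-0"):
    """Return the base model directory, or raise if it is missing or empty."""
    model_dir = Path(models_root) / name
    if not model_dir.is_dir():
        raise FileNotFoundError(
            f"{model_dir} not found; download DNABERT-3 and name its directory {name}"
        )
    if not any(model_dir.iterdir()):
        raise FileNotFoundError(f"{model_dir} exists but is empty")
    return model_dir


# Report the result instead of failing, so the check is safe to run anywhere.
try:
    print(f"Base model found at {find_base_model()}")
except FileNotFoundError as err:
    print(f"Not ready for pretraining: {err}")
```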
RaptScore calculation
- All score patterns
  1. Calculate token-level log-likelihood.
    - Provide the dataset name and the path to the sequence file for calculation.
      - Example: uv run python src/calc_loglikelihood.py --dataname DatasetA --dataforcalc data/sample_data.csv
    - For a new dataset, ensure that the configuration is updated in config/calc_loglikelihood_config.json.
    - Log-likelihood files will be generated in the directory out/loglikelihood/{datasetname}/{executed_datetime}.
  2. Calculate RaptScore and evaluate correlation.
    - Provide the path to the sequence file for calculation, the dataset name, and the log-likelihood directory path generated by calc_loglikelihood.py.
      - Example: uv run python src/calc_raptscore.py --dataforcalc data/sample_data.csv DatasetA out/loglikelihood/DatasetA/20251231235959
    - For a new dataset, ensure the configuration is added to config/calc_raptscore_config.json.
    - CSV files containing correlation values and RaptScores for each aptamer will be generated in the out/raptscore directory. Combine these files to identify the RaptScore settings with the highest correlation.
- Specific score pattern
  1. Calculate token-level log-likelihood.
    - Specify the dataset name, the SELEX round number of the best RaptScore pattern, the best RaptScore pattern CSV file (generated by calc_raptscore.py in out/raptscore), and the path to the sequence data file for calculation.
      - Example: uv run python src/calc_loglikelihood.py --dataname DatasetA --calc_round_nums 4 --dataforcalc data/sample_data.csv --maxcoefpattern DatasetAfromPtddup+_randonly+_strict+3merExsp+_Exsi-coef_20251231235959vf.csv
    - Log-likelihood files will be generated in the directory out/loglikelihood/{datasetname}/{executed_datetime}.
  2. Calculate RaptScore and evaluate correlation.
    - Specify the dataset name, the SELEX round number of the best RaptScore pattern, the best RaptScore pattern CSV file name (generated by calc_raptscore.py in out/raptscore), the path to the directory of log-likelihood files (generated in step 1), and the path to the sequence data file for calculation.
      - Example: uv run python src/calc_raptscore.py --dataforcalc data/sample_data.csv --roundnums 4 --maxcoefpattern DatasetAfromPtddup+_randonly+_strict+3merExsp+_Exsi-coef_20251231235959vf.csv DatasetA out/loglikelihood/DatasetA/20251231235959
    - CSV files containing correlation values and RaptScores for each aptamer will be generated in the out/raptscore directory.
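
The "all score patterns" workflow ends by combining the correlation CSV files to find the setting with the highest correlation. One way to do that is sketched below. The column name `coef` and the helper `best_setting` are assumptions made for illustration; this README does not describe the CSV layout, so check the header of a real file in out/raptscore and adjust the column name.

```python
import csv
import math
from pathlib import Path


def best_setting(csv_dir, column="coef"):
    """Return (file name, row) for the row with the largest value in `column`."""
    best = None
    for path in sorted(Path(csv_dir).glob("*.csv")):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                try:
                    value = float(row[column])
                except (KeyError, TypeError, ValueError):
                    continue  # skip files or rows without a numeric value
                if math.isnan(value):
                    continue  # NaN cannot be ranked
                if best is None or value > best[0]:
                    best = (value, path.name, row)
    if best is None:
        raise ValueError(f"no numeric '{column}' values found in {csv_dir}")
    return best[1], best[2]


# Only runs against real output if the directory exists.
out_dir = Path("out/raptscore")
if out_dir.is_dir():
    try:
        name, row = best_setting(out_dir)
        print(name, row)
    except ValueError as err:
        print(err)
```

The file name returned here is the kind of value passed to --maxcoefpattern in the "specific score pattern" workflow, provided the assumed column matches the real files.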
RaptScore + genetic algorithm
- Specify the path to the sequence file for calculation, the SELEX round number of the best RaptScore pattern, and the best RaptScore pattern CSV file (generated by calc_raptscore.py).
  - Example: uv run python src/nsga.py --seq_file_path data/ga/ga_sample.csv --round_num 4 --selex_data_name DatasetA DatasetAfromPtddup+_randonly+_strict+3merExsp+_Exsi-coef_20251231235959vf.csv
- The generated sequences for each round will be saved in the directory out/ga/.
  - Example: out/ga/20251231_235959_result_DatasetA_ps100_mg2_nsga3
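
Because each run writes to a new timestamped directory, locating the most recent one is a common follow-up step. The sketch below picks the newest directory under out/ga/ from the leading timestamp in its name. The YYYYMMDD_HHMMSS format is inferred from the single example above, and both helper functions are hypothetical.

```python
from datetime import datetime
from pathlib import Path


def parse_run_time(dir_name):
    """Parse the leading YYYYMMDD_HHMMSS stamp of a result directory name."""
    try:
        return datetime.strptime(dir_name[:15], "%Y%m%d_%H%M%S")
    except ValueError:
        return None  # the name does not start with a timestamp


def latest_run(ga_root="out/ga"):
    """Return the newest timestamped run directory, or None if there is none."""
    root = Path(ga_root)
    if not root.is_dir():
        return None
    runs = [(parse_run_time(p.name), p) for p in root.iterdir() if p.is_dir()]
    runs = [(stamp, p) for stamp, p in runs if stamp is not None]
    return max(runs, key=lambda pair: pair[0])[1] if runs else None


print(parse_run_time("20251231_235959_result_DatasetA_ps100_mg2_nsga3"))
print(latest_run())
```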

Hardware Configuration Notice

The original experiments were performed on a 4-GPU system, and results may vary under different GPU configurations.
