This repository contains the code for "RaptScore: a large language model-based algorithm for versatile aptamer evaluation". It includes the DNABERT implementation as a submodule, and some other parts of the code also use and modify that implementation.
- Python 3.10 or higher
- CUDA-compatible GPU (recommended)
- uv package manager
Install uv (if not already installed):
curl -LsSf https://astral.sh/uv/install.sh | sh
Clone and setup:
git clone --recurse-submodules git@github.com:hmdlab/RaptScore.git
cd RaptScore
uv sync
Jupyter setup (optional):
make install-kernel
Verify:
make verify
- Python: 3.10+
- PyTorch: 2.8.0+cu126 with CUDA 12.6 support
- DNABERT: Custom transformers 2.5.0 (from submodule)
- Package Manager: uv with pyproject.toml configuration
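The two requirements that can be checked without a GPU are the Python version and the presence of uv. The following is a minimal sketch of such a pre-flight check. It is not what make verify runs (that target's contents are not described here), and the helper name preflight is hypothetical.

```python
import shutil
import sys


def preflight(min_version=(3, 10)):
    """Return a dict of basic environment checks (hypothetical helper)."""
    return {
        # The requirements list asks for Python 3.10 or higher.
        "python_ok": sys.version_info[:2] >= min_version,
        # uv must be on PATH for `uv sync` and `uv run` to work.
        "uv_found": shutil.which("uv") is not None,
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

CUDA and PyTorch are deliberately left out of this sketch, since checking them requires the synced environment itself.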
If you encounter tokenizers build issues, install Rust:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source $HOME/.cargo/env
uv sync
Data preparation
- Aggregate the occurrence count of each sequence across all rounds in SELEX data.
  - e.g.
    uv run src/fastq_count_analyzer.py DatasetA
  - Before using, please download the FASTQ files and update the file_path field in config/fastq_to_csv_config.json.
  - A sample output file is provided at data/DatasetA/sample.csv.
- When using a new dataset, add the configuration to config/pretrain_config.json.
  - r_num_array: Array of SELEX round numbers to include in the analysis. Example: [1, 2, 3, 4, 5, 6].
  - csv_path: Path to the sequence CSV file. Example: "data/DatasetA/score.csv".
  - duplication_array: Array indicating duplication patterns to consider. Example: ["+", "log", "-"].
  - randomonly_array: Array specifying whether to use only the random region or not. Example: ["+", "-"].
  - fwd: Forward primer sequence used in SELEX. Example: "GGGAAGCTCCGTCGAGCT".
  - rev: Reverse primer sequence used in SELEX. Example: "TACGCCTGCGTAGCTCCT".
  - targetlength: Integer specifying the random region length. Example: 35.
- Preprocess data for pretraining under various conditions.
  - e.g.
    uv run python src/pretrain_process_data.py DatasetA
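To make the aggregation step and the fwd, rev, and targetlength fields concrete, here is a minimal sketch of counting sequences per round and extracting the random region between the primers. It is an illustration only, not the code in src/fastq_count_analyzer.py or src/pretrain_process_data.py. The function names are hypothetical, and requiring an exact primer match is an assumption.

```python
from collections import Counter


def read_fastq_sequences(lines):
    """Yield the sequence line (second line) of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip().upper()


def random_region(seq, fwd, rev, targetlength=None):
    """Return the region between the primers, or None if the read does not fit.

    Assumption: primers must match exactly at both ends of the read.
    """
    if len(seq) < len(fwd) + len(rev):
        return None
    if not (seq.startswith(fwd) and seq.endswith(rev)):
        return None
    region = seq[len(fwd):len(seq) - len(rev)]
    if targetlength is not None and len(region) != targetlength:
        return None
    return region


def count_by_round(reads_by_round):
    """Map each sequence to {round_number: occurrence count}."""
    table = {}
    for round_num, seqs in reads_by_round.items():
        for seq, n in Counter(seqs).items():
            table.setdefault(seq, {})[round_num] = n
    return table
```

A sequence absent from a round simply has no entry for that round in its inner dictionary; whether the real output stores zeros instead is not stated here.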
Continual pretraining of DNABERT
- Download the 3-new-12w-0 model (the DNABERT-3 model) as instructed in the DNABERT repository and put it in /models. Make sure to rename the DNABERT-3 model directory to 3-new-12w-0.
- When using a new dataset, add the configuration to config/pretrain_config.json.
- Run the pretrain command.
  - e.g.
    uv run python src/pretrain.py A
- Models are saved in the directory models/.
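When adding a new dataset to config/pretrain_config.json, a quick check of the entry can catch a missing or mistyped field before a long pretraining run. The sketch below uses the field names and example values documented for that file. The dataset name MyDataset, its csv_path, the helper validate_entry, the idea that every field is required, and keying the file by dataset name are all assumptions for illustration.

```python
import json

# Field names come from the documented configuration; treating all of them
# as required, with these types, is an assumption of this sketch.
REQUIRED_FIELDS = {
    "r_num_array": list,
    "csv_path": str,
    "duplication_array": list,
    "randomonly_array": list,
    "fwd": str,
    "rev": str,
    "targetlength": int,
}


def validate_entry(entry):
    """Return a list of problems found in one dataset entry (empty if none)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


# Hypothetical entry for a dataset called "MyDataset".
example_entry = {
    "r_num_array": [1, 2, 3, 4, 5, 6],
    "csv_path": "data/MyDataset/score.csv",
    "duplication_array": ["+", "log", "-"],
    "randomonly_array": ["+", "-"],
    "fwd": "GGGAAGCTCCGTCGAGCT",
    "rev": "TACGCCTGCGTAGCTCCT",
    "targetlength": 35,
}

if __name__ == "__main__":
    print(validate_entry(example_entry))
    # Assumed layout: one top-level key per dataset name.
    print(json.dumps({"MyDataset": example_entry}, indent=2))
```

Compare the printed JSON against the existing DatasetA entry in the file before pasting it in, since the actual layout may differ from the one assumed here.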
RaptScore calculation
- All score patterns
  - Calculate token-level log-likelihood.
    - Provide the dataset name and the path to the sequence file for calculation.
      - Example:
        uv run python src/calc_loglikelihood.py --dataname DatasetA --dataforcalc data/sample_data.csv
    - For a new dataset, ensure that the configuration is updated in config/calc_loglikelihood_config.json.
    - Log-likelihood files will be generated in the directory out/loglikelihood/{datasetname}/{executed_datetime}.
  - Calculate RaptScore and evaluate correlation.
    - Provide the path to the sequence file for calculation, the dataset name, and the log-likelihood directory path generated by calc_loglikelihood.py.
      - Example:
        uv run python src/calc_raptscore.py --dataforcalc data/sample_data.csv DatasetA out/loglikelihood/DatasetA/20251231235959
    - For a new dataset, ensure the configuration is added to config/calc_raptscore_config.json.
    - CSV files containing correlation values and RaptScores for each aptamer will be generated in the out/raptscore directory. Combine these files to identify the RaptScore settings with the highest correlation.
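Combining the correlation CSV files and picking the best setting can be sketched as follows. The column name coef is a hypothetical placeholder, since the actual column layout of the files in out/raptscore is not described here, and "highest" is taken to mean the largest signed value rather than the largest absolute value.

```python
import csv
import glob
import math
import os


def load_rows(directory, glob_pattern="*.csv"):
    """Read every CSV in `directory` and tag each row with its source file."""
    rows = []
    for path in sorted(glob.glob(os.path.join(directory, glob_pattern))):
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):
                row["source_file"] = os.path.basename(path)
                rows.append(row)
    return rows


def best_row(rows, coef_column="coef"):
    """Return the row with the largest correlation value, or None.

    Rows whose value is missing, non-numeric, or NaN are skipped.
    `coef_column` is an assumed column name; adjust it to the real header.
    """
    best, best_value = None, None
    for row in rows:
        try:
            value = float(row[coef_column])
        except (KeyError, TypeError, ValueError):
            continue
        if math.isnan(value):
            continue
        if best_value is None or value > best_value:
            best, best_value = row, value
    return best
```

Calling best_row(load_rows("out/raptscore")) then returns the winning row together with the name of the file it came from.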
- Specific score pattern
  - Calculate token-level log-likelihood.
    - Specify the dataset name, the SELEX round number of the best RaptScore pattern, the best RaptScore pattern CSV file (generated by calc_raptscore.py at out/raptscore), and the path to the sequence data file for calculation.
      - Example:
        uv run python src/calc_loglikelihood.py --dataname DatasetA --calc_round_nums 4 --dataforcalc data/sample_data.csv --maxcoefpattern DatasetAfromPtddup+_randonly+_strict+3merExsp+_Exsi-coef_20251231235959vf.csv
    - Log-likelihood files will be generated in the directory out/loglikelihood/{datasetname}/{executed_datetime}.
  - Calculate RaptScore and evaluate correlation.
    - Specify the dataset name, the SELEX round number of the best RaptScore pattern, the best RaptScore pattern CSV file name (generated by calc_raptscore.py at out/raptscore), the path to the directory of log-likelihood files (generated by the log-likelihood calculation above), and the path to the sequence data file for calculation.
      - Example:
        uv run python src/calc_raptscore.py --dataforcalc data/sample_data.csv --roundnums 4 --maxcoefpattern DatasetAfromPtddup+_randonly+_strict+3merExsp+_Exsi-coef_20251231235959vf.csv DatasetA out/loglikelihood/DatasetA/20251231235959
    - CSV files containing correlation values and RaptScores for each aptamer will be generated in the out/raptscore directory.
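For readers unfamiliar with what "token-level" means here: DNABERT tokenizes a sequence into overlapping k-mers, and the DNABERT-3 model uses k = 3. The sketch below shows that tokenization and two generic ways to summarize per-token log-likelihoods. How RaptScore actually combines token-level values is not described in this document, so the sum and mean are placeholders, and both function names are hypothetical.

```python
def kmer_tokens(seq, k=3):
    """Split a sequence into overlapping k-mers with stride 1."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def summarize_loglikelihood(token_lls):
    """Return the sum and mean of per-token log-likelihoods.

    These are generic summaries, not the RaptScore definition.
    """
    if not token_lls:
        raise ValueError("no token log-likelihoods given")
    total = sum(token_lls)
    return {"sum": total, "mean": total / len(token_lls)}


if __name__ == "__main__":
    print(kmer_tokens("ACGTA"))
    print(summarize_loglikelihood([-1.0, -3.0]))
```

A sequence of length n yields n - k + 1 tokens, so a sequence shorter than k produces no tokens at all.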
RaptScore + genetic algorithm
- Specify the path to the sequence file for calculation, the SELEX round number of the best RaptScore pattern, and the best RaptScore pattern CSV file (generated by calc_raptscore.py).
  - Example:
    uv run python src/nsga.py --seq_file_path data/ga/ga_sample.csv --round_num 4 --selex_data_name DatasetA DatasetAfromPtddup+_randonly+_strict+3merExsp+_Exsi-coef_20251231235959vf.csv
- The generated sequences for each round will be saved in the directory out/ga/.
  - Example:
    out/ga/20251231_235959_result_DatasetA_ps100_mg2_nsga3
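The script name and the nsga3 suffix in the output directory suggest an NSGA-style multi-objective search. Algorithms in that family rank candidates by Pareto dominance, which the sketch below illustrates. The objectives that src/nsga.py optimizes are not stated here, so the two-element tuples are unnamed placeholders, all objectives are assumed to be maximized, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and better on one.

    Assumption: every objective is maximized.
    """
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Return the (name, objectives) pairs that no other candidate dominates."""
    front = []
    for name, objs in candidates:
        if not any(dominates(other, objs) for _, other in candidates):
            front.append((name, objs))
    return front


if __name__ == "__main__":
    # Hypothetical sequences with two placeholder objective values each.
    population = [
        ("seq1", (0.9, 0.1)),
        ("seq2", (0.5, 0.5)),
        ("seq3", (0.4, 0.4)),
        ("seq4", (0.1, 0.9)),
    ]
    print([name for name, _ in pareto_front(population)])
```

This covers only the ranking idea; mutation, crossover, and the reference-point selection used by NSGA-III are outside the scope of the sketch.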
- The original experiments were performed on a 4-GPU system, and results may vary under different GPU configurations.