The code for "RaptScore: a large language model-based algorithm for versatile aptamer evaluation"

hmdlab/RaptScore

RaptScore

This repository contains the code for "RaptScore: a large language model-based algorithm for versatile aptamer evaluation", including an implementation of DNABERT as a submodule. Other parts of the code also use and modify the DNABERT implementation.

Setup

Prerequisites

  • Python 3.10 or higher
  • CUDA-compatible GPU (recommended)
  • uv package manager

Quick Setup

  1. Install uv (if not already installed):

    curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Clone and set up:

    git clone --recurse-submodules git@github.com:hmdlab/RaptScore.git
    cd RaptScore
    uv sync
  3. Jupyter setup (optional):

    make install-kernel
  4. Verify:

    make verify

Environment Details

  • Python: 3.10+
  • PyTorch: 2.8.0+cu126 with CUDA 12.6 support
  • DNABERT: Custom transformers 2.5.0 (from submodule)
  • Package Manager: uv with pyproject.toml configuration
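The versions listed above can be sanity-checked from inside the environment. The following is a minimal sketch and is not part of the repository: the helper name check_environment and the keys of its report are assumptions made for illustration, and it is not a description of what make verify does. It only reports the Python version and, if PyTorch is importable, its version and CUDA visibility.

```python
import importlib.util
import sys

# Minimum Python version stated in the prerequisites.
MIN_PYTHON = (3, 10)


def check_environment():
    """Return a small report about the interpreter and PyTorch.

    Hypothetical helper for illustration; not part of the RaptScore code base.
    """
    report = {
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_version": None,
        "cuda_available": None,
        "gpu_count": None,
    }
    if report["torch_installed"]:
        # Imported lazily so the check still runs when PyTorch is missing.
        import torch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
        report["gpu_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Saved under a file name of your choice, it can be run with uv run python so that it uses the project environment.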

Troubleshooting

DNABERT Tokenizers Build Error

If you encounter tokenizers build issues, install Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"
uv sync

Usage

  1. Data preparation

    1. Aggregate the occurrence count of each sequence across all rounds in SELEX data.
      • e.g. uv run src/fastq_count_analyzer.py DatasetA
      • Before running, download the FASTQ files and update the file_path field in config/fastq_to_csv_config.json.
      • A sample output file is provided at data/DatasetA/sample.csv
    2. When using a new dataset, add the configuration to config/pretrain_config.json.
      • r_num_array: Array of SELEX round numbers to include in the analysis. Example: [1, 2, 3, 4, 5, 6].
      • csv_path: Path to the sequence CSV file. Example: "data/DatasetA/score.csv".
      • duplication_array: Array indicating duplication patterns to consider. Example: ["+", "log", "-"].
      • randomonly_array: Array specifying whether to use only the random region. Example: ["+", "-"].
      • fwd: Forward primer sequence used in SELEX. Example: "GGGAAGCTCCGTCGAGCT".
      • rev: Reverse primer sequence used in SELEX. Example: "TACGCCTGCGTAGCTCCT".
      • targetlength: Integer specifying the random region length. Example: 35.
    3. Preprocess data for pretraining under various conditions.
      • e.g. uv run python src/pretrain_process_data.py DatasetA
  2. Continual pretraining of DNABERT

    1. Download the 3-new-12w-0 model (the DNABERT-3 model) as instructed in the DNABERT repository and place it in /models. Make sure the model directory is named 3-new-12w-0.
    2. When using a new dataset, add the configuration to config/pretrain_config.json.
    3. Run the pretraining command.
      • e.g. uv run python src/pretrain.py A
    4. Models are saved in the models/ directory.
  3. RaptScore calculation

    • All score patterns
      1. Calculate token-level log-likelihood.
        • Provide the dataset name and the path to the sequence file for calculation.
          • Example: uv run python src/calc_loglikelihood.py --dataname DatasetA --dataforcalc data/sample_data.csv
        • For a new dataset, ensure that the configuration is updated in config/calc_loglikelihood_config.json.
        • Log-likelihood files will be generated in the directory out/loglikelihood/{datasetname}/{executed_datetime}.
      2. Calculate RaptScore and evaluate correlation.
        • Provide the path to the sequence file for calculation, the dataset name, and the log-likelihood directory path generated by calc_loglikelihood.py.
          • Example: uv run python src/calc_raptscore.py --dataforcalc data/sample_data.csv DatasetA out/loglikelihood/DatasetA/20251231235959
        • For a new dataset, ensure the configuration is added to config/calc_raptscore_config.json.
        • CSV files containing correlation values and RaptScores for each aptamer will be generated in the out/raptscore directory. Combine these files to identify the RaptScore settings with the highest correlation.
    • Specific score pattern
      1. Calculate token-level log-likelihood.
        • Specify the dataset name, the SELEX round number of the best RaptScore pattern, the best RaptScore pattern CSV file (generated by calc_raptscore.py in out/raptscore), and the path to the sequence data file for calculation.
          • Example: uv run python src/calc_loglikelihood.py --dataname DatasetA --calc_round_nums 4 --dataforcalc data/sample_data.csv --maxcoefpattern DatasetAfromPtddup+_randonly+_strict+3merExsp+_Exsi-coef_20251231235959vf.csv
        • Log-likelihood files will be generated in the directory out/loglikelihood/{datasetname}/{executed_datetime}.
      2. Calculate RaptScore and evaluate correlation.
        • Specify the dataset name, the SELEX round number of the best RaptScore pattern, the name of the best RaptScore pattern CSV file (generated by calc_raptscore.py in out/raptscore), the path to the directory of log-likelihood files (generated in step 1), and the path to the sequence data file for calculation.
          • Example: uv run python src/calc_raptscore.py --dataforcalc data/sample_data.csv --roundnums 4 --maxcoefpattern DatasetAfromPtddup+_randonly+_strict+3merExsp+_Exsi-coef_20251231235959vf.csv DatasetA out/loglikelihood/DatasetA/20251231235959
        • CSV files containing correlation values and RaptScores for each aptamer will be generated in the out/raptscore directory.
  4. RaptScore + genetic algorithm

    • Specify the path to the sequence file for calculation, the SELEX round number of the best RaptScore pattern, and the best RaptScore pattern CSV file (generated by calc_raptscore.py).
      • Example: uv run python src/nsga.py --seq_file_path data/ga/ga_sample.csv --round_num 4 --selex_data_name DatasetA DatasetAfromPtddup+_randonly+_strict+3merExsp+_Exsi-coef_20251231235959vf.csv
    • The generated sequences for each round will be saved in the directory out/ga/.
      • Example: out/ga/20251231_235959_result_DatasetA_ps100_mg2_nsga3
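To make the data preparation step more concrete, the sketch below shows one way the configuration fields r_num_array, fwd, rev, and targetlength could drive per-round sequence counting. It is an illustration under stated assumptions, not the logic of src/fastq_count_analyzer.py or src/pretrain_process_data.py: the CONFIG dictionary layout, the function names, the exact-match primer rule, and the in-memory FASTQ input are all hypothetical.

```python
from collections import Counter

# Hypothetical config entry reusing the documented field names and example
# values; the real layout of config/pretrain_config.json may differ.
CONFIG = {
    "r_num_array": [1, 2, 3],
    "fwd": "GGGAAGCTCCGTCGAGCT",
    "rev": "TACGCCTGCGTAGCTCCT",
    "targetlength": 35,
}


def read_fastq_sequences(lines):
    """Yield the sequence line (second line) of each 4-line FASTQ record."""
    for index, line in enumerate(lines):
        if index % 4 == 1:
            yield line.strip().upper()


def extract_random_region(read, fwd, rev, targetlength):
    """Return the region between the primers, or None if the read is rejected.

    Assumption: primers must match exactly and the region must have exactly
    targetlength nucleotides.
    """
    if not (read.startswith(fwd) and read.endswith(rev)):
        return None
    region = read[len(fwd):len(read) - len(rev)]
    return region if len(region) == targetlength else None


def count_by_round(fastq_by_round, config):
    """Count random-region occurrences for each round in r_num_array."""
    counts = {}
    for round_num in config["r_num_array"]:
        counter = Counter()
        for read in read_fastq_sequences(fastq_by_round.get(round_num, [])):
            region = extract_random_region(
                read, config["fwd"], config["rev"], config["targetlength"]
            )
            if region is not None:
                counter[region] += 1
        counts[round_num] = counter
    return counts
```

Here fastq_by_round maps a round number to an iterable of FASTQ lines, for example an open file handle. Reads that lack either primer, or whose random region is not exactly targetlength long, are dropped in this sketch; the repository's own filtering rules may be more permissive.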

Hardware Configuration Notice

  • The original experiments were performed on a 4-GPU system, and results may vary under different GPU configurations.
