2 changes: 1 addition & 1 deletion pretrain/installers/v4-upstream-megatron-abci/README.md
@@ -5,7 +5,7 @@
Run the following commands on ABCI 3.0 to build the environment under `<env_install_path>`:

```bash
-cd pretrain/installers/v5-megatron-abci/
+cd pretrain/installers/v4-upstream-megatron-abci/
bash run_setup.sh <env_install_path>
```

14 changes: 7 additions & 7 deletions pretrain/scripts/v4-upstream-training-template/README.md
@@ -13,7 +13,7 @@ Training scripts for LLM-jp v5 using Megatron-LM on ABCI 3.0

```bash
cd $EXP_DIR
-git clone git@github.com:llm-jp/scripts.git
+git clone https://github.com/llm-jp/scripts.git
```

Next, use [pretrain/installers/v5-megatron-abci](../../installers/v5-megatron-abci/README.md) to build the environment under `$EXP_DIR/env`.
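Putting the clone and install steps together, the expected layout of `$EXP_DIR` can be sketched as below. The directory names come from the scripts in this repository; using `mktemp -d` as a stand-in for the real experiment directory is an assumption for illustration only.

```shell
# Sketch of the experiment-directory layout assumed by the training scripts.
EXP_DIR=$(mktemp -d)            # stand-in for the real experiment directory
mkdir -p "${EXP_DIR}/scripts"   # clone of llm-jp/scripts
mkdir -p "${EXP_DIR}/env"       # built by the installer
mkdir -p "${EXP_DIR}/tasks"     # one subdirectory per task (created below)
ls "${EXP_DIR}"
```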
@@ -47,20 +47,22 @@ cp -r scripts/pretrain/task_template/ $EXP_DIR/tasks/$TASK_NAME

```bash
cd $EXP_DIR/scripts/pretrain/$TRAINING_SCRIPT_DIR/
-bash run_train.sh <RESERVATION_ID> <EXPERIMENT_ID> <EXPERIMENT_DIR> <TASK_NAME> <WANDB_PROJECT> <NUM_NODES>
+bash run_train.sh <GROUP_ID> <RESERVATION_ID> <JOB_NAME> <EXPERIMENT_DIR> <TASK_NAME> <WANDB_PROJECT> <NUM_NODES> <WALLTIME>

# Example:
-bash run_train.sh R0123456789 0123 /path/to/0123_experiment task_name 0123_experiment 32
+bash run_train.sh gcg51557 R0123456789 0123_pretrain /path/to/0123_experiment task_name 0123_experiment 32 720:00:00
```

The following arguments are specified from the CLI:

+- `<GROUP_ID>`: ABCI group ID
- `<RESERVATION_ID>`: ABCI reservation queue ID
-- `<EXPERIMENT_ID>`: experiment identifier (e.g. `0123`)
-- `<EXPERIMENT_DIR>`: path to the experiment directory (e.g. `/home/ach17726fj/experiments/0123_experiment`)
+- `<JOB_NAME>`: job name (e.g. `0123_pretrain`)
+- `<EXPERIMENT_DIR>`: path to the experiment directory (e.g. `/path/to/0123_experiment`)
- `<TASK_NAME>`: task directory name (e.g. `task_name`)
- `<WANDB_PROJECT>`: project name to log to on WandB (e.g. `0123_experiment`)
- `<NUM_NODES>`: number of nodes to use (e.g. `32`)
+- `<WALLTIME>`: job time limit (e.g. `720:00:00`)
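When sizing `<NUM_NODES>`, note that the job script later derives the total GPU count assuming 8 GPUs per node (the `rt_HF` resource type). A minimal sketch of that arithmetic, with `NUM_NODES=32` taken from the example above:

```shell
# Mirrors the GPU-count computation in the job script (8 GPUs per node).
NUM_NODES=32
NUM_GPUS_PER_NODE=8
NUM_GPUS=$((NUM_NODES * NUM_GPUS_PER_NODE))
echo "NUM_GPUS=${NUM_GPUS}"   # 32 nodes -> 256 GPUs
```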

### Training Configuration

@@ -70,5 +72,3 @@ The following arguments are specified from the CLI
- Define the arguments passed to Megatron-LM's `pretrain_gpt.py` as variables in this file
- `train_data.sh`: script that defines the training data paths, the number of tokens to use, and so on
- Define the value passed to Megatron-LM's `--train-data` argument in the `$TRAIN_DATA_PATH` variable in this file
-- `train_iters.txt`: file that defines the number of training iterations
-- Write the number of iterations to train, and nothing else
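As an illustration, minimal versions of `train_data.sh` and a `params.sh` fragment might look as follows. The dataset paths and token counts are placeholders, not real data; the files in the task template should be the actual starting point.

```shell
# Hypothetical train_data.sh: builds the value for Megatron-LM's --train-data.
# Each entry is "<number of tokens> <dataset prefix>"; paths are placeholders.
TRAIN_DATA_PATH=""
TRAIN_DATA_PATH="${TRAIN_DATA_PATH} 1000000 /path/to/dataset_a"
TRAIN_DATA_PATH="${TRAIN_DATA_PATH} 2000000 /path/to/dataset_b"

# Hypothetical params.sh fragment: appends to the ALL_PARAMS array that the
# job script later expands as ${ALL_PARAMS[@]} when calling pretrain_gpt.py.
ALL_PARAMS=()
ALL_PARAMS+=(
    --train-data ${TRAIN_DATA_PATH}
)
echo "${#ALL_PARAMS[@]} parameters assembled"
```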
@@ -5,6 +5,8 @@
# * TASK_NAME: Name of the task
# * WANDB_PROJECT: W&B project name

+set -eu -o pipefail

cd ${PBS_O_WORKDIR}

TASK_DIR=${EXPERIMENT_DIR}/tasks/${TASK_NAME}
@@ -15,8 +17,6 @@ LOGFILE=${TASK_DIR}/logs/pretrain-${JOB_ID}.out
ERRFILE=${TASK_DIR}/logs/pretrain-${JOB_ID}.err
exec > ${LOGFILE} 2> ${ERRFILE}

-set -eu -o pipefail

ENV_DIR=${EXPERIMENT_DIR}/env
SCRIPT_DIR=${EXPERIMENT_DIR}/scripts

@@ -55,21 +55,19 @@ echo "hostname: ${MASTER_ADDR}"
NUM_NODES=$(wc -l < ${PBS_NODEFILE})
NUM_GPUS_PER_NODE=8
NUM_GPUS=$((${NUM_NODES} * ${NUM_GPUS_PER_NODE}))
-echo "nnodes: ${NUM_NODES}; ngpus: ${NUM_GPUS}"
+echo NUM_NODES=${NUM_NODES}
+echo NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE}
+echo NUM_GPUS=${NUM_GPUS}

# For logging
echo "PBS_NODEFILE:"
cat ${PBS_NODEFILE}

-# Training steps
-TRAIN_ITERS=$(cat ${TASK_DIR}/train_iters.txt)

-# Training data: TRAIN_DATA_PATH
+# Load training data: TRAIN_DATA_PATH
source ${TASK_DIR}/train_data.sh

-# Synthesize all model params: ALL_PARAMS
-# Requires TRAIN_ITERS and TRAIN_DATA_PATH
+# Load model params: ALL_PARAMS
+# Requires TRAIN_DATA_PATH
source ${TASK_DIR}/params.sh

# Add logging params
@@ -89,8 +87,10 @@ ALL_PARAMS+=(
--save-interval 1000
)

# For logging
echo "ALL_PARAMS: ${ALL_PARAMS[@]}"

+echo "Start training..."
mpirun \
--display-allocation \
--report-bindings \
@@ -102,3 +102,5 @@
python \
${ENV_DIR}/src/Megatron-LM/pretrain_gpt.py \
${ALL_PARAMS[@]}

+echo "Training completed successfully."
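The `exec > ${LOGFILE} 2> ${ERRFILE}` line near the top of this script sends everything the job subsequently prints to per-task log files. A small self-contained sketch of the same pattern, run in a subshell so the redirection stays local (file names here are stand-ins, not the real task paths):

```shell
# Demonstrate the exec-based log redirection used in the job script.
LOGDIR=$(mktemp -d)
LOGFILE=${LOGDIR}/pretrain-test.out
ERRFILE=${LOGDIR}/pretrain-test.err
(
  exec > "${LOGFILE}" 2> "${ERRFILE}"
  echo "this line goes to the .out file"
  echo "this line goes to the .err file" >&2
)
cat "${LOGFILE}"
```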
@@ -2,31 +2,30 @@

set -eu -o pipefail

-if [ $# -ne 6 ]; then
->&2 echo "Usage: $0 <RESERVATION_ID> <EXPERIMENT_ID> <EXPERIMENT_DIR> <TASK_NAME> <WANDB_PROJECT> <NUM_NODES>"
->&2 echo "Example: $0 R0123456789 0123 /path/to/0123_experiment task_name 0123_experiment 32"
+if [ $# -ne 8 ]; then
+>&2 echo "Usage: $0 <GROUP_ID> <RESERVATION_ID> <JOB_NAME> <EXPERIMENT_DIR> <TASK_NAME> <WANDB_PROJECT> <NUM_NODES> <WALLTIME>"
+>&2 echo "Example: $0 gcg51557 R0123456789 0123 /path/to/0123_experiment task_name 0123_experiment 32 720:00:00"
exit 1
fi

+GROUP_ID=$1; shift
RESERVATION_ID=$1; shift
-EXPERIMENT_ID=$1; shift
+JOB_NAME=$1; shift
EXPERIMENT_DIR=$1; shift
TASK_NAME=$1; shift
WANDB_PROJECT=$1; shift
NUM_NODES=$1; shift
+WALLTIME=$1; shift

# This directory
SCRIPT_ROOT=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )

-WALLTIME=720:00:00 # 30 days
-# WALLTIME=01:00:00 # 1 hour

qsub \
--P gcg51557 \
+-P ${GROUP_ID} \
-q ${RESERVATION_ID} \
--N ${EXPERIMENT_ID}_pretrain \
+-N ${JOB_NAME} \
-l select=${NUM_NODES},walltime=${WALLTIME} \
--v RTYPE=rt_HF,EXPERIMENT_DIR=${EXPERIMENT_DIR},TASK_NAME=${TASK_NAME},WANDB_PROJECT=${WANDB_PROJECT} \
+-v RTYPE=rt_HF,USE_SSH=1,EXPERIMENT_DIR=${EXPERIMENT_DIR},TASK_NAME=${TASK_NAME},WANDB_PROJECT=${WANDB_PROJECT} \
-o /dev/null \
-e /dev/null \
-m n \
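Since a mis-assembled `qsub` line wastes a reservation slot, it can help to print the command before submitting for real. A hedged sketch (flag values copied from the usage example; nothing is actually submitted):

```shell
# Assemble the qsub flags the way run_train.sh does, but only echo them.
GROUP_ID=gcg51557
RESERVATION_ID=R0123456789
JOB_NAME=0123_pretrain
NUM_NODES=32
WALLTIME=720:00:00
QSUB_CMD="qsub -P ${GROUP_ID} -q ${RESERVATION_ID} -N ${JOB_NAME} -l select=${NUM_NODES},walltime=${WALLTIME}"
echo "${QSUB_CMD}"
```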
@@ -47,9 +47,9 @@ ALL_PARAMS+=(

# Scheduler
ALL_PARAMS+=(
---train-iters ${TRAIN_ITERS}
+--train-iters 100000
--lr-warmup-iters 2000
---lr-decay-iters ${TRAIN_ITERS}
+--lr-decay-iters 100000
--lr-decay-style cosine
--eval-interval 999999999
--eval-iters 0

This file was deleted.