129 changes: 129 additions & 0 deletions pretrain/scripts/v4-midtraining-with-v3.1-tokenizer/README.md
@@ -0,0 +1,129 @@
# LLMjp-v4 Midtraining

## Overview

An experiment that includes MegaMathPro-Max.

### tokenize

```bash
export EXP_DIR="/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/"
export EXP_SCRIPT_DIR="/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining-with-v3.1-tokenizer"
cd $EXP_DIR

# 1. Download dolmino-mix-1124 from Hugging Face
huggingface-cli download allenai/dolmino-mix-1124 --local-dir "$EXP_DIR/dolmino-mix-1124"

cd $EXP_SCRIPT_DIR
# 2. Extract the dataset (extracted into `$EXP_DIR/dolmino-mix-1124-extracted`)
bash ./preprocess/extract.sh

# 3. Merge the dataset files (merged files are created in `$EXP_DIR/dolmino-mix-1124-extracted-merged`)
qsub ./preprocess/merge_files.sh

# (after step 3 completes; see the dependency sketch after this block for chaining the two jobs)
# 4. Tokenize the dataset (tokenized files are created in `$EXP_DIR/dolmino-mix-1124-tokenized`)
qsub ./preprocess/tokenize.sh

# (optional) Remove intermediate files
rm -rf $EXP_DIR/dolmino-mix-1124-extracted $EXP_DIR/dolmino-mix-1124-extracted-merged
```
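
Since step 4 has to wait for step 3, the two submissions can also be chained with a PBS dependency instead of waiting manually. A minimal sketch, assuming `qsub` prints the job ID to stdout as is usual on PBS:

```bash
# Submit step 3, then queue step 4 so it starts only after step 3 finishes successfully.
MERGE_JOB=$(qsub ./preprocess/merge_files.sh)
qsub -W depend=afterok:${MERGE_JOB} ./preprocess/tokenize.sh
```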

### Dataset creation

Tokenization must be completed before the datasets can be created.

```sh
# Create ./tasks/v4-dolmino-mix-1124/train_data.all.sh
# Token counts are computed automatically and each "TOKEN_COUNT PATH" entry is written to train_data.all.sh
./preprocess/build_train_data.sh

# Create ./tasks/v4-dolmino-mix-1124/train_data_50B.sh from ./tasks/v4-dolmino-mix-1124/train_data.all.sh
# Token counts are updated so that the dataset totals 50B tokens with the same mixture as the dolmino midtraining
./preprocess/update_train_data_to_50B.sh
# Same procedure for 100B and 300B
```
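
For reference, the generated file presumably contains entries like the following (a hypothetical sketch: the `TRAIN_DATA_PATH` variable name, token counts, and dataset paths are all assumptions based on the "token count + path" description above):

```sh
# Hypothetical contents of train_data_50B.sh; the real file is generated by the scripts above.
TRAIN_DATA_PATH=""
TRAIN_DATA_PATH="${TRAIN_DATA_PATH} 12345678901 ${EXP_DIR}/dolmino-mix-1124-tokenized/math/megamath-pro-max_text_document"
TRAIN_DATA_PATH="${TRAIN_DATA_PATH} 2345678901 ${EXP_DIR}/dolmino-mix-1124-tokenized/flan/flan_text_document"
```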

## Environment setup

ref: [scripts/pretrain/installers/v4-megatron-abci at 0130-instruct-pretrain · llm-jp/scripts](https://github.com/llm-jp/scripts/tree/0130-instruct-pretrain/pretrain/installers/v4-megatron-abci)

```sh
cd /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/install-scripts/pretrain/installers/v4-megatron-abci
bash run_setup.sh /path/to/target_dir
# example:
# bash run_setup.sh /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/environment
```

> [!CAUTION]
> Using Transformer Engine v1.10 or later raises an error, so `environment2` is used for this experiment (the Transformer Engine version was downgraded to 1.9).
> ref: https://docs.nvidia.com/nemo-framework/user-guide/24.07/knownissues.html
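
To confirm which version is actually installed, a minimal check can be run from the environment (the `venv` path inside the environment directory is an assumption about the installer's layout):

```sh
source /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/environment2/venv/bin/activate
python -c "import transformer_engine; print(transformer_engine.__version__)"  # expect 1.9.x
```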

> [!CAUTION]
> Added `weights_only=False` at line 72 of `environment/src/Megatron-LM/megatron/core/dist_checkpointing/strategies/common.py`.
> ref: https://github.com/huggingface/accelerate/issues/3539


## Running jobs

```sh
cd /path/to/v4-midtraining-with-v3.1-tokenizer

# example:
bash midtrain/run_train.sh $(realpath tasks/v4-megamath-pro-max) 7.7b_v4_3.5t_tokenizer_v3.1 80B 16
```
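
The positional arguments are `TASK_DIR`, `PARAM_NAME`, and `DATASET_SIZE` (the same three taken by `convert/convert_latest.sh`), followed by what is presumably the number of nodes; check `midtrain/run_train.sh` for the exact meaning of the last argument.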

### [Option] Running jobs with dependencies

A script is provided that uses qsub's `-W depend=...` feature to run jobs with dependencies between them.
Use `run_train_with_deps.sh` instead of `run_train.sh`.

```sh
# Pass the value for `-W depend=` as the last argument
bash midtrain/run_train_with_deps.sh $(realpath tasks/v4-megamath-pro-max) 7.7b_v4_3.5t_tokenizer_v3.1 80B 16 afterok:xxxx.pbs1:yyyy.pbs1
```

For the detailed dependency syntax, see `man qsub` on ABCI 3.0.

## Checkpoint conversion

> [!CAUTION]
> Before running the script below, remove `--no-load-optim` from `scripts/pretrain/scripts/v4-midtraining/midtrain/params`.

```sh
cd /path/to/v4-midtraining-with-v3.1-tokenizer

bash convert/convert_latest.sh {TASK_DIR} {PARAM_NAME} {DATASET_SIZE}

# example:
bash convert/convert_latest.sh $(realpath tasks/v4-megamath-pro-max) 7.7b_v4_3.5t_tokenizer_v3.1 80B
```
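
The converted model is written to `{TASK_DIR}/{PARAM_NAME}/{DATASET_SIZE}/checkpoints_hf/iter_XXXXXXX` for the latest checkpointed iteration, and the conversion logs go to `{TASK_DIR}/logs/` (see `convert/qsub_convert.sh` below).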

> [!CAUTION]
> The following code was added at the top of `/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/environment2/src/Megatron-LM/tools/checkpoint/loader_mcore.py`:
> ```
> import json, os, sys, torch, functools
> torch.load = functools.partial(torch.load, weights_only=False)
> ```

## Model soup

Model merging is performed with [arcee-ai/mergekit](https://github.com/arcee-ai/mergekit).

The environment for model merging is prepared at `$EXP_DIR/venv-mergekit`.

```sh
source $EXP_DIR/venv-mergekit/bin/activate

# Install mergekit on first use
pip install mergekit
```

The merge configuration files are placed under `./merge/`.

Command to run a merge:

```sh
mergekit-yaml merge/your_config.yaml model/output/path/
```
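
For reference, a model-soup style config might look like the following (a hypothetical sketch: the file name, checkpoint paths, and weights are placeholders; see the actual configs under `./merge/`):

```sh
# Hypothetical uniform average ("soup") of two converted checkpoints; all paths are placeholders.
cat > merge/example_soup.yaml << 'EOF'
merge_method: linear
dtype: bfloat16
models:
  - model: /path/to/checkpoints_hf/iter_0010000
    parameters:
      weight: 0.5
  - model: /path/to/checkpoints_hf/iter_0020000
    parameters:
      weight: 0.5
EOF
mergekit-yaml merge/example_soup.yaml /path/to/merged/output
```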
23 changes: 23 additions & 0 deletions pretrain/scripts/v4-midtraining-with-v3.1-tokenizer/convert/convert_latest.sh
@@ -0,0 +1,23 @@
#!/bin/bash

# LLM-jp v4 model converter (PBS version)
# Usage:
#   bash convert_latest.sh \
#     /path/to/task \ ... TASK_DIR: path to the model to save
#     v3-13b \ ... PARAM_NAME: model config; a corresponding file in `params/` must exist
#     80B ... DATASET_SIZE: dataset size tag (e.g. 80B); matches the subdirectory under the param dir

set -eu -o pipefail

task_dir=$1; shift
param_name=$1; shift
dataset_size=$1; shift # 80B
iter=$(cat ${task_dir}/${param_name}/${dataset_size}/checkpoints/latest_checkpointed_iteration.txt)

script_root=/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining-with-v3.1-tokenizer

qsub \
  -v TASK_DIR=${task_dir},PARAM_NAME=${param_name},DATASET_SIZE=${dataset_size},ITER=${iter},RTYPE=rt_HF \
  -m n \
  -o /dev/null \
  -e /dev/null \
  ${script_root}/convert/qsub_convert.sh
153 changes: 153 additions & 0 deletions pretrain/scripts/v4-midtraining-with-v3.1-tokenizer/convert/qsub_convert.sh
@@ -0,0 +1,153 @@
#!/bin/bash
#PBS -P gcg51557
#PBS -q R9920251000
#PBS -N 0193_convert
#PBS -l select=1
#PBS -o /dev/null
#PBS -e /dev/null
#PBS -m n

cd $PBS_O_WORKDIR

JOBID=${PBS_JOBID%%.*}
mkdir -p ${TASK_DIR}/logs
LOGFILE=${TASK_DIR}/logs/convert-$JOBID.out
ERRFILE=${TASK_DIR}/logs/convert-$JOBID.err
exec > $LOGFILE 2> $ERRFILE

set -eu -o pipefail

# Arguments
EXPERIMENT_DIR=/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction
SCRIPT_DIR=${EXPERIMENT_DIR}/scripts/pretrain/scripts/v4-midtraining-with-v3.1-tokenizer/midtrain
ENV_DIR=${EXPERIMENT_DIR}/environment3
echo "EXPERIMENT_DIR=${EXPERIMENT_DIR}"
echo "SCRIPT_DIR=${SCRIPT_DIR}"
echo "TASK_DIR=${TASK_DIR}"
echo "PARAM_NAME=${PARAM_NAME}"
echo "DATASET_SIZE=${DATASET_SIZE}"
echo "ITER=${ITER}"

# Setup environment
source ${SCRIPT_DIR}/common/setup.sh

export MASTER_ADDR=$(head -n 1 $PBS_NODEFILE | hostname -f)
export MASTER_PORT=$((10000 + RANDOM % 1000))
echo "hostname: ${MASTER_ADDR}"

ITER_NAME=iter_$(printf %07d ${ITER}) # iter_0123456

MEGATRON_PATH=${ENV_DIR}/src/Megatron-LM
TOKENIZER_MODEL_PATH=${ENV_DIR}/src/llm-jp-tokenizer/hf/ver3.1/llm-jp-tokenizer-100k.ver3.1 # TODO
OUTPUT_DIR=${TASK_DIR}/${PARAM_NAME}/${DATASET_SIZE}/checkpoints_hf/${ITER_NAME}
echo "OUTPUT_DIR=${OUTPUT_DIR}"

# Setup working directory
TEMP_DIR=$(mktemp -d "${HOME}/converter_${JOBID}_XXXXXX")
echo "TEMP_DIR=${TEMP_DIR}"
function rm_tempdir {
  if [ -e ${TEMP_DIR} ]; then
    echo "Removing temporary directory: ${TEMP_DIR}"
    rm -rf ${TEMP_DIR}
    echo "Done removing"
  fi
}
trap rm_tempdir EXIT
trap 'trap - EXIT; rm_tempdir; exit 1' INT PIPE TERM

########
# Step 1: Convert `torch_dist` format to `torch`
# This process requires launching the trainer script with the same parallelism configuration.
########
echo "Start converting: torch_dist --> torch"

# Prepare source model at specific iteration
mkdir ${TEMP_DIR}/torch_dist
echo ${ITER} > ${TEMP_DIR}/torch_dist/latest_checkpointed_iteration.txt
ln -s ${TASK_DIR}/${PARAM_NAME}/${DATASET_SIZE}/checkpoints/${ITER_NAME} ${TEMP_DIR}/torch_dist/${ITER_NAME}

# Load ALL_PARAMS
source ${SCRIPT_DIR}/params/${PARAM_NAME}.sh
# Remove wandb params
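# Each excluded key is a flag that takes one value, so the element following a matched key is skipped as well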
EXCLUDE_KEYS=("--wandb-entity" "--wandb-project" "--wandb-exp-name")
NEW_PARAMS=()
skip_next=0
for param in "${ALL_PARAMS[@]}"; do
  if [[ $skip_next -eq 1 ]]; then
    skip_next=0
    continue
  fi
  for key in "${EXCLUDE_KEYS[@]}"; do
    if [[ "$param" == "$key" ]]; then
      skip_next=1
      continue 2
    fi
  done
  NEW_PARAMS+=("$param")
done
ALL_PARAMS=("${NEW_PARAMS[@]}")

# Add params specific to model conversion
ALL_PARAMS+=(
  --load ${TEMP_DIR}/torch_dist
  --ckpt-convert-format torch
  --ckpt-convert-save ${TEMP_DIR}
)
echo "ALL_PARAMS: ${ALL_PARAMS[@]}"

NUM_NODES=$(wc -l < $PBS_NODEFILE)
NUM_GPUS_PER_NODE=8
NUM_GPUS=$((${NUM_NODES} * ${NUM_GPUS_PER_NODE}))
echo "nnodes: ${NUM_NODES}; ngpus: ${NUM_GPUS}"
echo NUM_NODES=$NUM_NODES
echo NUM_GPUS_PER_NODE=$NUM_GPUS_PER_NODE
echo NUM_GPUS=$NUM_GPUS

export NVTE_FUSED_ATTN=0
# Launch trainer script to convert the checkpoint
mpirun \
  --display-allocation \
  --report-bindings \
  --oversubscribe \
  -np ${NUM_GPUS} \
  --npernode ${NUM_GPUS_PER_NODE} \
  -bind-to none \
  -map-by slot \
  python ${MEGATRON_PATH}/pretrain_gpt.py \
    ${ALL_PARAMS[@]}

#echo "Files created by the Step 1:"
find ${TEMP_DIR}/torch | sort

########
# Step 2: Convert `torch` to `Hugging Face Llama2`
########

echo "Start converting: torch --> hf"

python ${MEGATRON_PATH}/tools/checkpoint/convert.py \
  --model-type GPT \
  --loader mcore \
  --saver llmjp4_hf \
  --load-dir ${TEMP_DIR}/torch \
  --save-dir ${OUTPUT_DIR} \
  --hf-tokenizer-path ${TOKENIZER_MODEL_PATH} \
  --save-dtype bfloat16 \
  --loader-transformer-impl transformer_engine \
  --megatron-path ${MEGATRON_PATH}

echo "Files created by the Step 2:"
find ${OUTPUT_DIR} | sort

########
# Step 3: Replace tokenizer model
########

echo "Start replacing tokenizer"

cp ${TOKENIZER_MODEL_PATH}/* ${OUTPUT_DIR}

echo "Final model files:"
find ${OUTPUT_DIR} | sort

echo "Done processing"