Fine-tune a small GPT-2 model with GRPO so it learns to generate correct Python functions for MBPP coding problems (unit-test-based reward).
- Demonstrate GRPO on LLM fine-tuning.
- Use the MBPP dataset (Hugging Face) as real-world data.
- Run on a laptop or Google Colab (small model / LoRA / quantization recommended).
- Clone the repo.
- Create a virtualenv and install requirements:
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
- Run data prep:
python data_prep.py --out_dir data
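
  `data_prep.py` in the repo is the source of truth; as a rough illustration of what the prep step does, here is a minimal sketch that pulls MBPP from the Hugging Face Hub and writes one JSON object per line. The exact fields and output filenames kept by the real script are assumptions here:

  ```python
  # Sketch of an MBPP data-prep step (assumed behavior of data_prep.py):
  # download MBPP from the Hugging Face Hub and dump each split to JSONL.
  import json
  from pathlib import Path

  from datasets import load_dataset

  out_dir = Path("data")
  out_dir.mkdir(parents=True, exist_ok=True)

  ds = load_dataset("mbpp")  # splits: train / validation / test / prompt
  for split, fname in [("train", "mbpp_train.jsonl"), ("validation", "mbpp_valid.jsonl")]:
      with open(out_dir / fname, "w") as f:
          for ex in ds[split]:
              # keep the problem statement, reference solution, and unit tests
              f.write(json.dumps({
                  "task_id": ex["task_id"],
                  "prompt": ex["text"],
                  "code": ex["code"],
                  "test_list": ex["test_list"],
              }) + "\n")
  ```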
- Run training (example):
python train_grpo.py --train_data data/mbpp_train.jsonl --model_name gpt2 \
    --output_dir outputs/grpo_run
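
  `train_grpo.py` wraps TRL's `GRPOTrainer`; its core looks roughly like the sketch below. The `unit_test_reward` helper (see the reward sketch further down) and the hyperparameter values are illustrative assumptions, not the repo's exact code:

  ```python
  # Rough sketch of the GRPO training loop using TRL (not the exact train_grpo.py).
  from datasets import load_dataset
  from trl import GRPOConfig, GRPOTrainer

  train_ds = load_dataset("json", data_files="data/mbpp_train.jsonl", split="train")

  def unit_test_reward(completions, test_list, **kwargs):
      # Placeholder: score each sampled completion by running its MBPP unit
      # tests in a sandbox (run_in_sandbox is sketched below); 1.0 = all pass.
      return [run_in_sandbox(code, tests) for code, tests in zip(completions, test_list)]

  config = GRPOConfig(
      output_dir="outputs/grpo_run",
      num_generations=8,          # group size: completions sampled per prompt
      max_completion_length=256,
  )
  trainer = GRPOTrainer(
      model="gpt2",
      reward_funcs=unit_test_reward,
      args=config,
      train_dataset=train_ds,     # needs a "prompt" column
  )
  trainer.train()
  ```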
- Evaluate:
python eval.py --model outputs/grpo_run/checkpoint --eval_data data/mbpp_valid.jsonl
- Training code uses the TRL `GRPOTrainer` (a sketch of the trainer call is shown above). If you have a GPU, set `--device cuda`.
- The reward function executes generated code against the MBPP unit tests in a safe subprocess with timeouts and limited I/O, roughly as sketched below.
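
  The repo's sandbox is the authoritative version; a minimal sketch of the same idea (hard timeout only, I/O limiting not shown) might look like:

  ```python
  # Sketch of a unit-test reward: run the generated code plus the MBPP asserts
  # in a fresh Python subprocess with a hard timeout. This simplifies the
  # repo's sandbox (no I/O or resource limits shown here).
  import subprocess
  import sys

  def run_in_sandbox(code: str, tests: list[str], timeout: float = 5.0) -> float:
      program = code + "\n" + "\n".join(tests)  # MBPP tests are assert statements
      try:
          result = subprocess.run(
              [sys.executable, "-c", program],
              capture_output=True,
              timeout=timeout,
          )
      except subprocess.TimeoutExpired:
          return 0.0  # infinite loops and overly slow solutions score zero
      return 1.0 if result.returncode == 0 else 0.0
  ```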
- For larger models, enable the LoRA/QLoRA options to reduce memory use (sketched below).
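
  As a sketch of how LoRA plugs in: TRL's `GRPOTrainer` accepts a `peft_config`. The target module names below are GPT-2-specific, the values are illustrative, and `config`, `train_ds`, and `unit_test_reward` reuse the names from the training sketch above:

  ```python
  # Sketch: wrap the policy in LoRA adapters via peft to cut trainable parameters.
  from peft import LoraConfig
  from trl import GRPOTrainer

  peft_config = LoraConfig(
      r=16,
      lora_alpha=32,
      lora_dropout=0.05,
      target_modules=["c_attn", "c_proj"],  # GPT-2 attention projection layers
      task_type="CAUSAL_LM",
  )

  trainer = GRPOTrainer(
      model="gpt2",
      reward_funcs=unit_test_reward,   # from the training sketch above
      args=config,
      train_dataset=train_ds,
      peft_config=peft_config,         # TRL applies the adapters internally
  )
  ```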
License: MIT

Feel free to open an issue if you run into problems.