Commit 9331ca2

Fix the random handoff and change default param (#14)
* Make random handoff the default.
* Reset the training epoch counts and make level_feedback the default external mode.
* Delete unused files.
* Fix the CHE train split being too small.
* Config and external overhaul:
  - Introduce a unified `external` section (mode, sandbox_slice, original/previous, expert_model).
  - Default `external.mode=level_feedback`; `sandbox_slice=1` (supports 0/None/'all').
  - Handoff handled in the CoMLRL trainer with strict modes; expose `magrpo.handoff` / `grpo.handoff`.
  - Update HumanEval/CHE splits (HE train 33:163, eval :32; CHE train 16:, eval :16).
  - Set `output.save_final_model=false` by default.
  - Set `wandb.dir` and `output.base_dir` to storage paths by trainer/mode:
    - ST GRPO: output_st_grpo, ST MAGRPO: output_st_magrpo
    - MT GRPO: output_mt_grpo, MT MAGRPO: output_mt_magrpo
  - Rename the expert model key to `external.expert_model` (used only for expert_edits).
  - Simplify YAML comments to section headers only.
  - Read `external.*` in train_magrpo.py and train_grpo.py; defaults adjusted.
  - README: clarify external keys and sandbox_slice semantics.
* Remove unnecessary try/except; robust sandbox_slice parsing without exceptions; minimal default tags.
* Fix: define external_cfg before use; remove duplicate assignments in train_magrpo.py and train_grpo.py.
* Fix: handle dataset load failure in train_magrpo.py (return early) to avoid UnboundLocalError.
* Configs: reduce num_train_epochs by 20% (rounded) across all YAMLs.
Parent: 84aac96 · Commit: 9331ca2

17 files changed (+292 / −751 lines)
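The commit message mentions "robust sandbox_slice parsing without exceptions", and the README hunk further down documents the semantics (default 1 = first assert only; 0, None, or 'all' = every assert; negative = last N). Here is a minimal Python sketch of what such exception-free parsing could look like; the helper names and fallback choices are assumptions, not the repository's actual implementation.

```python
# Hypothetical sketch of exception-free sandbox_slice parsing.
# Names (parse_sandbox_slice, slice_asserts) are illustrative, not the repo's API.
from typing import List, Optional, Union

def parse_sandbox_slice(value: Union[int, str, None]) -> Optional[int]:
    """Normalize external.sandbox_slice without raising.

    Returns None for "use all asserts" (0, None, or 'all'),
    otherwise an int (positive = first N asserts, negative = last N).
    """
    if value is None:
        return None
    if isinstance(value, str):
        text = value.strip().lower()
        if text in ("", "all", "none"):
            return None
        # lstrip('-').isdigit() avoids wrapping int() in try/except
        if text.lstrip("-").isdigit():
            value = int(text)
        else:
            return 1  # fall back to the default of one assert
    if isinstance(value, bool) or not isinstance(value, int):
        return 1
    return None if value == 0 else value

def slice_asserts(asserts: List[str], sandbox_slice: Optional[int]) -> List[str]:
    """Select which asserts the sandbox actually executes."""
    if sandbox_slice is None:
        return asserts
    return asserts[:sandbox_slice] if sandbox_slice > 0 else asserts[sandbox_slice:]
```

For example, `parse_sandbox_slice(-2)` yields `-2`, so `slice_asserts` keeps the last two asserts, matching the `external.sandbox_slice=-2` override shown in the README diff below.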

README.md

Lines changed: 11 additions & 13 deletions
@@ -52,7 +52,7 @@ python LLM_Collaboration_with_MARL/train_grpo.py \
 # Multi-turn override example
 python LLM_Collaboration_with_MARL/train_magrpo.py \
   --config LLM_Collaboration_with_MARL/configs/mt_magrpo_che_config.yaml \
-  --override dataset.train_split='test[:20]' dataset.eval_split='test[20:30]' \
+  --override dataset.train_split='test[16:]' dataset.eval_split='test[:16]' \
   magrpo.num_turns=2 magrpo.turn_gradient_weights=[1.5,0.5]
 ```
 ### Legacy Command-Line Args
@@ -84,42 +84,40 @@ python LLM_Collaboration_with_MARL/train_magrpo.py \

 ### External Modes

-Multi-turn training supports external transition modes for 2nd+ turns, set via `magrpo.external_mode`:
+Multi-turn training supports external transition modes for 2nd+ turns, set via `external.mode`:

-- `expert_edits` **(default)**: Uses an expert LLM to suggest edits.
-  - Requires `magrpo.expert_model` in config (e.g., `deepseek-coder`, Claude, etc.).
-  - Requires corresponding API keys in env vars.
+- `level_feedback` **(default)**: Detailed diagnostics (impl found, syntax with line/col, per-test pass/fail errors, aux usage).
+  - Requires `external.expert_model` in config when using `expert_edits` (e.g., `deepseek-coder`, Claude, etc.). This parameter is ignored for other modes (`level_feedback`, `level_passed`, `passed`, `plain`).
+  - Requires corresponding API keys in env vars.
 - `level_passed`: Binary passed signals (impl found, syntax, tests summary, aux usage).
-- `level_feedback`: Detailed diagnostics (impl found, syntax with line/col, per-test pass/fail errors, aux usage).
 - `passed`: A binary signal — "All levels passed" or "Not all levels passed".
 - `plain`: No signals or diagnostics.

 ```bash
 # HumanEval with detailed feedback signals
 python LLM_Collaboration_with_MARL/train_magrpo.py \
   --config LLM_Collaboration_with_MARL/configs/mt_magrpo_he_config.yaml \
-  --override magrpo.external_mode='level_feedback'
+  --override external.mode='level_feedback'
 ```

 ### Sandbox Tests

-The external modes obtain `entry_point` and tests via an internal resolver registered by the training script. **By default, the sandbox tests are the same as the dataset’s eval tests.**
-Note: `magrpo.sandbox_slice` only affects analysis-based modes (`level_feedback`, `level_passed`, `passed`), and it has no effect on `expert_edits`.
+The external modes obtain `entry_point` and tests via an internal resolver registered by the training script. **By default, the sandbox executes only the first assert (`sandbox_slice=1`).** Use all eval tests by setting `external.sandbox_slice` to `0`, `None`, or `'all'`. A negative value uses the last N asserts. Note: `external.sandbox_slice` only affects analysis-based modes (`level_feedback`, `level_passed`, `passed`), and it has no effect on `expert_edits`.

 ```bash
-# Add a magrpo.sandbox_slice to override
+# Add an external.sandbox_slice override
 python LLM_Collaboration_with_MARL/train_magrpo.py \
   --config LLM_Collaboration_with_MARL/configs/mt_magrpo_che_config.yaml \
-  --override magrpo.external_mode='level_feedback' magrpo.sandbox_slice=-2
+  --override external.mode='level_feedback' external.sandbox_slice=-2
 ```

 ### Handoff Strategy

-In MAGRPO, since agents generate a few responses per turn, we need to hand off one for efficiency, else the number of generations per turn will increase exponentially. External handoff controls which previous response is used as context for the later turns. **By default, the “best” completion per agent from the prior turn is used.** Random handoff requires the training loop to supply a candidate pool of previous-turn completions per agent to the external transition. If only a single completion per agent is available, random falls back to the best completion.
+In MAGRPO/GRPO multi-turn training, one prior completion per agent is handed off to keep compute bounded. The trainer selects it according to the `handoff` mode: **`random` (default)** or `best`. Selection happens in the CoMLRL trainer; external modes simply format the next-turn prompts using the provided completions. Configure via `magrpo.handoff` or `grpo.handoff` in your config or `--override`.


 ```bash
 python LLM_Collaboration_with_MARL/train_magrpo.py \
   --config LLM_Collaboration_with_MARL/configs/mt_magrpo_he_config.yaml \
-  --override magrpo.external_mode='plain' magrpo.external_handoff='random'
+  --override external.mode='plain' magrpo.handoff='best'
 ```
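The Handoff Strategy paragraph in the README hunk above states that the CoMLRL trainer keeps exactly one prior-turn completion per agent, picked either at random (the default) or as the best-rewarded one, with strict mode handling. The Python sketch below illustrates that selection under those assumptions; `select_handoff` and its signature are illustrative, not the trainer's actual code.

```python
# Illustrative sketch of handoff selection (random vs. best); hypothetical helper.
import random
from typing import Sequence

def select_handoff(completions: Sequence[str],
                   rewards: Sequence[float],
                   mode: str = "random",
                   rng: random.Random = random.Random(0)) -> str:
    """Pick one prior-turn completion per agent to carry into the next turn.

    mode="random": sample uniformly from the agent's candidate completions.
    mode="best":   take the completion with the highest reward.
    """
    if mode == "best":
        best_idx = max(range(len(rewards)), key=lambda i: rewards[i])
        return completions[best_idx]
    if mode == "random":
        return completions[rng.randrange(len(completions))]
    raise ValueError(f"Unknown handoff mode: {mode!r}")  # strict modes, no silent fallback

# Example: 4 generations for one agent, hand off the best-rewarded one
picked = select_handoff(["c0", "c1", "c2", "c3"], [0.1, 2.1, 0.5, 1.0], mode="best")
assert picked == "c1"
```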

configs/grpo_che_config.yaml

Lines changed: 22 additions & 17 deletions
@@ -1,6 +1,4 @@
-# Configuration for CoopHumanEval single-agent training with GRPO
-
-# Model configuration
+# model
 model:
   name: "Qwen/Qwen2.5-Coder-3B"
   type: "qwen"
@@ -13,21 +11,28 @@ model:
   trust_remote_code: true
   torch_dtype: "auto"

-# Dataset configuration
+# dataset
 dataset:
-  name: "CoMLRL/CoopHumaneval"
-  type: "coophumaneval" # Used to select formatters and reward function
-  train_split: "test[:50]"
-  eval_split: "test[50:66]"
+  name: "CoMLRL/CoopHumanEval"
+  type: "coophumaneval"
+  train_split: "test[16:]"
+  eval_split: "test[:16]"

-# Output configuration
+# output
 output:
-  base_dir: "../../../projects/bepg/tchen19/output"
-  save_final_model: true
+  base_dir: "../../../work/hdd/bepg/sliu30/output_st_grpo"
+  save_final_model: false
+
+# external
+external:
+  mode: "level_feedback"
+  sandbox_slice: 1
+  original_prompt: true
+  previous_response: true

-# GRPO training configuration
+# grpo
 grpo:
-  num_train_epochs: 20 # Same as multi-agent CHE
+  num_train_epochs: 16
   per_device_train_batch_size: 1
   learning_rate: 1.0e-5
   logging_steps: 50
@@ -36,13 +41,13 @@ grpo:
   max_new_tokens: 256
   temperature: 0.8
   top_p: 0.95
-  # Early termination threshold for single-agent (GRPO)
+  handoff: random
   early_termination_threshold: 2.1

-# Wandb configuration
+# wandb
 wandb:
   project: "mlrl"
   entity: "nu-llpr"
-  name: "grpo_coophumaneval" # Will be appended with model name in script
-  dir: "../../../projects/bevi/sliu30"
+  name: "grpo_coophumaneval"
+  dir: "../../../work/hdd/bepg/sliu30/output_st_grpo"
   tags: ["grpo", "coophumaneval", "single-agent"]

configs/grpo_he_config.yaml

Lines changed: 21 additions & 17 deletions
@@ -1,7 +1,4 @@
-# Configuration for HumanEval single-agent training with GRPO
-# Based on train_he_single_agent.py parameters
-
-# Model configuration
+# model
 model:
   name: "Qwen/Qwen2.5-Coder-3B"
   type: "qwen"
@@ -14,36 +11,43 @@ model:
   trust_remote_code: true
   torch_dtype: "auto"

-# Dataset configuration
+# dataset
 dataset:
   name: "openai/openai_humaneval"
-  type: "humaneval" # Used to select formatters and reward function
-  train_split: "test[33:133]"
+  type: "humaneval"
+  train_split: "test[33:163]"
   eval_split: "test[:32]"

-# Output configuration
+# output
 output:
-  base_dir: "../../../projects/bepg/tchen19/output"
-  save_final_model: true
+  base_dir: "../../../work/hdd/bepg/sliu30/output_st_grpo"
+  save_final_model: false
+
+# external
+external:
+  mode: "level_feedback"
+  sandbox_slice: 1
+  original_prompt: true
+  previous_response: true

-# GRPO training configuration
+# grpo
 grpo:
-  num_train_epochs: 10
+  num_train_epochs: 8
   per_device_train_batch_size: 1
   learning_rate: 1.0e-5
   logging_steps: 50
   save_steps: 200
-  num_generations: 4 # Number of completions to generate per prompt
+  num_generations: 4
   max_new_tokens: 256
   temperature: 0.8
   top_p: 0.95
-  # Early termination threshold for single-agent (GRPO)
+  handoff: random
   early_termination_threshold: 2.1

-# Wandb configuration
+# wandb
 wandb:
   project: "mlrl"
   entity: "nu-llpr"
-  name: "grpo_humaneval" # Will be appended with model name in script
-  dir: "../../../projects/bepg/sliu30"
+  name: "grpo_humaneval"
+  dir: "../../../work/hdd/bepg/sliu30/output_st_grpo"
   tags: ["grpo", "humaneval", "single-agent"]

configs/magrpo_che_config.yaml

Lines changed: 22 additions & 18 deletions
@@ -1,7 +1,4 @@
-# Configuration for CoopHumanEval training with MAGRPO
-# Exact parameters from train_che.py
-
-# Model configuration
+# model
 model:
   name: "Qwen/Qwen2.5-Coder-3B"
   type: "qwen"
@@ -14,21 +11,28 @@ model:
   trust_remote_code: true
   torch_dtype: "auto"

-# Dataset configuration
+# dataset
 dataset:
-  name: "LovelyBuggies/CoopHumaneval"
-  type: "coophumaneval" # Used to select formatters and reward function
-  train_split: "test[:50]"
-  eval_split: "test[50:66]"
+  name: "CoMLRL/CoopHumanEval"
+  type: "coophumaneval"
+  train_split: "test[16:]"
+  eval_split: "test[:16]"

-# Output configuration
+# output
 output:
-  base_dir: "../../../projects/bepg/sliu30/output"
-  save_final_model: true
+  base_dir: "../../../work/hdd/bepg/sliu30/output_st_magrpo"
+  save_final_model: false
+
+# external
+external:
+  mode: "level_feedback"
+  sandbox_slice: 1
+  original_prompt: true
+  previous_response: true

-# MAGRPO training configuration
+# magrpo
 magrpo:
-  num_train_epochs: 20 # Exact value from train_che.py
+  num_train_epochs: 16
   per_device_train_batch_size: 1
   learning_rate: 2.0e-5
   logging_steps: 50
@@ -38,13 +42,13 @@ magrpo:
   temperature: 0.8
   top_p: 0.95
   num_agents: 2
-  # Early termination threshold for multi-agent (MAGRPO)
+  handoff: random
   early_termination_threshold: 4.0

-# Wandb configuration
+# wandb
 wandb:
   project: "mlrl"
   entity: "nu-llpr"
-  name: "magrpo_coophumaneval" # Will be appended with model name in script
-  dir: "../../../projects/bevi/sliu30"
+  name: "magrpo_coophumaneval"
+  dir: "../../../work/hdd/bepg/sliu30/output_st_magrpo"
   tags: ["magrpo", "coophumaneval", "multi-agent"]

configs/magrpo_he_config.yaml

Lines changed: 21 additions & 17 deletions
@@ -1,9 +1,6 @@
-# Configuration for HumanEval training with MAGRPO
-# This file defines all parameters for training experiments
-
-# Model configuration
+# model
 model:
-  name: "Qwen/Qwen2.5-Coder-3B" # Options: "Qwen/Qwen2.5-Coder-3B", "bigcode/starcoder2-3b", etc.
+  name: "Qwen/Qwen2.5-Coder-3B"
   type: "qwen"
   temperature: 0.7
   top_p: 0.9
@@ -14,35 +11,42 @@ model:
   trust_remote_code: true
   torch_dtype: "auto"

-# Dataset configuration
+# dataset
 dataset:
   name: "openai/openai_humaneval"
-  type: "humaneval" # Used to select formatters and reward function
-  train_split: "test[33:133]"
+  type: "humaneval"
+  train_split: "test[33:163]"
   eval_split: "test[:32]"

-# Output configuration
+# output
 output:
-  base_dir: "../../../projects/bepg/sliu30/output"
-  save_final_model: true
+  base_dir: "../../../work/hdd/bepg/sliu30/output_st_magrpo"
+  save_final_model: false
+
+# external
+external:
+  mode: "level_feedback"
+  sandbox_slice: 1
+  original_prompt: true
+  previous_response: true

-# MAGRPO training configuration
+# magrpo
 magrpo:
-  num_train_epochs: 10
+  num_train_epochs: 8
   per_device_train_batch_size: 1
   learning_rate: 2.0e-5
   logging_steps: 50
   save_steps: 200
   num_generations: 4
   max_new_tokens: 256
   num_agents: 2
-  # Early termination threshold for multi-agent (MAGRPO)
+  handoff: random
   early_termination_threshold: 4.0

-# Wandb configuration
+# wandb
 wandb:
   project: "mlrl"
   entity: "nu-llpr"
-  name: "magrpo_humaneval" # Will be appended with model name in script
-  dir: "../../../projects/bepg/sliu30"
+  name: "magrpo_humaneval"
+  dir: "../../../work/hdd/bepg/sliu30/output_st_magrpo"
   tags: ["magrpo", "humaneval", "multi-agent"]

configs/mt_grpo_che_config.yaml

Lines changed: 22 additions & 20 deletions
@@ -1,7 +1,4 @@
-# Configuration for Multi-Turn CoopHumanEval training with GRPO (single-agent)
-# Based on mt_magrpo_che_config.yaml parameters but adapted for single-agent
-
-# Model configuration
+# model
 model:
   name: "Qwen/Qwen2.5-Coder-3B"
   type: "qwen"
@@ -14,22 +11,29 @@ model:
   trust_remote_code: true
   torch_dtype: "bfloat16"

-# Dataset configuration
+# dataset
 dataset:
-  name: "LovelyBuggies/CoopHumaneval"
-  type: "coophumaneval" # Used to select formatters and reward function
-  train_split: "test[:50]"
-  eval_split: "test[50:66]"
+  name: "CoMLRL/CoopHumanEval"
+  type: "coophumaneval"
+  train_split: "test[16:]"
+  eval_split: "test[:16]"

-# Output configuration
+# output
 output:
-  base_dir: "../../../projects/bevi/sliu30/output_mt"
-  save_final_model: true
+  base_dir: "../../../work/hdd/bepg/sliu30/output_mt_grpo"
+  save_final_model: false
+
+# external
+external:
+  mode: "level_feedback"
+  sandbox_slice: 1
+  original_prompt: true
+  previous_response: true

-# GRPO training configuration (multi-turn enabled via num_turns)
+# grpo
 grpo:
   num_turns: 2
-  num_train_epochs: 10 # Reduced from 20 for multi-turn
+  num_train_epochs: 8
   per_device_train_batch_size: 1
   learning_rate: 2.0e-5
   logging_steps: 50
@@ -38,17 +42,15 @@ grpo:
   max_new_tokens: 256
   temperature: 0.8
   top_p: 0.95
-  # Multi-turn specific parameters
+  handoff: random
   turn_gradient_weights: [1.2, 0.8]
   early_termination_weight: 2.0
   early_termination_threshold: 2.1
-  external_mode: "expert_edits" # Options: expert_edits (default), level_passed, level_feedback, passed, plain
-  expert_model: "deepseek-coder" # Used by expert_edits mode only
 
-# Wandb configuration
+# wandb
 wandb:
   project: "mlrl"
   entity: "nu-llpr"
-  name: "mt_grpo_coophumaneval" # Will be appended with model name in script
-  dir: "../../../projects/bevi/sliu30"
+  name: "mt_grpo_coophumaneval"
+  dir: "../../../work/hdd/bepg/sliu30/output_mt_grpo"
   tags: ["mt_grpo", "coophumaneval", "single-agent", "multi-turn"]
