# LLM Collaboration with MARL

Training scripts and configurations for the paper _"LLM Collaboration with Multi‑Agent Reinforcement Learning"_.

## Benchmarks

- MBPP: 427 problems on split `sanitized`
- HumanEval: 164 problems on split `test`
- CoopHumanEval: 82 problems on split `test`

## Training Scripts

### Default Configs

```bash
# Single-agent HumanEval (GRPO)
python LLM_Collaboration_with_MARL/train_grpo.py \
    --config LLM_Collaboration_with_MARL/configs/grpo_he_config.yaml

# Multi-agent CoopHumanEval (MAGRPO)
python LLM_Collaboration_with_MARL/train_magrpo.py \
    --config LLM_Collaboration_with_MARL/configs/magrpo_che_config.yaml
```

### Parameter Overrides

You can override any configuration parameter using `--override`:

```bash
python LLM_Collaboration_with_MARL/train_magrpo.py \
    --config LLM_Collaboration_with_MARL/configs/magrpo_he_config.yaml \
    --override model.name='bigcode/starcoder2-3b' magrpo.num_turns=1
```

## Settings

### Joint Action Modes

`magrpo.joint_mode` determines how each agent's K generations are combined into joint actions at each turn. Two modes are supported: with 'align' (the default), each agent's k-th generation is paired with the other agents' k-th generations to form a joint action; with 'cross', all combinations of the agents' K generations are used, giving K^N joint actions for N agents.

Because the number of samples also grows exponentially with the number of turns, aligned joint is **more flexible** (the number of samples need not be a perfect power) and therefore faster to train in wall-clock time. Cross joint, however, is more sample-efficient (much lower VRAM than running 'align' with num_generations=K^N) and tends to perform better because its value estimation is more accurate.

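For example, to switch to cross joint actions on the CoopHumanEval config (a sketch reusing the override syntax above; the mode names come from the description of `magrpo.joint_mode`):

```bash
# Form all K^N combinations of the agents' generations instead of aligned pairs
python LLM_Collaboration_with_MARL/train_magrpo.py \
    --config LLM_Collaboration_with_MARL/configs/magrpo_che_config.yaml \
    --override magrpo.joint_mode='cross'
```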

### Number of Turns

`magrpo.num_turns` sets the number of turns (`magrpo.num_turns=2` by default). The number of samples at each turn grows exponentially with the number of turns: K^TN at turn T with cross joint, and K^N with aligned joint.

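For example, a three-turn run with the default aligned joint (a sketch using the same override mechanism; budget VRAM for the extra samples):

```bash
# More turns means more samples per episode
python LLM_Collaboration_with_MARL/train_magrpo.py \
    --config LLM_Collaboration_with_MARL/configs/magrpo_he_config.yaml \
    --override magrpo.num_turns=3
```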

### Early Termination

`magrpo.termination_threshold` incentivizes agents to find high-reward solutions quickly, instead of expanding the full Monte Carlo tree.

At each node (branch, turn), the mean immediate reward is computed across the **sibling joint actions** at that node. If the mean exceeds the threshold, that branch stops expanding at this turn and training backpropagates from the truncated subtree; other branches continue.

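For example (the threshold value below is an arbitrary placeholder and should be chosen to match the scale of your reward function):

```bash
# Stop expanding a branch once its sibling joint actions average above the threshold
python LLM_Collaboration_with_MARL/train_magrpo.py \
    --config LLM_Collaboration_with_MARL/configs/magrpo_che_config.yaml \
    --override magrpo.termination_threshold=0.8
```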

### Multi-Turn Prompt

`external.original_prompt` and `external.previous_response` both default to `true`: prompts at turn 2 and later include both the original first‑turn problem prompt and the previous response, preserving full context. You can shorten the context by setting either to `false` (for example, keep only the previous response to reduce tokens while retaining the most recent interaction).

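For example, to keep only the previous response in 2+ turn prompts and drop the original problem statement (a sketch using the override syntax above):

```bash
# Shorter later-turn prompts: previous response only
python LLM_Collaboration_with_MARL/train_magrpo.py \
    --config LLM_Collaboration_with_MARL/configs/magrpo_he_config.yaml \
    --override external.original_prompt=false external.previous_response=true
```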

### External Modes

`external.mode` is set to 'level_feedback' by default. It controls what additional information the external environment adds to prompts in the following turns: 'level_feedback' attaches test‑driven diagnostics, while the alternatives are 'expert_edits' (an LLM proposes edits), 'level_passed' and 'passed' (binary outcomes), and 'plain' (no signals).

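For example, to turn off external signals entirely, e.g. as an ablation (a sketch using the override syntax above):

```bash
# 2+ turn prompts receive no external diagnostics
python LLM_Collaboration_with_MARL/train_magrpo.py \
    --config LLM_Collaboration_with_MARL/configs/magrpo_he_config.yaml \
    --override external.mode='plain'
```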

A setting specific to the analysis-based modes is `external.sandbox_slice`, which controls how many eval tests are included in the feedback. By default, the sandbox executes only the first assert (`sandbox_slice=1`). Set `external.sandbox_slice` to `0`, `None`, or `'all'` to use all eval tests; a negative value uses the last N asserts. `external.sandbox_slice` only affects the analysis-based modes ('level_feedback', 'level_passed', 'passed') and has no effect on 'expert_edits'.

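For example, to include every eval test in the feedback instead of only the first assert:

```bash
# sandbox_slice=0 runs all eval asserts when building level feedback
python LLM_Collaboration_with_MARL/train_magrpo.py \
    --config LLM_Collaboration_with_MARL/configs/magrpo_che_config.yaml \
    --override external.mode='level_feedback' external.sandbox_slice=0
```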

The setting specific to 'expert_edits' is `external.expert_edits_model`, which selects the LLM used to propose edits. It defaults to DeepSeek-Coder; you can switch to Claude-3 or GPT-4 once the corresponding API keys are set in your environment variables.

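For example (the model identifier string below is a placeholder; the accepted values depend on your provider setup and require the matching API key in your environment):

```bash
# Use an expert LLM to propose edits between turns
python LLM_Collaboration_with_MARL/train_magrpo.py \
    --config LLM_Collaboration_with_MARL/configs/magrpo_he_config.yaml \
    --override external.mode='expert_edits' external.expert_edits_model='gpt-4'
```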

### Output

`output.save_model` is set to `false` by default because saving multiple LLMs requires a large amount of storage. `verbose` enables debug printing on a cluster when set to `true`; it defaults to `false`, in which case only a tqdm bar shows training progress. You can also turn on `magrpo.log_code_levels` to log level rewards during training, but it slows training down considerably.

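For example, to save the trained agents and log level rewards despite the slowdown (a sketch using the override syntax above):

```bash
# Persist checkpoints and enable the (slow) level-reward logging
python LLM_Collaboration_with_MARL/train_magrpo.py \
    --config LLM_Collaboration_with_MARL/configs/magrpo_che_config.yaml \
    --override output.save_model=true magrpo.log_code_levels=true
```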