`magrpo.joint_mode` determines how to combine each agent's $G$ generations into joint actions at each turn. Two modes are supported: `align` (default), which pairs the $g$‑th generation of every agent to form $G$ joint actions per node; and `cross`, which forms the Cartesian product within a node, yielding $G^N$ joint actions per node ($N$ agents). Total leaf joint trajectories after $T$ turns (no early termination): `align` → $G^T$; `cross` → $G^{N\cdot T}$.
The `align` mode is faster in wall time (fewer sibling evaluations per node), while `cross` is more sample‑efficient (better value estimation) without extra VRAM, because it reuses the same $G$ generations per agent and only crosses them within the node. We never cross across different nodes/prompts; this preserves causal state consistency (actions are conditioned on the same prompts), keeps siblings comparable for the baseline/advantage, maintains correct credit assignment (log‑probs matched to rewards from the same state), and remains computationally tractable.
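As a rough illustration (not the trainer's actual code; the function name and example values below are hypothetical), the two combination modes at a single node can be sketched as:

```python
from itertools import product

def form_joint_actions(per_agent_generations, joint_mode="align"):
    """Combine each agent's G generations into joint actions at one node.

    per_agent_generations: list of N lists, each holding that agent's G candidates.
    """
    G = len(per_agent_generations[0])
    if joint_mode == "align":
        # Pair the g-th generation of every agent -> G joint actions per node.
        return [tuple(gens[g] for gens in per_agent_generations) for g in range(G)]
    if joint_mode == "cross":
        # Cartesian product within the node -> G^N joint actions per node.
        return list(product(*per_agent_generations))
    raise ValueError(f"unknown joint_mode: {joint_mode}")

# Two agents (N = 2), G = 2 generations each.
gens = [["a1", "a2"], ["b1", "b2"]]
print(form_joint_actions(gens, "align"))  # [('a1', 'b1'), ('a2', 'b2')]
print(form_joint_actions(gens, "cross"))  # [('a1', 'b1'), ('a1', 'b2'), ('a2', 'b1'), ('a2', 'b2')]
```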
### Number of Samples
`magrpo.num_turns` is the number of turns in training and evaluation, and `magrpo.num_generations` is the number of generations $G$ sampled per agent at each turn. Leaf counts (total joint samples at the current turn) grow with $T$: `aligned` → $G^T$; `cross` → $G^{N\cdot T}$. At each node, the sibling set (competing joint actions under the same prompt/context/turn) has size $G$ for `aligned` and $G^N$ for `cross`. The policy‑gradient baseline is the mean return over these siblings at that node, i.e., advantage $A_i = \mathrm{Return}_i - \operatorname{mean}_{\text{sibling}}(\mathrm{Return})$.
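As a small numeric sketch of this baseline (the returns below are made up, not produced by the trainer):

```python
import numpy as np

# Hypothetical returns for the sibling joint actions at one node
# (size G under `aligned`, G^N under `cross`).
sibling_returns = np.array([1.0, 0.2, 0.8, 0.6])

# Mean-baseline advantage: A_i = Return_i - mean return over siblings at the node.
advantages = sibling_returns - sibling_returns.mean()
print(advantages)  # [ 0.35 -0.45  0.15 -0.05]
```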
### Termination
`magrpo.termination_threshold` is used to incentivize agents to find high‑reward solutions quickly instead of expanding the full Monte Carlo tree. At each node (branch, turn), we compute the mean immediate reward across that node’s sibling joint actions; if the mean exceeds the threshold, that branch stops expanding at this turn and the trainer backpropagates from the truncated subtree. Other branches continue.
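A minimal sketch of that node‑level check (the function name and values are illustrative, not the trainer's API):

```python
def should_stop_branch(sibling_rewards, termination_threshold):
    """Stop expanding a branch when the mean immediate reward across the
    node's sibling joint actions exceeds the termination threshold."""
    mean_reward = sum(sibling_rewards) / len(sibling_rewards)
    return mean_reward > termination_threshold

# Mean sibling reward 0.8 > threshold 0.7 -> this branch stops expanding here.
print(should_stop_branch([0.9, 0.7, 0.8], termination_threshold=0.7))  # True
```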
### History Controls
`external.memory_mode` controls how much history each agent receives and in what form; it must be one of `last`, `full`, or `memoryful`.

- `full` (default): include all prior prompts and responses, per the flags below, as a compact "History" block.
- `last`: include only the first‑turn prompt and the most recent response, per the flags below.
- `memoryful`: rely on the model's internal state; the trainer carries a per‑agent KV cache across turns and continues generation from it, so prompts omit explicit history.

In addition, `external.previous_prompts` and `external.previous_responses` determine which parts of the agent‑wise history are inserted into the next‑turn prompt text: under `last`, `previous_prompts` includes the agent's first‑turn prompt and `previous_responses` includes only the most recent response; under `full`, they include all prior prompts and all prior responses, respectively; under `memoryful`, neither is injected into the text because the per‑agent KV cache already carries this context.
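A rough sketch of how the injected history text might be assembled under these settings (the helper name and prompt format are assumptions; `memoryful` builds no history text because context lives in the KV cache):

```python
def build_history_block(prompts, responses, memory_mode,
                        previous_prompts=True, previous_responses=True):
    """Assemble the history text injected into one agent's next-turn prompt.

    prompts/responses: this agent's prior prompts and responses, oldest first.
    """
    if memory_mode == "memoryful":
        return ""  # context is carried by the per-agent KV cache, not the prompt
    if memory_mode == "last":
        kept_prompts = prompts[:1] if previous_prompts else []         # first-turn prompt only
        kept_responses = responses[-1:] if previous_responses else []  # most recent response only
    else:  # "full"
        kept_prompts = prompts if previous_prompts else []
        kept_responses = responses if previous_responses else []
    lines = [f"Prompt: {p}" for p in kept_prompts] + [f"Response: {r}" for r in kept_responses]
    return ("History:\n" + "\n".join(lines)) if lines else ""
```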