Official implementation of "Mem-α: Learning Memory Construction via Reinforcement Learning".
- Overview
- Installation
- Dataset Preparation
- Training
- Evaluation
- Dataset Processing
- Citation
- Acknowledgments
- License
 
## Overview

Large language model (LLM) agents are constrained by limited context windows, necessitating external memory systems for long-term information understanding. Mem-α is a reinforcement learning framework that trains agents to effectively manage complex memory systems through interaction and feedback.
Key Features:

- 🧠 Advanced Memory Architecture: Core, episodic, and semantic memory components (see the sketch below)
- 🎯 Reinforcement Learning Framework: Direct optimization for memory construction
- 📈 Strong Generalization: Trained on 30k tokens, generalizes to 400k+ tokens (13x the training length)
- 🚀 State-of-the-art Performance: Significant improvements over existing memory-augmented agents
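
For intuition, the three memory types can be pictured as a simple container, as in the Python sketch below. This is illustrative only; the field names are ours, not the repo's API:

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Illustrative view of the three memory components (names are hypothetical)."""
    core: str = ""                                       # always-in-context facts (persona, task)
    episodic: list[str] = field(default_factory=list)    # time-ordered events and interactions
    semantic: list[str] = field(default_factory=list)    # distilled, reusable knowledge
```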
 
Resources:

- 📄 Paper
- 🤗 Model (Memalpha-4B)
- 📊 Training Dataset
- 📊 Evaluation Dataset - Processed Version
- 📊 MemoryAgentBench - Original
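
If you just want to poke at the released checkpoint, it should load like any Hugging Face causal LM. A minimal sketch, assuming the repo id is `YuWangX/Memalpha-4B` (inferred from the links above; verify on the Hub):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id is an assumption based on the resource links above.
model_id = "YuWangX/Memalpha-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
```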
 
## Installation

```bash
# Clone the repository
git clone git@github.com:wangyu-ustc/Mem-alpha.git
cd Mem-alpha

# Install dependencies
pip install -r requirements.txt
```

## Dataset Preparation

Create a data folder in the project root and download the datasets:
```bash
# Download Memalpha training/test dataset
git clone https://huggingface.co/datasets/YuWangX/Memalpha ./data/memalpha

# Download MemoryAgentBench evaluation dataset (processed version for this project)
git clone https://huggingface.co/datasets/YuWangX/Memalpha-Memoryagentbench ./data/memoryagentbench
```

Note: We use a processed version of the original MemoryAgentBench dataset. See Dataset Processing for details.
Expected directory structure:

```
data/
├── memalpha/
│   ├── train.parquet
│   └── test.parquet
└── memoryagentbench/
    ├── train.parquet
    └── test.parquet
```
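
To sanity-check the download, the parquet splits can be loaded with pandas (a minimal sketch; the exact column layout depends on the released files):

```python
import pandas as pd

# Load the Memalpha splits and inspect their size and fields.
train = pd.read_parquet("data/memalpha/train.parquet")
test = pd.read_parquet("data/memalpha/test.parquet")
print(f"train: {len(train)} rows, test: {len(test)} rows")
print(train.columns.tolist())
```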
Note: If you prefer to process the datasets from scratch, see Dataset Processing.
## Training

To train the Memalpha-4B model with the optimal hyperparameters (β=0.05, γ=0.1):

```bash
bash scripts/train_memory_grpo_qwen3-4b-4node-compression0.05-content0.1.sh
```

The following scripts reproduce the ablation study results from the paper:
```bash
# β=0.05, γ=0.0 (No content reward)
bash scripts/train_memory_grpo_qwen3-4b-4node-compression0.05-content0.0.sh

# β=0.0, γ=0.1 (No compression reward)
bash scripts/train_memory_grpo_qwen3-4b-4node-compression0.0-content0.1.sh

# β=0.05, γ=0.1 (Main configuration)
bash scripts/train_memory_grpo_qwen3-4b-4node-compression0.05-content0.1.sh

# β=0.2, γ=0.1 (Higher compression penalty)
bash scripts/train_memory_grpo_qwen3-8b-4node-compression0.2-content0.1.sh

# β=0.4, γ=0.1 (Highest compression penalty)
bash scripts/train_memory_grpo_qwen3-4b-4node-compression0.4-content0.1.sh
```

Parameter explanations:
- β (beta): Compression reward coefficient; penalizes excessive memory usage
- γ (gamma): Content reward coefficient; measures whether information is stored in the correct memory type (see the sketch below for how the two combine)
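
As a rough illustration of how the two coefficients enter the scalar reward, here is a hedged sketch of our reading of the setup. The signature and scaling are hypothetical; see the training scripts and the paper for the actual formula:

```python
def total_reward(task_score: float, mem_usage: float, content_score: float,
                 beta: float = 0.05, gamma: float = 0.1) -> float:
    """Illustrative reward combination (not the repo's exact implementation).

    task_score:    downstream answer correctness (the primary reward)
    mem_usage:     memory size relative to the input, in [0, 1]; penalized by beta
    content_score: fraction of information placed in the correct memory type; weighted by gamma
    """
    return task_score - beta * mem_usage + gamma * content_score
```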
 
## Evaluation

Before evaluation, start the memory server that manages the memory system (the `--server_url` flag should point at an OpenAI-compatible LLM endpoint, e.g. a local vLLM server):

```bash
python memory_server.py --port 5006 --server_url http://localhost:8001/v1
```
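
Before launching evaluation, it can help to confirm the server is up. A minimal reachability check, assuming only the port from the command above (nothing about the server's HTTP routes):

```python
import socket

# Confirm the memory server from the previous step is accepting connections on :5006.
with socket.create_connection(("localhost", 5006), timeout=2):
    print("memory server is reachable on port 5006")
```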
Evaluate the trained Memalpha-4B model on both datasets:

```bash
# Evaluate on Memalpha dataset
python main.py --agent_config config/memalpha-qwen3-4b_agent_0.05-0.1.yaml --dataset memalpha

# Evaluate on MemoryAgentBench dataset
python main.py --agent_config config/memalpha-qwen3-4b_agent_0.05-0.1.yaml --dataset memoryagentbench
```

We provide evaluation scripts for several baseline methods:
Evaluate long-context models and BM25-based retrieval on both datasets:
```bash
# Memalpha dataset
python long_context_eval.py --model qwen3-32b --dataset memalpha              # Long-context baseline
python long_context_eval.py --model qwen3-32b-bm25 --dataset memalpha         # BM25 retrieval
python long_context_eval.py --model gpt-4o-mini --dataset memalpha            # GPT-4o-mini
python long_context_eval.py --model memagent-14b --dataset memalpha           # MemAgent baseline

# MemoryAgentBench dataset
python long_context_eval.py --model qwen3-32b --dataset memoryagentbench
python long_context_eval.py --model qwen3-32b-bm25 --dataset memoryagentbench
python long_context_eval.py --model gpt-4o-mini --dataset memoryagentbench
python long_context_eval.py --model memagent-14b --dataset memoryagentbench
```

To evaluate MEM1 as a baseline:
```bash
# Start the vLLM server for MEM1
cd MEM1/Mem1/inference
bash start_vllm.sh
cd ../../..

# Run MEM1 evaluation on both datasets
cd MEM1/Mem1
python inference/generate_rollout.py \
    --model Mem-Lab/Qwen2.5-7B-RL-RAG-Q2-EM-Release \
    --use_mem1 \
    --data_file ../../data/memalpha/test.parquet
python inference/generate_rollout.py \
    --model Mem-Lab/Qwen2.5-7B-RL-RAG-Q2-EM-Release \
    --use_mem1 \
    --data_file ../../data/memoryagentbench/test.parquet
```

## Dataset Processing

If you want to build the Memalpha dataset from scratch instead of downloading it:
```bash
# Process individual datasets
python process_data.py --dataset squad
python process_data.py --dataset squad --split-single-dataset
python process_data.py --dataset hotpotqa
python process_data.py --dataset hotpotqa --split-single-dataset
python process_data.py --dataset booksum
python data_preprocess/extract_booksum_keywords.py --mode replace
python process_data.py --dataset booksum --split-single-dataset
python process_data.py --dataset pubmed-rct
python process_data.py --dataset pubmed-rct --split-single-dataset
python process_data.py --dataset perltqa
python process_data.py --dataset perltqa --split-single-dataset
python process_data.py --dataset ttl_train
python process_data.py --dataset ttl_train --split-single-dataset
python process_data.py --dataset lme_train
python process_data.py --dataset lme_train --split-single-dataset

# Merge all datasets
python process_data.py --merge-datasets pubmed-rct lme_train squad hotpotqa perltqa ttl_train booksum --limit-size 100
```

To build the MemoryAgentBench evaluation dataset from scratch:
```bash
# Process MemoryAgentBench components
python process_data.py --dataset accurate_retrieval
python process_data.py --dataset test_time_learning
python process_data.py --dataset long_range_understanding

# Merge into final evaluation set
python process_data.py --merge-datasets accurate_retrieval long_range_understanding test_time_learning --output-name memoryagentbench
```
⚠️ Warning: Since MemoryAgentBench is continuously updated, processing from scratch may yield different results than the published dataset. We recommend downloading our processed version directly from HuggingFace for reproducibility.

Note: Our evaluation uses a processed version of the original MemoryAgentBench dataset (paper). The processing scripts above show how we adapted it for our experiments.
## Citation

If you find this work helpful, please cite our paper:
```bibtex
@article{wang2025memalpha,
  title={Mem-$\alpha$: Learning Memory Construction via Reinforcement Learning},
  author={Wang, Yu and Takanobu, Ryuichi and Liang, Zhiqi and Mao, Yuzhen and Hu, Yuanzhe and McAuley, Julian and Wu, Xiaojian},
  journal={arXiv preprint arXiv:2509.25911},
  year={2025}
}
```

If you use our processed MemoryAgentBench dataset, please also cite the original work:
```bibtex
@article{hu2025evaluating,
  title={Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions},
  author={Hu, Yuanzhe and Wang, Yu and McAuley, Julian},
  journal={arXiv preprint arXiv:2507.05257},
  year={2025}
}
```

## License

This project is licensed under the Apache License 2.0; see the LICENSE file for details.
Contact: For questions or issues, please open an issue on GitHub or contact Yu Wang at yuw164@ucsd.edu.