Official implementation of "Mem-α: Learning Memory Construction via Reinforcement Learning".
- Overview
- Installation
- Dataset Preparation
- Training
- Evaluation
- Dataset Processing
- Citation
- Acknowledgments
- License
 
## Overview

Large language model (LLM) agents are constrained by limited context windows, necessitating external memory systems for long-term information understanding. Mem-α is a reinforcement learning framework that trains agents to effectively manage complex memory systems through interaction and feedback.
Key Features:

- 🧠 Advanced Memory Architecture: Core, episodic, and semantic memory components (see the sketch below)
- 🎯 Reinforcement Learning Framework: Direct optimization for memory construction
- 📈 Strong Generalization: Trained on 30k tokens, generalizes to 400k+ tokens (13x the training length)
- 🚀 State-of-the-art Performance: Significant improvements over existing memory-augmented agents
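
For intuition, the three memory types can be pictured as a simple container, as in the Python sketch below. This is illustrative only; the field names are ours, not the repo's API:

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Illustrative view of the three memory components (names are hypothetical)."""
    core: str = ""                                       # always-in-context facts (persona, task)
    episodic: list[str] = field(default_factory=list)    # time-ordered events and interactions
    semantic: list[str] = field(default_factory=list)    # distilled, reusable knowledge
```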
 
Resources:

- 📄 Paper
- 🤗 Model (Memalpha-4B)
- 📊 Training Dataset
- 📊 Evaluation Dataset - Processed Version
- 📊 MemoryAgentBench - Original
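
If you just want to poke at the released checkpoint, it should load like any Hugging Face causal LM. A minimal sketch, assuming the repo id is `YuWangX/Memalpha-4B` (inferred from the links above; verify on the Hub):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id is an assumption based on the resource links above.
model_id = "YuWangX/Memalpha-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
```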
 
## Installation

```bash
# Clone the repository
git clone git@github.com:wangyu-ustc/Mem-alpha.git
cd Mem-alpha

# Install dependencies
pip install -r requirements.txt
```

## Dataset Preparation

Create a data folder in the project root and download the datasets:
```bash
# Download Memalpha training/test dataset
git clone https://huggingface.co/datasets/YuWangX/Memalpha ./data/memalpha

# Download MemoryAgentBench evaluation dataset (processed version for this project)
git clone https://huggingface.co/datasets/YuWangX/Memalpha-Memoryagentbench ./data/memoryagentbench
```

Note: We use a processed version of the original MemoryAgentBench dataset. See Dataset Processing for details.
Expected directory structure:

```
data/
├── memalpha/
│   ├── train.parquet
│   └── test.parquet
└── memoryagentbench/
    ├── train.parquet
    └── test.parquet
```
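
To sanity-check the download, the parquet splits can be loaded with pandas (a minimal sketch; the exact column layout depends on the released files):

```python
import pandas as pd

# Load the Memalpha splits and inspect their size and fields.
train = pd.read_parquet("data/memalpha/train.parquet")
test = pd.read_parquet("data/memalpha/test.parquet")
print(f"train: {len(train)} rows, test: {len(test)} rows")
print(train.columns.tolist())
```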
Note: If you prefer to process the datasets from scratch, see Dataset Processing.
## Training

To train the Memalpha-4B model with the optimal hyperparameters (β=0.05, γ=0.1):

```bash
bash scripts/train_memory_grpo_qwen3-4b-4node-compression0.05-content0.1.sh
```

The following scripts reproduce the ablation study results from the paper:
```bash
# β=0.05, γ=0.0 (No content reward)
bash scripts/train_memory_grpo_qwen3-4b-4node-compression0.05-content0.0.sh

# β=0.0, γ=0.1 (No compression reward)
bash scripts/train_memory_grpo_qwen3-4b-4node-compression0.0-content0.1.sh

# β=0.05, γ=0.1 (Main configuration)
bash scripts/train_memory_grpo_qwen3-4b-4node-compression0.05-content0.1.sh

# β=0.2, γ=0.1 (Higher compression penalty)
bash scripts/train_memory_grpo_qwen3-8b-4node-compression0.2-content0.1.sh

# β=0.4, γ=0.1 (Highest compression penalty)
bash scripts/train_memory_grpo_qwen3-4b-4node-compression0.4-content0.1.sh
```

Parameter explanations:
- β (beta): Compression reward coefficient; penalizes excessive memory usage
- γ (gamma): Content reward coefficient; measures whether information is stored in the correct memory type (see the sketch below for how the two combine)
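
As a rough illustration of how the two coefficients enter the scalar reward, here is a hedged sketch of our reading of the setup. The signature and scaling are hypothetical; see the training scripts and the paper for the actual formula:

```python
def total_reward(task_score: float, mem_usage: float, content_score: float,
                 beta: float = 0.05, gamma: float = 0.1) -> float:
    """Illustrative reward combination (not the repo's exact implementation).

    task_score:    downstream answer correctness (the primary reward)
    mem_usage:     memory size relative to the input, in [0, 1]; penalized by beta
    content_score: fraction of information placed in the correct memory type; weighted by gamma
    """
    return task_score - beta * mem_usage + gamma * content_score
```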
 
## Evaluation

Before evaluation, start the memory server that manages the memory system (the `--server_url` flag should point at an OpenAI-compatible LLM endpoint, e.g. a local vLLM server):

```bash
python memory_server.py --port 5006 --server_url http://localhost:8001/v1
```
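
Before launching evaluation, it can help to confirm the server is up. A minimal reachability check, assuming only the port from the command above (nothing about the server's HTTP routes):

```python
import socket

# Confirm the memory server from the previous step is accepting connections on :5006.
with socket.create_connection(("localhost", 5006), timeout=2):
    print("memory server is reachable on port 5006")
```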
Evaluate the trained Memalpha-4B model on both datasets:

```bash
# Evaluate on Memalpha dataset
python main.py --agent_config config/memalpha-qwen3-4b_agent_0.05-0.1.yaml --dataset memalpha

# Evaluate on MemoryAgentBench dataset
python main.py --agent_config config/memalpha-qwen3-4b_agent_0.05-0.1.yaml --dataset memoryagentbench
```

We provide evaluation scripts for several baseline methods:
Evaluate long-context models and BM25-based retrieval on both datasets:
```bash
# Memalpha dataset
python long_context_eval.py --model qwen3-32b --dataset memalpha              # Long-context baseline
python long_context_eval.py --model qwen3-32b-bm25 --dataset memalpha         # BM25 retrieval
python long_context_eval.py --model gpt-4o-mini --dataset memalpha            # GPT-4o-mini
python long_context_eval.py --model memagent-14b --dataset memalpha           # MemAgent baseline

# MemoryAgentBench dataset
python long_context_eval.py --model qwen3-32b --dataset memoryagentbench
python long_context_eval.py --model qwen3-32b-bm25 --dataset memoryagentbench
python long_context_eval.py --model gpt-4o-mini --dataset memoryagentbench
python long_context_eval.py --model memagent-14b --dataset memoryagentbench
```

To evaluate MEM1 as a baseline:
```bash
# Start the vLLM server for MEM1
cd MEM1/Mem1/inference
bash start_vllm.sh
cd ../../..

# Run MEM1 evaluation on both datasets
cd MEM1/Mem1
python inference/generate_rollout.py \
    --model Mem-Lab/Qwen2.5-7B-RL-RAG-Q2-EM-Release \
    --use_mem1 \
    --data_file ../../data/memalpha/test.parquet
python inference/generate_rollout.py \
    --model Mem-Lab/Qwen2.5-7B-RL-RAG-Q2-EM-Release \
    --use_mem1 \
    --data_file ../../data/memoryagentbench/test.parquet
```

## Dataset Processing

If you want to build the Memalpha dataset from scratch instead of downloading it:
```bash
# Process individual datasets
python process_data.py --dataset squad
python process_data.py --dataset squad --split-single-dataset
python process_data.py --dataset hotpotqa
python process_data.py --dataset hotpotqa --split-single-dataset
python process_data.py --dataset booksum
python data_preprocess/extract_booksum_keywords.py --mode replace
python process_data.py --dataset booksum --split-single-dataset
python process_data.py --dataset pubmed-rct
python process_data.py --dataset pubmed-rct --split-single-dataset
python process_data.py --dataset perltqa
python process_data.py --dataset perltqa --split-single-dataset
python process_data.py --dataset ttl_train
python process_data.py --dataset ttl_train --split-single-dataset
python process_data.py --dataset lme_train
python process_data.py --dataset lme_train --split-single-dataset

# Merge all datasets
python process_data.py --merge-datasets pubmed-rct lme_train squad hotpotqa perltqa ttl_train booksum --limit-size 100
```

To build the MemoryAgentBench evaluation dataset from scratch:
```bash
# Process MemoryAgentBench components
python process_data.py --dataset accurate_retrieval
python process_data.py --dataset test_time_learning
python process_data.py --dataset long_range_understanding

# Merge into final evaluation set
python process_data.py --merge-datasets accurate_retrieval long_range_understanding test_time_learning --output-name memoryagentbench
```
⚠️ Warning: Since MemoryAgentBench is continuously updated, processing from scratch may yield different results than the published dataset. We recommend downloading our processed version directly from HuggingFace for reproducibility.

Note: Our evaluation uses a processed version of the original MemoryAgentBench dataset (paper). The processing scripts above show how we adapted it for our experiments.
## Citation

If you find this work helpful, please cite our paper:
```bibtex
@article{wang2025memalpha,
  title={Mem-$\alpha$: Learning Memory Construction via Reinforcement Learning},
  author={Wang, Yu and Takanobu, Ryuichi and Liang, Zhiqi and Mao, Yuzhen and Hu, Yuanzhe and McAuley, Julian and Wu, Xiaojian},
  journal={arXiv preprint arXiv:2509.25911},
  year={2025}
}
```

If you use our processed MemoryAgentBench dataset, please also cite the original work:
```bibtex
@article{hu2025evaluating,
  title={Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions},
  author={Hu, Yuanzhe and Wang, Yu and McAuley, Julian},
  journal={arXiv preprint arXiv:2507.05257},
  year={2025}
}
```

## License

This project is licensed under the Apache License 2.0; see the LICENSE file for details.
Contact: For questions or issues, please open an issue on GitHub or contact Yu Wang at yuw164@ucsd.edu.