This article provides an introduction to the mathematical foundations and algorithmic frameworks used to align Large Language Models (LLMs) with human intentions, preferences, and values. We discuss standard alignment techniques, including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO). We also explore the theoretical underpinnings of learning from human preferences, drawing connections to inverse reinforcement learning (IRL) and discrete choice models. We present state-of-the-art algorithms in a tutorial style, discuss their advantages and limitations, and offer insights into practical implementation. Our exposition is intended to serve as a comprehensive resource for researchers and practitioners, providing both a foundational understanding of alignment methodologies and a framework for developing more robust and scalable alignment techniques.
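As a brief orientation before the full exposition, the equations below sketch the standard Bradley-Terry preference model, the KL-regularized RLHF objective, and the DPO loss as they are commonly written in the literature. The notation (policy \(\pi_\theta\), reference policy \(\pi_{\mathrm{ref}}\), reward \(r\), regularization strength \(\beta\), preferred and dispreferred responses \(y_w, y_l\)) is illustrative and may differ from the conventions adopted in the article itself.

% Bradley-Terry preference model: probability that response y_1 is preferred to y_2 given prompt x
P(y_1 \succ y_2 \mid x) = \sigma\bigl(r(x, y_1) - r(x, y_2)\bigr),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}

% KL-regularized RLHF objective: maximize the learned reward while staying close to the reference policy
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\bigl[ r(x, y) \bigr]
\;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\!\bigl( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr)

% DPO loss: the same objective reparameterized so the policy is trained directly on
% preference pairs (y_w preferred over y_l), with no explicit reward model
\mathcal{L}_{\mathrm{DPO}}(\theta) =
- \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
\;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]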
Figure: Overview of the alignment framework and methodologies discussed in this survey.
- Learning from Human Preferences, Reinforcement Learning from Human Feedback
- Learning from Demonstrations, Inverse Reinforcement Learning
- Alignment Methodologies
- Algorithm Design and Implementation
- Practical Considerations and Limitations
@article{zeng2025aligning,
title={Aligning Large Language Models with Human Feedback: Mathematical Foundations and Algorithm Design},
author={Zeng, Siliang and Viano, Luca and Li, Chenliang and Li, Jiaxiang and Cevher, Volkan and Wulfmeier, Markus and Ermon, Stefano and Garcia, Alfredo and Hong, Mingyi},
year={2025}
}
