A reinforcement learning-based system designed to detect and prevent jailbreak attempts in AI models, ensuring safe and controlled behavior under adversarial conditions.
This project implements a scalable, model-agnostic RL framework that interacts with LLM outputs and decides whether to allow, modify, or reject prompts, balancing task usefulness and safety. The framework supports a self-sustained training cycle for continuous improvement.
-
Prompt Processing
- Extract verb phrases from prompts using a custom DistilBERT encoder.
- Represent phrases as embeddings for the RL agent.
-
RL Agent
- Actions:
{allow, modify, reject}
- Reward balances usefulness vs penalties for harmful outputs.
- Closed feedback loop for adaptive policy improvement.
- Actions:
-
Scalability & Flexibility
- Works with any LLM.
- Defends against diverse jailbreak strategies.
- Supports continuous learning with improved filtering models.
Reinforcement_Learning_Framework_to_Prevent_Jailbreaks/
│
├── config/ # Configuration files for training and evaluation
├── datasets/ # Prompt datasets
├── resources/ # other resources
├── utils/ # Utility functions
├── main.py # Entry point for training and evaluation
├── logs.txt # Training/evaluation logs
├── README.md # Project description
└── LICENSE # MIT license
- Clone the repository:
git clone https://github.com/Manohara-Ai/Reinforcement_Learning_Framework_to_Prevent_Jailbreaks.git
-
Install dependencies
-
Run the RL agent:
python main.py
- Extend RL policies with multi-agent strategies for more complex scenarios.
- Improve embeddings and filtering with advanced NLP methods.
- Integrate with larger LLMs for broader deployment.
MIT License