Reinforcement Learning Framework to Prevent Jailbreaks

A reinforcement learning-based system designed to detect and prevent jailbreak attempts in AI models, ensuring safe and controlled behavior under adversarial conditions.

About

This project implements a scalable, model-agnostic RL framework that interacts with LLM outputs and decides whether to allow, modify, or reject prompts, balancing task usefulness and safety. The framework supports a self-sustained training cycle for continuous improvement.

Methodology

Prompt Processing
- Extract verb phrases from prompts using a custom DistilBERT encoder.
- Represent phrases as embeddings for the RL agent.
RL Agent
- Actions: {allow, modify, reject}
- Reward balances usefulness vs penalties for harmful outputs.
- Closed feedback loop for adaptive policy improvement.
Scalability & Flexibility
- Works with any LLM.
- Defends against diverse jailbreak strategies.
- Supports continuous learning with improved filtering models.

Repository Structure

Reinforcement_Learning_Framework_to_Prevent_Jailbreaks/
│
├── config/           # Configuration files for training and evaluation
├── datasets/         # Prompt datasets
├── resources/        # other resources
├── utils/            # Utility functions
├── main.py           # Entry point for training and evaluation
├── logs.txt          # Training/evaluation logs
├── README.md         # Project description
└── LICENSE           # MIT license

Usage

Clone the repository:

git clone https://github.com/Manohara-Ai/Reinforcement_Learning_Framework_to_Prevent_Jailbreaks.git

Install dependencies
Run the RL agent:

python main.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Reinforcement Learning Framework to Prevent Jailbreaks

About

Methodology

Repository Structure

Usage

Future Work

Contributors

License

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
config		config
datasets		datasets
resources		resources
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
jailbreak.jpeg		jailbreak.jpeg
logs.txt		logs.txt
main.py		main.py

License

Manohara-Ai/Reinforcement_Learning_Framework_to_Prevent_Jailbreaks

Folders and files

Latest commit

History

Repository files navigation

Reinforcement Learning Framework to Prevent Jailbreaks

About

Methodology

Repository Structure

Usage

Future Work

Contributors

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages