Reinforcement Learning Framework to Prevent Jailbreaks

A reinforcement learning-based system designed to detect and prevent jailbreak attempts in AI models, ensuring safe and controlled behavior under adversarial conditions.

Workflow


About

This project implements a scalable, model-agnostic RL framework that interacts with LLM outputs and decides whether to allow, modify, or reject prompts, balancing task usefulness and safety. The framework supports a self-sustained training cycle for continuous improvement.


Methodology

  1. Prompt Processing

    • Extract verb phrases from prompts using a custom DistilBERT encoder.
    • Represent the phrases as embeddings for the RL agent (a minimal embedding sketch follows this list).
  2. RL Agent

    • Actions: {allow, modify, reject}
    • Reward balances usefulness against penalties for harmful outputs (a reward-shaping sketch also follows this list).
    • Closed feedback loop for adaptive policy improvement.
  3. Scalability & Flexibility

    • Works with any LLM.
    • Defends against diverse jailbreak strategies.
    • Supports continuous learning with improved filtering models.
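
The following is a minimal sketch of the prompt-processing step, assuming a Hugging Face DistilBERT checkpoint and mean pooling; the repository's encoder is custom, so the model name, pooling choice, and example phrases are illustrative assumptions rather than the project's actual code.

# Sketch: embed extracted verb phrases with DistilBERT (assumptions noted above).
import torch
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

def embed_phrases(phrases):
    """Return one fixed-size embedding per verb phrase (mean-pooled hidden states)."""
    batch = tokenizer(phrases, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)         # zero out padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768)

# Example: verb phrases extracted upstream from one prompt become the agent's state.
state = embed_phrases(["ignore the safety rules", "write the exploit"]).mean(dim=0)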
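
Below is a minimal reward-shaping and action-selection sketch for the {allow, modify, reject} action space. The usefulness and harm scores, penalty weights, and epsilon-greedy selection are hypothetical placeholders, not the repository's actual reward model or policy.

# Sketch: reward trading off usefulness against harm, plus epsilon-greedy action choice.
import random

ACTIONS = ["allow", "modify", "reject"]

def reward(action, usefulness, harm, harm_penalty=2.0):
    """usefulness and harm are scores in [0, 1] from an external judge/filter (assumed)."""
    if action == "allow":
        return usefulness - harm_penalty * harm
    if action == "modify":
        # A rewritten prompt keeps part of the usefulness and removes most harm (assumed factors).
        return 0.5 * usefulness - 0.2 * harm_penalty * harm
    return 0.0  # reject: safe, but no task usefulness is gained

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick an action from per-action value estimates, exploring with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(q_values, key=q_values.get)

# One step of the closed feedback loop: act, observe the reward, then update the policy.
q = {"allow": 0.2, "modify": 0.4, "reject": 0.1}
action = epsilon_greedy(q)
r = reward(action, usefulness=0.8, harm=0.6)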

Repository Structure

Reinforcement_Learning_Framework_to_Prevent_Jailbreaks/
│
├── config/           # Configuration files for training and evaluation
├── datasets/         # Prompt datasets
├── resources/        # Other resources
├── utils/            # Utility functions
├── main.py           # Entry point for training and evaluation
├── logs.txt          # Training/evaluation logs
├── README.md         # Project description
└── LICENSE           # MIT license

Usage

  1. Clone the repository:
git clone https://github.com/Manohara-Ai/Reinforcement_Learning_Framework_to_Prevent_Jailbreaks.git
  2. Install dependencies.

  3. Run the RL agent:

python main.py
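
Assuming the project's dependencies are pinned in a requirements file (an assumption; check the repository for the exact file name), a typical end-to-end run looks like:

git clone https://github.com/Manohara-Ai/Reinforcement_Learning_Framework_to_Prevent_Jailbreaks.git
cd Reinforcement_Learning_Framework_to_Prevent_Jailbreaks
pip install -r requirements.txt   # assumed dependency file
python main.py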

Future Work

  • Extend RL policies with multi-agent strategies for more complex scenarios.
  • Improve embeddings and filtering with advanced NLP methods.
  • Integrate with larger LLMs for broader deployment.

Contributors

  • Manohara B M
  • Prarthana P Kulkarni

License

MIT License
