
BaxBench


[13.09.2025] Release v1.0.0

The repository has been updated to fully reproduce the latest version of the paper. If you are using BaxBench for your work, please consider rebasing to this latest version. The changes involve minor fixes in scenario tests and exploits, and migrating the environments to Debian Bullseye. See the full changelog in the corresponding release announcement.

👋 Overview

BaxBench is a coding benchmark for evaluating the ability of LLMs to generate correct and secure code in realistic, security-critical settings. Each coding task in BaxBench consists of a scenario, which describes the API the backend application should implement, and a framework, which fixes the implementation language and backend framework to use. The scenarios can be found in the scenarios directory, and all supported frameworks in the env directory.

For more details and model evaluations, read our paper BaxBench: Can LLMs Generate Correct and Secure Backends? or visit our website.


🚀 Installation

Prerequisites:

python 3.12: Install it from here.
docker: Follow the instructions for installing Docker Desktop here (Windows, MacOS, Linux) or for the Docker engine here (Linux). Make sure that Docker has root privileges on your machine.
pipenv: The project uses pipenv for package management. You can install pipenv by following the instructions here.
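
A quick way to verify that the prerequisites are available (these are standard version checks, not BaxBench-specific commands; on some systems the Python interpreter is invoked as python3):

python --version        # should report 3.12.x
docker run hello-world  # verifies that Docker can start containers
pipenv --version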

Setting up the environment and running scripts

After ensuring that all prerequisites are installed, you can set up the environment by running pipenv install from the root of the repository. Please ensure that this action does not change Pipfile.lock. To run any Python script in the project environment, execute the following command from the project root:

pipenv run python <path_to_python_script> <args>
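
For example, the following invocation prints the available command-line arguments of the main entry point (assuming src/main.py uses a standard argument parser that responds to --help):

pipenv run python src/main.py --help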

Setting API keys

To generate BaxBench task solutions, the repository requires the following API keys to be set as environment variables, e.g., in your .bashrc or the equivalent configuration file of your system:

export OPENAI_API_KEY="<your_API_key>"
export TOGETHER_API_KEY="<your_API_key>"
export ANTHROPIC_API_KEY="<your_API_key>"
export OPENROUTER_API_KEY="<your_API_key>"

Note: Any API key you do not intend to use may simply be set to an empty or invalid string.

💫 Contributing

We welcome contributions from the community. You may contribute by:

  • Adding a scenario:

    Create a new scenario in the scenarios directory. Look at the existing scenarios as examples of what needs to be included.
    Add the scenario to the scenarios list in src/scenarios/__init__.py.
    Open a pull request to integrate your scenario into the main branch.

  • Adding a new framework:

    Create a new environment in the env directory. Look at the existing environments as examples of what needs to be included.
    Add the environment to the envs list in src/env/__init__.py.
    Open a pull request to integrate your environment into the main branch.

  • Adding tests to a scenario:

    Open a pull request modifying the given scenario file to add further functionality tests or security exploits.

  • Raising issues or giving feedback:

    If you identify any issues or want to share feedback, you may either contact us directly or raise an issue on GitHub. We look forward to working with the community and are extremely thankful for any contributions!

Note: Before contributing code, please run pipenv run pre-commit install in the root once to set up the pre-commit hooks.
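
After installing the hooks, you can also run them manually over the whole repository before opening a pull request (standard pre-commit usage, not a BaxBench-specific command):

pipenv run pre-commit run --all-files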

👨🏻‍💻 Usage

Generating programs

To generate solutions to all scenarios in the scenarios list, run the following command:

pipenv run python src/main.py --models gpt-4o --mode generate --n_samples 10 --temperature 0.4

To restrict the generation to a subset of scenarios or environments, see the "Advanced" section below.

The programs and the generation logs will be saved in the directory results.

Testing generated programs

To test your generated solutions, run:

pipenv run python src/main.py --models gpt-4o --mode test --n_samples 10 --temperature 0.4

If you have generated solutions externally, e.g., using our Hugging Face dataset, make sure to place the generated solutions under the following path relative to the root of this repository:

results/<model_name>/<scenario_id>/<env_id>/temp<t>-<spec_type>-<prompt_type>/sample<s>/code

Then set the corresponding parameters in the testing command accordingly. See "Advanced" below or the argument list in main.py.
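
For illustration, a single externally generated sample could be placed as follows. This is a sketch only: the angle-bracket placeholders and the sample index are assumptions that must match the model, scenario, environment, temperature, specification, and prompt settings you pass to main.py, with the solution source stored in the code entry at the end of the path.

mkdir -p "results/gpt-4o/<scenario_id>/<env_id>/temp0.4-<spec_type>-<prompt_type>/sample0"
# copy your generated solution to .../sample0/code before running --mode test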

Evaluating and printing

To print your results as a table in your console, run:

pipenv run python src/main.py --models gpt-4o --mode evaluate --n_samples 10 --temperature 0.4

Advanced

Specific models/scenarios/frameworks/samples can be generated, tested, or evaluated by specifying the following arguments in the CLI:

--models
--scenarios
--envs
--only_samples
--safety_prompt
--spec_type

Each of these arguments takes values separated by spaces.
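
A hypothetical combined invocation (the scenario and environment identifiers below are placeholders; valid values are the entries of the scenarios and envs lists in src/scenarios/__init__.py and src/env/__init__.py):

pipenv run python src/main.py --models gpt-4o --mode test --n_samples 10 --temperature 0.4 --scenarios <scenario_id_1> <scenario_id_2> --envs <env_id>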

✍️ Citation

If you find our work helpful, please use the following citation.

@article{vero2025baxbenchllmsgeneratecorrect,
        title={BaxBench: Can LLMs Generate Correct and Secure Backends?}, 
        author={Mark Vero and Niels Mündler and Victor Chibotaru and Veselin Raychev and Maximilian Baader and Nikola Jovanović and Jingxuan He and Martin Vechev},
        year={2025},
        eprint={2502.11844},
        archivePrefix={arXiv},
}

📝 License

This project is licensed under the MIT License. See LICENSE.
