This repository is dedicated to our source code for our research paper titled Synthetic Malware Image Generation Based on Generative Models Against Zero-Day Attacks. We presented our research work at the Silicon Valley Cybersecurity Conference 2025.

arjunsudheer/synthetic-malware-generation-based-on-generative-models-against-zero-day-attacks

synthetic-malware-generation-based-on-generative-models-against-zero-day-attacks

Dataset

The dataset consists of 32 x 32 images in the monochrome, grayscale, RGB, and CMYK color spaces. We create datasets of 100, 250, 500, and 1000 samples. The source data comes from the v001 VirusShare and Malicia datasets.

The scripts we used to convert the malware binary data into image format are located in the following directory: src/binary_to_image_conversion
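
As a rough illustration of the idea behind the conversion, the sketch below maps raw bytes to a 32 x 32 grayscale pixel grid. It is not the code in src/binary_to_image_conversion: the function name bytes_to_grayscale_grid, the choice to truncate or zero-pad the input to 1,024 bytes, and the one-byte-per-pixel mapping are all assumptions made for this example.

```python
from typing import List

SIDE = 32  # image width and height used by the dataset


def bytes_to_grayscale_grid(data: bytes, side: int = SIDE) -> List[List[int]]:
    """Map raw bytes to a side x side grid of 0-255 grayscale values.

    Assumption: one byte per pixel, truncating long inputs and
    zero-padding short ones. The repository's scripts may differ.
    """
    needed = side * side
    # Keep the first `needed` bytes, then pad with zero bytes if too short.
    padded = data[:needed].ljust(needed, b"\x00")
    # Slice the flat byte string into rows of `side` pixels each.
    return [list(padded[row * side:(row + 1) * side]) for row in range(side)]


if __name__ == "__main__":
    grid = bytes_to_grayscale_grid(bytes(range(256)) * 2)
    print(len(grid), len(grid[0]), grid[0][:4])
```

The other color spaces would need a different number of bytes or bits per pixel, which this sketch does not attempt to cover.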

The original v001 VirusShare malware dataset can be found at this link: https://onedrive.live.com/?authkey=%21ADHc%2D6l1oKh4DjU&id=7DE80EDD88C2C9DE%21348129&cid=7DE80EDD88C2C9DE

The original Malicia malware dataset can be found at this link: https://mega.nz/file/LdFHyIqJ#H9zOeQL3z6m24t2Bz0PaQqlR026b74Ws3Uys4Kooc9c

Preparing Data

To run our scripts that convert the malware binaries into image format, follow the steps below. We highly recommend running them in a virtual machine (VM), since you will be handling real malware binaries and risk infecting your computer.

In your newly created Windows 11 Virtual Machine:

  1. Create a directory called datasets. This is where all your dataset files will be.
  2. Generate the benign dataset with the provided zip_benign_exe_files.ps1 PowerShell script. The script creates a zip file in the user's Downloads folder containing all the benign executable files on the user's C: drive. Some permissions may block the script from accessing certain executable files. This is fine as long as at least 4,000 benign executable files are collected, which is the minimum our implementation requires. To run the script, navigate to the directory that contains it and execute the following command: .\zip_benign_exe_files.ps1. Note that the Execution Policy on Windows client computers is set to Restricted by default, which does not allow scripts to run. Refer to Microsoft's documentation on how to update your system's Execution Policy if needed. Once the zip file is generated, unzip it in your datasets directory.
  3. Download the v001 VirusShare and Malicia datasets. Extract all the contents from both datasets, and obtain the CSV files for both datasets. The password for the v001 VirusShare dataset is "infected". Store the binary executable files from the v001 VirusShare dataset in a folder named v001, and the malware binaries from the Malicia dataset in a folder named binaries_RELEASE1.0. Store the malware data and the corresponding CSV files in a parent folder called malicious_data. Below is the file/folder hierarchy that the image conversion scripts expect.
- datasets
  - benign_data (see step 2)
    - All the benign Windows executables from your C: drive.
  - malicious_data (see step 3)
    - v001
      - All the malware binaries from the v001 VirusShare dataset.
    - binaries_RELEASE1.0
      - All the malware binaries from the Malicia dataset.
    - 00001.csv (csv file for the malware binaries from the v001 VirusShare dataset)
    - malicia_binaries.csv (csv file for the malware binaries from the Malicia dataset)
  4. Install the required Python packages using this command: pip3 install -r "dataset_requirements.txt". We recommend doing this step in a Python virtual environment, although it is not required.
  5. Run the image conversion process by running the provided dataset_process.sh bash script with this command: ./dataset_process.sh. Note that you may have to give the file executable permissions first. The script automatically converts the malware binaries into images, and then creates a zip file of the converted images for easy file sharing.
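
Before starting the conversion, it can help to confirm that the datasets directory matches the hierarchy listed above. The following is a minimal sketch of such a check, not part of the repository: the function name check_layout is hypothetical, and counting every regular file under benign_data (rather than only .exe files) is an assumption.

```python
from pathlib import Path
from typing import List

# Relative paths that the hierarchy above lists under the datasets directory.
EXPECTED_DIRS = [
    "benign_data",
    "malicious_data/v001",
    "malicious_data/binaries_RELEASE1.0",
]
EXPECTED_FILES = [
    "malicious_data/00001.csv",
    "malicious_data/malicia_binaries.csv",
]
MIN_BENIGN = 4000  # minimum number of benign executables mentioned in step 2


def check_layout(root: Path) -> List[str]:
    """Return a list of problems found; an empty list means the layout looks right."""
    problems = []
    for rel in EXPECTED_DIRS:
        if not (root / rel).is_dir():
            problems.append(f"missing directory: {rel}")
    for rel in EXPECTED_FILES:
        if not (root / rel).is_file():
            problems.append(f"missing file: {rel}")
    benign = root / "benign_data"
    if benign.is_dir():
        # Count regular files at any depth, since the unzipped archive may be nested.
        count = sum(1 for p in benign.rglob("*") if p.is_file())
        if count < MIN_BENIGN:
            problems.append(f"only {count} benign files, expected at least {MIN_BENIGN}")
    return problems


if __name__ == "__main__":
    for line in check_layout(Path("datasets")) or ["layout looks complete"]:
        print(line)
```

Run it from the directory that contains datasets; each printed line names one item that still needs attention.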

Models

We trained a Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) and a Diffusion model on our malware image dataset to compare how effectively each model generates high-quality synthetic malware image samples. We evaluate each model using a cosine similarity score, t-SNE visualizations, Random Forest and Multi-layer Perceptron multi-class classifiers, and Random Forest and Multi-layer Perceptron binary classifiers.

The Diffusion model scripts are located in the following directory: src/diffusion

The WGAN-GP model scripts are located in the following directory: src/wgan_gp
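
For readers unfamiliar with WGAN-GP, the standard critic objective from the original WGAN-GP formulation is reproduced below for reference. It is shown as background only: the symbols follow the usual convention (critic $D$, real distribution $\mathbb{P}_r$, generator distribution $\mathbb{P}_g$), and the penalty weight $\lambda$ used in src/wgan_gp is not stated in this README, so treat any specific value as an assumption.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\big[D(\tilde{x})\big]
    - \mathbb{E}_{x \sim \mathbb{P}_r}\big[D(x)\big]
    + \lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}
      \Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\,\tilde{x}, \quad \epsilon \sim U[0, 1]
```

The last term is the gradient penalty: it pushes the norm of the critic's gradient toward 1 at points interpolated between real and generated samples.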

The evaluation metrics scripts are located in the following directory: src/evaluation
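
To make the cosine similarity score concrete, here is a minimal sketch that compares two flattened images as plain vectors. It assumes nothing about how src/evaluation pairs real and synthetic samples or aggregates the scores; the function name cosine_similarity and the decision to raise an error on zero vectors are choices made for this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    real = [0.0, 128.0, 255.0, 64.0]       # hypothetical flattened real image
    synthetic = [0.0, 120.0, 250.0, 70.0]  # hypothetical flattened synthetic image
    print(round(cosine_similarity(real, synthetic), 4))
```

A score close to 1 means the two pixel vectors point in nearly the same direction, regardless of overall brightness.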

Training

Both the Diffusion and WGAN-GP models are implemented in PyTorch and can use an Nvidia GPU for accelerated training. Both models also run on a CPU alone if no Nvidia GPU is available, although training is much slower. To speed up the collection of evaluation metrics, we run up to 8 tasks in parallel, which may max out CPU, GPU, and/or system memory usage for a few minutes.
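
The cap of 8 parallel tasks can be pictured with a bounded worker pool. The sketch below is one possible way to do this with the Python standard library, not the repository's implementation: the helper name run_bounded, the use of threads rather than processes, and the placeholder metric tasks are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

MAX_PARALLEL_TASKS = 8  # cap taken from the paragraph above


def run_bounded(tasks: Dict[str, Callable[[], float]],
                max_workers: int = MAX_PARALLEL_TASKS) -> Dict[str, float]:
    """Run named zero-argument tasks with at most `max_workers` running at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in tasks.items()}
        # result() blocks until the task finishes and re-raises its exception, if any.
        return {name: fut.result() for name, fut in futures.items()}


if __name__ == "__main__":
    # Hypothetical stand-ins for real metric computations.
    demo = {f"metric_{i}": (lambda i=i: i * 0.5) for i in range(12)}
    print(run_bounded(demo))
```

Lowering max_workers trades a longer run time for lower peak CPU, GPU, and memory usage on smaller machines.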

If you are interested in training the Diffusion model or WGAN-GP model on your system, then you can follow the steps listed below:

  1. Install the required Python packages using this command: pip3 install -r "model_requirements.txt". We recommend doing this step in a Python virtual environment, although it is not required.
  2. To run the Diffusion model training and evaluation metrics collection process, you can execute the provided diffusion_process.sh bash script with this command: ./diffusion_process.sh. Note that you may have to provide executable permissions to the file first.
  3. To run the WGAN-GP model training and evaluation metrics collection process, you can execute the provided wgan_gp_process.sh bash script with this command: ./wgan_gp_process.sh. Note that you may have to provide executable permissions to the file first.
