Living off the Analyst: Harvesting Features from Yara Rules for Malware Detection

https://arxiv.org/abs/2411.18516

Notes:

The pipeline stages run in different environments because of versioning issues and differing package managers.
As a result, the code is somewhat disjointed but still straightforward.
The ML portion of the codebase uses a Conda environment and the Yara portion uses a venv.
- Yara: pip install -r requirementsYara.txt
- ML: conda create --name ML --file requirementsConda.txt
Dependencies in the ML environment are VERY finicky.
- As a result, the Ember dependency itself is broken for us.
- We have begun loading its .dat files without using Ember, but this still does not fully work.
- It is up to you to get this set up properly.
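
For readers attempting the Ember-free route mentioned above, here is a minimal sketch of reading vectorized .dat files with NumPy alone. It is not the code used in this repository. It assumes the files are raw float32 arrays (as written by Ember's vectorization step), that the feature dimension is 2381 (Ember feature version 2), and that unlabeled samples carry the label -1. The function names and paths are hypothetical.

```python
import numpy as np

# Assumption: 2381 is the feature dimension of Ember feature version 2.
# Change this if your .dat files were vectorized with a different version.
EMBER_V2_DIM = 2381


def load_vectorized(x_path, y_path, ndim=EMBER_V2_DIM):
    """Memory-map a feature/label pair of .dat files without importing ember."""
    # Labels are a flat float32 array; its length gives the number of samples.
    y = np.memmap(y_path, dtype=np.float32, mode="r")
    n_rows = y.shape[0]
    # Features are a flat float32 array that we view as (n_rows, ndim).
    X = np.memmap(x_path, dtype=np.float32, mode="r", shape=(n_rows, ndim))
    return X, y


def drop_unlabeled(X, y):
    """Keep only rows labeled 0 (benign) or 1 (malicious); -1 means unlabeled."""
    mask = y != -1
    return X[mask], y[mask]
```

A call might look like `X, y = load_vectorized("X_train.dat", "y_train.dat")` followed by `drop_unlabeled(X, y)`; the file names here are the ones Ember conventionally writes and may differ in your setup.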

How to Run:

The pipeline is defined in the section below. How to run the code is given here:

Create and activate pip venv.
python3 main.py extract <path-to-yara-rules-to-harvest-from>
- Dumps harvested rules to data/yar/rules.yar.
Run the rules over malware with Yara.
python3 main.py matrix --yara-outfile <your-yara-output-file-path>
- Dumps feature matrices to the following paths:
  - data/npz/all_matrix.npz: Unsplit matrix.
  - data/npz/train_matrix.npz
  - data/npz/train_labels.npz
  - data/npz/test_matrix.npz
  - data/npz/test_labels.npz
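
As a rough illustration of what the matrix step does with the Yara output file, the sketch below turns lines of the default Yara command-line output (one "rule_name file_path" pair per match) into a binary sample-by-rule sparse matrix. This is an assumption-laden stand-in, not the repository's lib.matrix code; the function name is hypothetical.

```python
from scipy.sparse import csr_matrix


def yara_output_to_sparse(lines):
    """Build a binary (samples x rules) CSR matrix from Yara match lines."""
    sample_index = {}  # file path -> row number, in order of first appearance
    rule_index = {}    # rule name -> column number, in order of first appearance
    rows, cols = [], []
    for line in lines:
        parts = line.strip().split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        rule, sample = parts
        rows.append(sample_index.setdefault(sample, len(sample_index)))
        cols.append(rule_index.setdefault(rule, len(rule_index)))
    data = [1] * len(rows)
    shape = (len(sample_index), len(rule_index))
    matrix = csr_matrix((data, (rows, cols)), shape=shape, dtype="uint8")
    # Duplicate (sample, rule) pairs are summed on construction; reset to 1.
    matrix.data[:] = 1
    return matrix, list(sample_index), list(rule_index)
```

Note that samples matching no rule never appear in the Yara output, so a real pipeline would need to add all-zero rows for them before aligning the matrix with labels.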
Create and activate conda env.
python3 lasso_grids.py
- Uses lasso-penalized logistic regression over a grid of lambda values to select varying numbers of YARA rules as features.
- We propose methods for selection which condition on the existing EMBER features to find YARA rules which add additional value.
python3 xgb_models.py
- Evaluates classification accuracy on EMBER 2018 across varying numbers of YARA rules. The rules used are defined by the grids computed in lasso_grids.py.
python3 compare_models.py
- Assesses how other models (LGBM and RF) compare with XGBoost.
Run all of the plot_*.py files.
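
To make the lasso grid step more concrete, here is a minimal sketch of L1-penalized logistic regression fitted over a grid of lambda values, recording which feature columns keep a nonzero coefficient at each value. It uses scikit-learn and treats C as 1/lambda, which is scikit-learn's parameterization and may differ from the scaling used in lasso_grids.py. It does not implement the conditioning on EMBER features described above. The function name is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def lasso_selection_grid(X, y, lambdas):
    """Map each lambda to the column indices with nonzero L1 coefficients."""
    selected = {}
    for lam in lambdas:
        # scikit-learn's C is the inverse regularization strength, so a
        # larger lambda (smaller C) drives more coefficients to exactly zero.
        model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / lam)
        model.fit(X, y)
        selected[lam] = np.flatnonzero(model.coef_[0]).tolist()
    return selected
```

Sweeping lambda from small to large generally yields progressively smaller rule subsets, which is how a grid like this can produce "varying numbers" of selected rules.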

The Pipeline:

lib.extract.strings_as_yara_rules
- Extract strings from existing Yara rules and create a new ruleset.
Your dataset
- Run generated Yara rules over malware samples.
lib.matrix.to_sparse -> lib.matrix.subset_split
ml/lasso_grids.py
- Generate lasso grids for your Yara output and the Ember data.
- Required for the subsequent steps.
ml/xgb_models.py
ml/compare_models.py
ml/plot_*.py
- Figure generation for each point.
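
The first pipeline step, harvesting strings from existing rules, can be sketched as follows. This is a simplified illustration and not the implementation of lib.extract.strings_as_yara_rules: it assumes only double-quoted text strings are harvested (hex and regex strings are ignored), drops modifiers such as nocase or wide, and uses hypothetical function and rule names.

```python
import re

# Matches string definitions such as:  $a = "cmd.exe"
# The captured group is the text between the quotes, escapes left as written.
STRING_DEF = re.compile(r'\$\w*\s*=\s*"((?:[^"\\\n]|\\.)*)"')


def harvest_text_strings(rule_source):
    """Return the unique, non-empty text strings found in Yara rule source."""
    seen = {}
    for match in STRING_DEF.finditer(rule_source):
        text = match.group(1)
        if text:
            seen.setdefault(text, None)  # dict keeps first-seen order
    return list(seen)


def strings_to_rules(strings, prefix="harvested"):
    """Emit one single-string Yara rule per harvested string."""
    rules = []
    for i, text in enumerate(strings):
        rules.append(
            f"rule {prefix}_{i}\n"
            "{\n"
            "    strings:\n"
            f'        $s = "{text}"\n'
            "    condition:\n"
            "        $s\n"
            "}\n"
        )
    return "\n".join(rules)
```

Emitting one rule per string means each harvested string becomes its own binary feature once the ruleset is run over the samples.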

Cite us!

@misc{gupta2024livinganalystharvestingfeatures,
      title={Living off the Analyst: Harvesting Features from Yara Rules for Malware Detection}, 
      author={Siddhant Gupta and Fred Lu and Andrew Barlow and Edward Raff and Francis Ferraro and Cynthia Matuszek and Charles Nicholas and James Holt},
      year={2024},
      eprint={2411.18516},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2411.18516}, 
}

FutureComputing4AI/LivingOffTheAnalyst
