https://arxiv.org/abs/2411.18516
- The Yara and ML portions of the pipeline run in separate environments because of versioning issues and differing package managers.
- As a result, the code is somewhat disjointed but still straightforward.
- The ML portion of the codebase uses a Conda environment; the Yara portion uses a venv.
- Yara: `pip install -r requirementsYara.txt`
- ML: `conda create --name ML --file requirementsConda.txt`
 
- Dependencies in the ML environment are VERY finicky.
- As a result, the Ember dependency itself is broken for us.
- We have begun loading its .dat files without using Ember, but this still does not fully work.
- It is up to you to get this set up properly.
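
One possible way to read the Ember data without the Ember package is to memory-map the vectorized feature files directly with NumPy. The sketch below assumes the layout produced by the Ember project's vectorization step (`X_<subset>.dat` and `y_<subset>.dat` stored as raw float32 arrays, 2381 features per sample for feature version 2, label -1 for unlabeled samples). The names `load_ember_dat` and `labeled_only` are hypothetical and are not part of this repository.

```python
import os
import numpy as np

def load_ember_dat(data_dir, subset="train", n_features=2381):
    """Memory-map Ember's vectorized features without importing ember.

    Assumes X_<subset>.dat / y_<subset>.dat are raw float32 arrays;
    n_features=2381 is the Ember feature-version-2 dimension (assumption).
    """
    x_path = os.path.join(data_dir, f"X_{subset}.dat")
    y_path = os.path.join(data_dir, f"y_{subset}.dat")

    # Labels are a flat float32 vector; its length gives the sample count.
    y = np.memmap(y_path, dtype=np.float32, mode="r")
    n_samples = y.shape[0]

    # Features are stored row-major as (n_samples, n_features).
    X = np.memmap(x_path, dtype=np.float32, mode="r",
                  shape=(n_samples, n_features))
    return X, y

def labeled_only(X, y):
    """Drop unlabeled rows (assumed to carry the label -1)."""
    mask = y != -1
    return X[mask], y[mask]
```

If your `.dat` files were vectorized with a different feature version, pass the matching `n_features` value; a wrong value makes the reshape fail or silently misalign rows.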
 
The pipeline is defined in the section below. How to run the code is given here:
- Create and activate pip venv.
- `python3 main.py extract <path-to-yara-rules-to-harvest-from>`: Dumps harvested rules to `data/yar/rules.yar`.
 
- Run the rules over malware with Yara.
- `python3 main.py matrix --yara-outfile <your-yara-output-file-path>`: Dumps feature matrices to the following paths:
- data/npz/all_matrix.npz: Unsplit matrix.
- data/npz/train_matrix.npz
- data/npz/train_labels.npz
- data/npz/test_matrix.npz
- data/npz/test_labels.npz
 
 
- Create and activate conda env.
- `python3 lasso_grids.py`: Uses lasso-penalized logistic regression over a grid of lambda values to select varying numbers of YARA rules as features.
- We propose selection methods that condition on the existing EMBER features to find YARA rules that add value beyond them.
 
- `python3 xgb_models.py`: Measures classification accuracy on EMBER 2018 across varying numbers of YARA rules. The rules used are those selected by the grids computed in `lasso_grids.py`.
 
- `python3 compare_models.py`: Assesses how other models (LGBM and RF) compare with XGBoost.
 
- All of the `plot_*.py` files.
- `lib.extract.strings_as_yara_rules`: Extracts strings from existing Yara rules and creates a new ruleset.
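
To illustrate the idea of harvesting strings from existing rules, here is a minimal sketch. It is not the repository's implementation of `lib.extract.strings_as_yara_rules`: the function names, the `harvested_<n>` rule naming, and the choice to handle only quoted text strings (dropping hex strings, regex strings, and modifiers such as `nocase`) are all assumptions made for brevity.

```python
import re

# Matches text strings such as:  $name = "value"
# Hex strings ({ ... }) and regex strings (/.../) are intentionally skipped.
TEXT_STRING = re.compile(r'\$\w*\s*=\s*"((?:[^"\\\n]|\\.)*)"')

def extract_text_strings(rule_source):
    """Return the unique text strings in YARA source, in first-seen order."""
    seen = {}
    for match in TEXT_STRING.finditer(rule_source):
        seen.setdefault(match.group(1), None)
    return list(seen)

def strings_to_ruleset(strings, prefix="harvested"):
    """Emit one single-string rule per harvested string (hypothetical naming)."""
    rules = []
    for i, s in enumerate(strings):
        rules.append(
            f"rule {prefix}_{i}\n"
            "{\n"
            "    strings:\n"
            f'        $s = "{s}"\n'
            "    condition:\n"
            "        $s\n"
            "}\n"
        )
    return "\n".join(rules)
```

Each harvested string becomes its own rule, so a later match on that rule can serve as one binary feature for a sample.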
 
- Your dataset
- Run generated Yara rules over malware samples.
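
As a rough sketch of this step, the generated ruleset can be run with the `yara` command-line tool and its output parsed into per-file matches. This assumes yara's default output format of one `<rule_name> <file_path>` pair per line. The exact format expected by `--yara-outfile` is not stated here, so treat that as an assumption, and note that `run_yara` and `parse_yara_output` are hypothetical helper names.

```python
import subprocess
from collections import defaultdict

def run_yara(rules_path, target_dir):
    """Run the yara CLI recursively over target_dir and return its stdout."""
    result = subprocess.run(
        ["yara", "-r", rules_path, target_dir],
        capture_output=True, text=True, check=False,
    )
    return result.stdout

def parse_yara_output(text):
    """Map each scanned file to the set of rule names that matched it.

    Assumes yara's default output: one "<rule_name> <file_path>" per line.
    """
    matches = defaultdict(set)
    for line in text.splitlines():
        # Rule names contain no spaces; the path may, so split only once.
        parts = line.strip().split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        rule_name, file_path = parts
        matches[file_path].add(rule_name)
    return dict(matches)
```

Writing the raw output of `run_yara` to a file gives you something to inspect before building feature matrices from it.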
 
- lib.matrix.to_sparse -> lib.matrix.subset_split
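
The following sketch shows one way a sparse rule-match matrix could be built and split. It is not the code in `lib.matrix`. It assumes the Yara results are available as a mapping from file path to the set of matched rule names, and it uses a random row split as a stand-in, whereas the repository's `subset_split` may instead follow the EMBER train/test partition. All function names here are hypothetical.

```python
import numpy as np
from scipy import sparse

def to_sparse_matrix(matches, rule_names):
    """Build a binary (n_files x n_rules) CSR matrix from file -> rules."""
    files = sorted(matches)
    col_of = {name: j for j, name in enumerate(rule_names)}
    rows, cols = [], []
    for i, path in enumerate(files):
        for name in matches[path]:
            if name in col_of:
                rows.append(i)
                cols.append(col_of[name])
    rows = np.array(rows, dtype=np.int64)
    cols = np.array(cols, dtype=np.int64)
    data = np.ones(len(rows), dtype=np.uint8)
    X = sparse.csr_matrix((data, (rows, cols)),
                          shape=(len(files), len(rule_names)))
    return files, X

def split_rows(X, y, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[0])
    n_test = int(round(test_fraction * X.shape[0]))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```

A sparse format suits this data because each sample typically matches only a small fraction of the harvested rules.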
- `ml/lasso_grids.py`: Generates lasso grids for your Yara output and the Ember data.
- Required for the subsequent steps.
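
For intuition, a lasso grid can be sketched with scikit-learn as follows. This is a plain L1-penalized logistic regression on synthetic data, not the code in `ml/lasso_grids.py`, and it does not reproduce the conditioning on EMBER features. The function name `lasso_grid`, the lambda values, and the synthetic matrix are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lasso_grid(X, y, lambdas):
    """Fit L1-penalized logistic regression per lambda; return selected columns."""
    selected = {}
    for lam in lambdas:
        # scikit-learn parameterizes the penalty strength as C = 1 / lambda.
        model = LogisticRegression(penalty="l1", solver="liblinear",
                                   C=1.0 / lam)
        model.fit(X, y)
        # A feature counts as selected when its coefficient is nonzero.
        selected[lam] = np.flatnonzero(model.coef_[0])
    return selected

# Synthetic stand-in for a rule-match matrix: only the first 3 columns matter.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 30))
y_demo = (X_demo[:, :3].sum(axis=1) >= 2).astype(int)

grid = lasso_grid(X_demo, y_demo, lambdas=[100.0, 1.0])
for lam, cols in grid.items():
    print(f"lambda={lam}: {len(cols)} features selected")
```

Larger lambda values push more coefficients to exactly zero, so sweeping lambda yields feature subsets of varying size, which is what lets the later steps evaluate accuracy against the number of rules used.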
 
- ml/xgb_models.py
- ml/compare_models.py
- `ml/plot_*.py`: Figure generation for each point.
 
@misc{gupta2024livinganalystharvestingfeatures,
      title={Living off the Analyst: Harvesting Features from Yara Rules for Malware Detection}, 
      author={Siddhant Gupta and Fred Lu and Andrew Barlow and Edward Raff and Francis Ferraro and Cynthia Matuszek and Charles Nicholas and James Holt},
      year={2024},
      eprint={2411.18516},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2411.18516}, 
}