Description
I set up an experiment using 731 executables from a base Ubuntu 22 system (the sample set).
For ground-truth function start data, I retrieved debug information for the executables in the sample set using debuginfod (from https://debuginfod.ubuntu.com; 450 executables had debug info available). For the files that had debug info, I extracted the function starts.
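For anyone reproducing the retrieval step, here is a minimal sketch of how a debuginfod request URL can be built from an ELF build ID. It relies on the standard debuginfod web API path `/buildid/<BUILDID>/debuginfo`. The helper name `debuginfo_url` is hypothetical, and reading the build ID out of each executable is not shown.

```python
import re


def debuginfo_url(server: str, build_id: str) -> str:
    """Build the debuginfod URL for the separate debug info of one build ID.

    `server` is the debuginfod base URL and `build_id` is the hex string
    stored in the executable's .note.gnu.build-id note.
    """
    build_id = build_id.strip().lower()
    # A build ID is a byte string, so it must be an even number of hex digits.
    if not re.fullmatch(r"[0-9a-f]+", build_id) or len(build_id) % 2:
        raise ValueError(f"not a hex build ID: {build_id!r}")
    return f"{server.rstrip('/')}/buildid/{build_id}/debuginfo"
```

A server that has no debug info for a given build ID is expected to answer with an HTTP 404, which would account for executables without ground truth.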
To validate the ground-truth data, I compared the 29,481 function starts found in the debug info to those found in the .eh_frame section of the original executables. I limited the function starts to functions whose size is larger than 0 according to the debug information. All but 3 function starts were found in the .eh_frame sections.
Based on the bap project's Ubuntu noble Dockerfile, I installed bap and its dependencies, then added the bap-byteweight-frontend package to the base bap system. For each executable, I ran the following command to find function starts:
.opam/4.14/bin/bap-byteweight find <exe path>
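A rough sketch of how the command can be driven over the sample set and its output collected into a set of addresses. The output format of `bap-byteweight find` is not reproduced in this report, so the parser below assumes one function start per line with a hexadecimal address as the first field. That is an assumption, and the hypothetical helpers `parse_starts` and `find_starts` would need adjusting if the real output differs.

```python
import subprocess
from typing import Sequence, Set


def parse_starts(output: str) -> Set[int]:
    """Collect addresses from tool output.

    ASSUMPTION: each relevant line begins with a hex address (with or
    without a 0x prefix). Lines that do not fit are skipped.
    """
    starts: Set[int] = set()
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        try:
            starts.add(int(fields[0], 16))
        except ValueError:
            continue  # not an address line
    return starts


def find_starts(command: Sequence[str], exe_path: str) -> Set[int]:
    """Run `command exe_path` and parse the function starts it prints."""
    result = subprocess.run(
        [*command, exe_path], capture_output=True, text=True, check=True
    )
    return parse_starts(result.stdout)
```

With the command from this report, the call would look like `find_starts([".opam/4.14/bin/bap-byteweight", "find"], exe_path)`.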
I compared the results to my ground-truth data, and ByteWeight recovered far fewer function starts:
- Total functions according to debug data: 29,481
- Functions found in .eh_frame: 29,478
- Functions found by ByteWeight: 1,401
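The counts above do not say how many of the starts reported by ByteWeight actually coincide with ground-truth starts, so a per-address comparison is more informative than totals. A minimal sketch of that comparison, assuming both sides are available as sets of integer addresses (the function name `score_starts` and the result keys are hypothetical):

```python
from typing import Dict, Set


def score_starts(truth: Set[int], found: Set[int]) -> Dict[str, float]:
    """Compare reported function starts against ground-truth starts."""
    true_pos = len(truth & found)  # starts present in both sets
    return {
        "true_positives": true_pos,
        "missed": len(truth - found),           # in ground truth only
        "false_positives": len(found - truth),  # reported only
        "recall": true_pos / len(truth) if truth else 0.0,
        "precision": true_pos / len(found) if found else 0.0,
    }
```

Recall answers "how many real function starts were recovered", while precision answers "how many reported starts were real". Both are needed to judge a signature set.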
I used the default function signatures that are shipped with bap, which appear to come from:
and have the structure:
.
├── Ghidra
│   ├── LICENSE
│   ├── NOTICE
│   └── Processors
...
│       └── x86
│           └── data
│               └── patterns
│                   ├── patternconstraints.xml
│                   ├── x86-16_default_patterns.xml
│                   ├── x86-64gcc_patterns.xml
│                   ├── x86-64win_patterns.xml
│                   ├── x86delphi_patterns.xml
│                   ├── x86gcc_patterns.xml
│                   └── x86win_patterns.xml
├── sigs
...
│   └── x86_64
│       ├── clang
│       │   └── bytes
│       ├── default
│       │   └── bytes
│       ├── gcc
│       │   └── bytes
│       ├── gcc-8
│       │   └── bytes
│       └── icc
│           └── bytes
...
61 directories, 47 files
Naturally the results have raised questions for me. I am trying to understand the following:
- What explains the low function start recovery rate by bap-byteweight?
  - Do I need to specify which signatures to use?
- How can I generate new byteweight signatures? I see that there is a "train" command for bap-byteweight. What is the format of the set of files? Must they have debug information in them?
  - If I want to improve recovery rate, what should I use as the training set?
  - How do I instruct bap-byteweight to use different signatures?
- There are many subdirectories in the bap signatures package. They are categorized by architecture and compiler. When the bap-byteweight "find" command is run, which signatures are used?
  - An unknown binary would not have the compiler embedded in it (in general).