Trouble using ByteWeight #1634

@infinitesteps

I ran an experiment on 731 executables from the base Ubuntu 22 system (the sample set).

For ground-truth function start data, I retrieved debug information for the executables in the sample set using debuginfod (https://debuginfod.ubuntu.com); debug info was available for 450 of them. For those files, I extracted the function starts.

To validate my ground-truth data, I compared the 29,481 function starts found in the debug info to those found in the .eh_frame section of the original executables. I limited the function starts to functions whose size is greater than 0 according to the debug information. All but 3 function starts were found in the .eh_frame sections.
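For readers who want to reproduce this kind of validation, here is a minimal sketch of the set comparison. It assumes both sources have already been reduced to sets of integer addresses in the same address space. The function name `compare_starts` and the sample addresses are hypothetical and are not taken from my actual tooling.

```python
def compare_starts(ground_truth: set[int], candidate: set[int]) -> dict[str, set[int]]:
    """Compare two sets of function start addresses for one executable.

    ASSUMPTION: both sets use the same address space (for example, both are
    virtual addresses as recorded in the ELF file).
    """
    return {
        "matched": ground_truth & candidate,  # starts present in both sources
        "missed": ground_truth - candidate,   # in ground truth only
        "extra": candidate - ground_truth,    # in the candidate only
    }


# Hypothetical addresses, for illustration only.
debug_info_starts = {0x1000, 0x1040, 0x1080}
eh_frame_starts = {0x1000, 0x1040, 0x2000}

report = compare_starts(debug_info_starts, eh_frame_starts)
print(sorted(hex(a) for a in report["missed"]))
```

Summing the sizes of the "matched" and "missed" sets over all executables gives aggregate counts of the kind reported in this issue.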

Based on the bap project's Ubuntu noble Dockerfile, I installed bap and its dependencies, then added the bap-byteweight-frontend package to the base bap system. For each executable, I ran the following command to find function starts:

.opam/4.14/bin/bap-byteweight find <exe path>
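The per-executable run can be scripted roughly as follows. This is a sketch, not my exact harness. It uses only the `find` invocation shown above. The output format is an assumption: `parse_starts` expects one hexadecimal address per line and skips anything else, so adjust it to whatever the tool actually prints. The names `parse_starts` and `find_starts` are hypothetical.

```python
import subprocess
from pathlib import Path

# Binary path copied from the command above; adjust for your installation.
BYTEWEIGHT_BIN = ".opam/4.14/bin/bap-byteweight"


def parse_starts(output: str) -> set[int]:
    """Parse function starts from the tool's standard output.

    ASSUMPTION: each relevant line holds a single hexadecimal address,
    with or without a leading "0x". Lines that do not parse are skipped.
    """
    starts = set()
    for line in output.splitlines():
        token = line.strip()
        if not token:
            continue
        try:
            starts.add(int(token, 16))
        except ValueError:
            continue  # not an address under the assumed format
    return starts


def find_starts(exe: Path) -> set[int]:
    """Run `bap-byteweight find <exe path>` and return the parsed starts."""
    result = subprocess.run(
        [BYTEWEIGHT_BIN, "find", str(exe)],
        capture_output=True,
        text=True,
        check=True,  # raise instead of silently treating a failure as "no functions"
    )
    return parse_starts(result.stdout)
```

Raising on a non-zero exit status keeps a crashed run from being counted as an executable with zero recovered functions.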

I compared the results to my ground-truth data, and ByteWeight recovered only a small fraction of the known function starts.

  • Total functions according to debug data: 29,481
  • Functions found in .eh_frame: 29,478
  • Functions found by ByteWeight: 1,401
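To put the counts above in proportion, the short calculation below turns them into recovery rates. The function name `recovery_rate` is just for illustration. Note the assumption in the comments: if the 1,401 figure counts every start ByteWeight reported rather than only those matching the ground truth, the second ratio is an upper bound on recall.

```python
def recovery_rate(found: int, total: int) -> float:
    """Fraction of ground-truth function starts that were recovered."""
    if total <= 0:
        raise ValueError("total must be positive")
    return found / total


TOTAL_DEBUG = 29_481      # function starts according to debug data
EH_FRAME_FOUND = 29_478   # of those, found in .eh_frame
BYTEWEIGHT_FOUND = 1_401  # found by ByteWeight (ASSUMPTION: all are true matches)

print(f".eh_frame:  {recovery_rate(EH_FRAME_FOUND, TOTAL_DEBUG):.2%}")    # 99.99%
print(f"ByteWeight: {recovery_rate(BYTEWEIGHT_FOUND, TOTAL_DEBUG):.2%}")  # 4.75%
```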

I used the default function signatures that ship with bap, which appear to come from:

and have the structure:

.
├── Ghidra
│   ├── LICENSE
│   ├── NOTICE
│   └── Processors
...
│       └── x86
│           └── data
│               └── patterns
│                   ├── patternconstraints.xml
│                   ├── x86-16_default_patterns.xml
│                   ├── x86-64gcc_patterns.xml
│                   ├── x86-64win_patterns.xml
│                   ├── x86delphi_patterns.xml
│                   ├── x86gcc_patterns.xml
│                   └── x86win_patterns.xml
├── sigs
...
│   └── x86_64
│       ├── clang
│       │   └── bytes
│       ├── default
│       │   └── bytes
│       ├── gcc
│       │   └── bytes
│       ├── gcc-8
│       │   └── bytes
│       └── icc
│           └── bytes
...

61 directories, 47 files

Naturally the results have raised questions for me. I am trying to understand the following:

  • What explains the low function start recovery rate by bap-byteweight?

    • Do I need to specify which signatures to use?
  • How can I generate new byteweight signatures? I see that there is a "train" command for bap-byteweight. What is the format of the set of files? Must they have debug information in them?

    • If I want to improve recovery rate, what should I use as the training set?
    • How do I instruct bap-byteweight to use different signatures?
  • There are many subdirectories in the bap signatures package. They are categorized by architecture and compiler. When the bap-byteweight "find" command is run, which signatures are used?

    • In general, an unknown binary does not record which compiler produced it.
