MAGIC Genomic Features

The MAGIC Genomic Features provide low-level details of analyzed malware binaries and are extracted using cutting-edge program analysis techniques.

API Endpoints

GET /show/binary/(api_key)/(sha1)

Retrieve the genomic features for the binary with SHA1 of (sha1). See Genomic Features for a description of the genomic features.

https://api.magic.cythereal.com/docs#!/show/Show_Binary

GET /show/binary/(api_key)/(sha1)/(rva)

Retrieve the genomic features for the procedure at address (rva) in binary with SHA1 of (sha1). See Genomic Features for a description of the genomic features.

https://api.magic.cythereal.com/docs#!/show/Show_Proc
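
As a minimal sketch of how the two endpoints above might be called from Python, the helper below builds the request URL and fetches the response with the standard library. The base URL is an assumption inferred from the documentation links, and the names show_url and fetch_features are hypothetical. Whether the response body is the bare feature dictionary or is wrapped in an envelope is not stated here, so inspect the returned JSON before relying on its shape.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the API is served from the same host as the documentation links.
BASE_URL = "https://api.magic.cythereal.com"


def show_url(api_key, sha1, rva=None):
    """Build the URL for /show/binary/(api_key)/(sha1)[/(rva)]."""
    parts = ["show", "binary", api_key, sha1]
    if rva is not None:
        # Procedure-level endpoint: append the function's RVA.
        parts.append(rva)
    return BASE_URL + "/" + "/".join(quote(p, safe="") for p in parts)


def fetch_features(api_key, sha1, rva=None, timeout=30):
    """Issue the GET request and decode the JSON response body."""
    with urlopen(show_url(api_key, sha1, rva), timeout=timeout) as resp:
        return json.load(resp)
```

Calling fetch_features(api_key, sha1) targets the binary-level endpoint; passing rva as well targets the procedure-level endpoint.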

CLI Commands

Usage: vbclient -a show SHA1 ...

Options:
  -o OUTDIR, --outdir=OUTDIR
                        Directory to save downloaded files. Default is:
                        ./Results

The -a show option for vbclient will create a file in OUTDIR containing the JSON representation of the binary with SHA1 equal to SHA1. See Genomic Features for a description of this JSON document.

Genomic Features

Warning

Any fields not documented here should be considered deprecated. Such fields may be removed in the future without warning.

A binary is represented as a set of functions, and each function contains a set of genomic features that characterize it. The genomic feature listing for a binary is simply a dictionary that maps each function to its associated features. An abridged example is below.

{
    "0x1000": {...},
    "0x109f": {...},
    ...
}

Each key in the dictionary is the RVA of a function in the query binary. The values in this dictionary are the genomic features for each function.
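
Because the keys are hexadecimal strings, sorting them as text does not give address order. The sketch below, which assumes the listing has already been parsed into a Python dict, orders the functions by numeric RVA. The function name and the sample addresses are hypothetical.

```python
def functions_by_address(binary_features):
    """Return (rva, features) pairs ordered by numeric address."""
    # int(key, 16) accepts the "0x" prefix used by the RVA keys.
    return sorted(binary_features.items(), key=lambda item: int(item[0], 16))


# Hypothetical sample: a plain string sort would place "0x20000" before "0x9000".
sample = {"0x20000": {}, "0x9000": {}, "0x1000": {}}
ordered_rvas = [rva for rva, _ in functions_by_address(sample)]
```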

An example genomic feature dictionary for a function is given below:

{
    "_id": "2d9035b27ce5c1cce2fb8432c76e315a1a2a8de0/0x113a0",
    "binary_id": "2d9035b27ce5c1cce2fb8432c76e315a1a2a8de0",
    "hardHash": "9536f45b50274da96806b2aad09119ef",
    "startRVA": "0x113a0",
    "isLibrary": false,
    "isThunk": false,
    "peSegment": "_text",
    "procName": "sub_4113A0",
    "api_calls": [
        "VariantInit",
        "VariantCopy",
        ...
    ],
    "code": [
        [
            "push(ebp)",
            "mov(ebp,esp)",
            "push(ebx)",
            ...
        ],
        ...
    ],
    "code_size": 187,
    "gen_code": [
        [
            "push(A)",
            "mov(A,B)",
            "push(C)",
            ...
        ],
        ...
    ],
    "gen_code_size": 187,
    "semantics": [
        [
            "eax=A",
            "ebp=C",
            "esp=B",
            ...
        ],
        ...
    ],
    "semantics_size": 149,
    "gen_semantics": [
        [
            "A=R",
            "B=Q",
            "G=P",
            ...
        ],
        ...
    ],
    "gen_semantics_size": 149
}

The first few fields, _id, binary_id, and hardHash, provide identifiers for this procedure. The next few fields, startRVA, isLibrary, isThunk, peSegment, and procName, provide contextual information about the function. The remaining fields provide the genomic features.

Identification Fields

_id
The ID of the function. It takes the form “binary_id/startRVA”.
binary_id
The SHA1 of the binary this function is in.
hardHash
This is a special hash created by MAGIC to identify semantically equivalent functions. If two functions have the same hardHash, then the semantics of both functions are the same, i.e. the functions can be considered functionally equivalent.
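
One way to use hardHash is to group functions that share a value. The sketch below assumes a parsed feature dictionary of the shape described in this section; the function name and the sample hash values are hypothetical.

```python
from collections import defaultdict


def group_by_hard_hash(binary_features):
    """Map each hardHash shared by two or more functions to their IDs."""
    groups = defaultdict(list)
    for rva, feats in binary_features.items():
        hard_hash = feats.get("hardHash")
        if hard_hash is not None:
            # Fall back to the RVA key if the _id field is absent.
            groups[hard_hash].append(feats.get("_id", rva))
    # Keep only hashes that identify more than one function.
    return {h: ids for h, ids in groups.items() if len(ids) > 1}


# Hypothetical sample data, not taken from a real binary.
sample = {
    "0x1000": {"_id": "abc/0x1000", "hardHash": "h1"},
    "0x2000": {"_id": "abc/0x2000", "hardHash": "h2"},
    "0x3000": {"_id": "abc/0x3000", "hardHash": "h1"},
}
equivalent = group_by_hard_hash(sample)
```

The same grouping can be applied across the merged listings of several binaries, since _id includes the SHA1 of the containing binary.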

Contextual Fields

startRVA
The relative virtual address of the function.
isLibrary
Indicates if this function was identified as a library function. Should be either true or false.
isThunk
Indicates if this function was identified as a thunk (jump) function. Should be either true or false.
peSegment
The PE segment in which the function is located.
procName
The name of the procedure. If debug symbols were not present in the original file, this will be of the form sub_xxxxxx where xxxxxx is the virtual address of the function.
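
The contextual fields lend themselves to simple filtering. The sketch below keeps only functions not flagged as library or thunk functions, and tests whether a procName has the generated sub_xxxxxx form. The function names are hypothetical, and treating a missing flag as false is an assumption.

```python
import re


def user_functions(binary_features):
    """Drop functions identified as library or thunk functions."""
    return {
        rva: feats
        for rva, feats in binary_features.items()
        # Assumption: a missing flag is treated as false.
        if not feats.get("isLibrary", False) and not feats.get("isThunk", False)
    }


def is_generated_name(proc_name):
    """True if the name has the form sub_ followed by hexadecimal digits."""
    return re.fullmatch(r"sub_[0-9A-Fa-f]+", proc_name) is not None
```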

Feature Fields

The feature fields are summarized in this section. The details are in the paper “Fast location of similar code fragments using semantic juice”.

api_calls
The list of APIs called from this function.
code
The disassembly of this function. This is a list of lists where each sublist represents a disassembly block. Each block contains a list of the instructions that compose the block.
code_size
The number of instructions in the disassembly.
gen_code
Generalized disassembly. This is the same as code, but with the operands in code abstracted to variables instead of register names.
gen_code_size
The number of generalized instructions in gen_code.
semantics
The semantics of the instructions in code. This is a list of lists where each inner list corresponds to a block from code. This inner list contains the effect that executing the represented block will have on registers and memory locations.
semantics_size
The number of individual semantics statements in semantics.
gen_semantics
Generalized semantics. This is the same as semantics but with registers and memory locations abstracted to variables.
gen_semantics_size
The number of individual semantics statements in gen_semantics.
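
The list-of-lists layout and the size fields can be cross-checked with a short sketch. It assumes each size field counts the entries across all blocks of its list, which is an interpretation of the descriptions above rather than a documented guarantee. The function names are hypothetical, and the abridged example in this section cannot be used to verify the counts because its lists are elided.

```python
# Each pair is (list field, size field) as named in this section.
SIZE_FIELDS = [
    ("code", "code_size"),
    ("gen_code", "gen_code_size"),
    ("semantics", "semantics_size"),
    ("gen_semantics", "gen_semantics_size"),
]


def count_entries(blocks):
    """Total number of entries across all blocks of a list-of-lists field."""
    return sum(len(block) for block in blocks)


def size_mismatches(feats):
    """Return the size fields whose value differs from the counted entries."""
    return [
        size_key
        for list_key, size_key in SIZE_FIELDS
        if list_key in feats
        and size_key in feats
        and count_entries(feats[list_key]) != feats[size_key]
    ]
```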