Learning miasm: Part 1: Loading a PE

January 9, 2020 #python #reverse-engineering #miasm #disassembly

With the turning of the decade, Python 2.7 is unmaintained. This is as good time as any to review the frameworks and libraries that we rely upon - and possibly doing updates or refreshes. For many years, I’ve reached for vivisect when I needed to programmatically disassemble a file. Its pure Python (so easy distributable) and provides many features out of the box (disassembler, emulator, debugger, and symbolic analysis). Unfortunately, its strictly py2 with few signs of change.

miasm seems like a promising alternative that supports Python 3.x. In their words:

Miasm is a free and open source (GPLv2) reverse engineering framework. Miasm aims to analyze / modify / generate binary programs. Here is a non exhaustive list of features:

Opening / modifying / generating PE / ELF 32 / 64 LE / BE

Assembling / Disassembling X86 / ARM / MIPS / SH4 / MSP430

Representing assembly semantic using intermediate language

Emulating using JIT (dynamic code analysis, unpacking, …)

Expression simplification for automatic de-obfuscation

I’m going to explore miasm and see if it offers features that I’d find useful as a reverse engineer. As I have time, I’ll post notes, code snippets, impressions, and conclusions.

Things I probably want

I work as a reverse engineer on the FLARE team analyzing malware and building tools. Therefore, I value:

pure Python code, so that I can easily distribute via PyPI package manager and/or PyInstaller executables
32- and 64-bit Intel architecture, Windows platform, and PE format analysis, because that’s what I usually deal with
flexibility to deal with both well-known and novel obfuscation
operations:
- parse file formats into structures
- read data from addresses, via RVA and EA / VA
- disassemble regions and functions
- navigate code and data cross references
- emulate sequences of instructions

While other languages and frameworks may have better performance, Python is a language that many of my colleagues are familiar with. I’m willing to tolerate performance within approximately an order of magnitude of IDA Pro or vivisect.

In the back of my mind, I’m wondering if we could port FLOSS to miasm so that we can get all the benefits of py3.

Installation

To install:

$ pip install https://github.com/cea-sec/miasm/zipball/master pyparsing

This installs miasm directly from the master branch on Github. You could install from PyPI like pip install miasm==0.1.3, but master has a few recent fixes for Windows/PE that look nice (1, 2, 3, 4).

Although miasm supports native libraries (LLVM, GCC, and Z3) to enable better performance, we’ll ignore that for now.

Parsing a PE file

note: after chatting with @serpilliere and @commial, I’ve learned that using miasm.jitter.loader and/or miasm.analysis.sandbox may be more idiomatic. Therefore, I’ve updated this section to include references to miasm.jitter.loader routines below.

You can load a PE file into miasm like this:

import miasm

with open('kernel32-32.dll', 'rb') as f:
    buf = f.read()

pe = miasm.analysis.binary.Container.from_string(buf)

I used Container.from_string() rather than Container.from_stream() because I often analyze samples that I never save to my hard drive (e.g. unpacked by a script or fetched via network). I wonder if using from_stream() consumes fewer resources because large files are not slurped into memory?

Accessing the entry point is easy:

assert pe.entry_point == 0x6b81fd70

This appears to be a property that is available across different file formats.

enumerating imports

Miasm provides low level access to the parsed PE file. If you’re familiar with it, then you can fetch the raw data:

dll = pe.executable.DirImport.impdesc[0]
assert dll.dlldescname.name.decode('ascii') == 'api-ms-win-core-rtlsupport-l1-1-0.dll'

imp = dll.impbynames[0]
assert imp.name.decode('ascii') == 'RtlCaptureStackBackTrace'

On the one hand, this requires more specific handling than something like IDAPython, where the following works for many file formats:

import idaapi

imports = []
for i in range(0, idaapi.get_import_module_qty()):
    dll = idaapi.get_import_module_name(i)
    if not dll:
        continue

    def imp_cb(ea, name, ord):
        if name:
            imports.append((dll, name))
        else:
            imports.append((dll, ord))
        return True

    idaapi.enum_import_names(i, imp_cb)

assert imports[0] == ('api-ms-win-core-rtlsupport-l1-1-0.dll', 'RtlCaptureStackBackTrace')

On the other hand, miasm lets you 1) easily access the raw data, and 2) specialize your handling of edge cases. I found this to be a recurring pattern with miasm: it provides low level tools that require a bit more effort up front, but also enable more flexible analysis.

If you’d prefer a higher-level API, then you can use the routines found in miasm.jitter.loader.pe. For example, get_pe_dependencies fetches the DLLs upon which the PE depends:

imported_dlls = miasm.jitter.loader.pe.get_pe_dependencies(pe.executable)
assert sorted(imported_dlls)[0] == 'api-ms-win-core-appcompat-l1-1-0.dll'

And get_import_address_pe computes the addresses of imported symbols:

imports = miasm.jitter.loader.pe.get_import_address_pe(pe.executable)
assert sorted(imports.keys)[0] == ('api-ms-win-core-appcompat-l1-1-0.dll', 'BaseCheckAppcompatCache')
assert imports[('api-ms-win-core-appcompat-l1-1-0.dll', 'BaseCheckAppcompatCache')] == 0x6b880c48

enumerating exports

Likewise, for exports, miasm provides the tools for you to parse the data structures semi-manually. In this case, we need to walk a couple lists, handling some edge cases (such as forwarded exports) along the way.

Here’s the wrapper function I’ve been using:

import collections
Export = collections.namedtuple("Export", ["ordinal", "names", "rva", "is_forwarded"])

def get_exports(pe):
    directory_entry_export = pe.executable.NThdr.optentries[miasm.loader.pe.DIRECTORY_ENTRY_EXPORT]
    export_directory_range = (directory_entry_export.rva, directory_entry_export.rva + directory_entry_export.size)

    # ref: https://github.com/cea-sec/miasm/blob/master/miasm/loader/pe.py#L740
    exported_names = [[] for _ in range(pe.executable.DirExport.expdesc.numberoffunctions)]
    for i, entry in enumerate(pe.executable.DirExport.f_names):
        exported_names[pe.executable.DirExport.f_nameordinals[i].ordinal].append(entry.name)

    exports = []
    for i, entry in enumerate(pe.executable.DirExport.f_address):
        if not entry.rva:
            continue

        exports.append(Export(
            ordinal=i + pe.executable.DirExport.expdesc.base,
            names=exported_names[i],
            rva=entry.rva,

            # forwarded exports can be identified by checking if 
            # the target address falls within the export table
            # (and it will dereference to an ASCII string).
            # ref: https://reverseengineering.stackexchange.com/a/21110/17194
            is_forwarded = export_directory_range[0] <= entry.rva < export_directory_range[1]
        ))
    return exports

You can use it like this:

assert str(get_exports(pe)[0]) == "Export(ordinal=1, names=[<DescName=b'BaseThreadInitThunk'>], rva=131424, is_forwarded=False)"

One thing I think is cool is how miasm’s approach (derived from their repr routine) encouraged me to prepare for multiple exported names to point to the same address (ordinals, too). See how the name field is a list of names, rather than single, inline string? Surely I would have eventually remembered this case the hard way, so I appreciate that the framework guided me away from a potential bug.

I don’t really like how some symbols are named inconsistently. For example, we see camel case in DirExport, snake case in f_name, smooshed words in numberoffunctions, and semi-shorted words in expdesc. I’ve been using a Jupyter notebook with tab completion heavily to understand what fields are availble.

Also, we can use the high-level routines in miasm.jitter.loader.pe to fetch exports:

exports = miasm.jitter.loader.pe.get_export_name_addr_list(pe.executable)
assert exports[0] == ('AcquireSRWLockExclusive', 0x6b89b22a)

enumerating sections

It’s easy to enumerate sections using the raw parsed structures:

for section in pe.executable.SHList.shlist:
    name = section.name.partition(b'\x00')[0].decode('ascii')
    va = section.addr
    size = section.size
    print(name, hex(va), hex(size))

I haven’t found a higher-level API that automatically decodes the section names, yet.

Conclusion

This is the first set of notes. Next, I’ll share how I learned to disassemble instructions in an executable and recognize functions. Maybe we’ll get to emulation (haven’t attempted this yet).

Overall, I’m happy with miasm for a few reasons:

the PE parser seems fairly thorough and exposes its internals to the user
documention and docstrings are sometime present, for example, see locationdb.py
code style is consistent and easy to understand
no bugs encountered yet