With the turning of the decade, Python 2.7 is unmaintained. This is as good a time as any to review the frameworks and libraries that we rely upon, and possibly to update or refresh them. For many years, I've reached for vivisect when I needed to programmatically disassemble a file. It's pure Python (so easily distributable) and provides many features out of the box (disassembler, emulator, debugger, and symbolic analysis). Unfortunately, it's strictly py2, with few signs of change.
miasm seems like a promising alternative that supports Python 3.x. In their words:
Miasm is a free and open source (GPLv2) reverse engineering framework. Miasm aims to analyze / modify / generate binary programs. Here is a non exhaustive list of features:
- Opening / modifying / generating PE / ELF 32 / 64 LE / BE
- Assembling / Disassembling X86 / ARM / MIPS / SH4 / MSP430
- Representing assembly semantic using intermediate language
- Emulating using JIT (dynamic code analysis, unpacking, …)
- Expression simplification for automatic de-obfuscation
I'm going to explore miasm and see if it offers features that I'd find useful as a reverse engineer. As I have time, I'll post notes, code snippets, impressions, and conclusions.
Things I probably want
I work as a reverse engineer on the FLARE team analyzing malware and building tools. Therefore, I value:
- pure Python code, so that I can easily distribute via PyPI package manager and/or PyInstaller executables
- 32- and 64-bit Intel architecture, Windows platform, and PE format analysis, because that's what I usually deal with
- flexibility to deal with both well-known and novel obfuscation
- parse file formats into structures
- read data from addresses, via RVA and EA / VA
- disassemble regions and functions
- navigate code and data cross references
- emulate sequences of instructions
While other languages and frameworks may have better performance, Python is a language that many of my colleagues are familiar with. I'm willing to tolerate performance within approximately an order of magnitude of IDA Pro or vivisect.
In the back of my mind, I'm wondering if we could port FLOSS to miasm so that we can get all the benefits of py3.
$ pip install https://github.com/cea-sec/miasm/zipball/master pyparsing
This installs miasm directly from the master branch on GitHub.
You could also install from PyPI with pip install miasm==0.1.3, but master has a few recent fixes for Windows/PE that look nice.
Although miasm supports native libraries (LLVM, GCC, and Z3) to enable better performance, we'll ignore that for now.
Parsing a PE file
note: after chatting with @serpilliere and @commial, I've learned that using miasm.analysis.sandbox may be more idiomatic. Therefore, I've updated this section to include references to
You can load a PE file into miasm like this:
import miasm.analysis.binary

# read the whole sample into memory, then hand the bytes to miasm
with open('kernel32-32.dll', 'rb') as f:
    buf = f.read()

pe = miasm.analysis.binary.Container.from_string(buf)
I use Container.from_string() rather than Container.from_stream() because I often analyze samples that I never save to my hard drive (e.g. unpacked by a script or fetched via network).
I wonder if using from_stream() consumes fewer resources because large files are not slurped into memory?
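Because samples can arrive as raw bytes, as an open stream, or as a path on disk, a small normalizing helper can keep the loading code uniform. The sketch below uses only the standard library; read_sample is a hypothetical name of my own, not part of miasm, and it does not answer the memory question above since it always reads everything.

```python
import io


def read_sample(source):
    """Return the raw bytes of a sample.

    `source` may be bytes, a binary stream, or a filesystem path.
    Hypothetical helper for illustration; not a miasm API.
    """
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if hasattr(source, "read"):
        # any binary file-like object, e.g. io.BytesIO or an open file
        return source.read()
    with open(source, "rb") as f:
        return f.read()


# a made-up buffer standing in for an unpacked sample
unpacked = b"MZ" + b"\x00" * 62
from_bytes = read_sample(unpacked)
from_stream = read_sample(io.BytesIO(unpacked))
```

The returned bytes could then be passed to Container.from_string() as shown earlier.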
Accessing the entry point is easy:
assert pe.entry_point == 0x6b81fd70
This appears to be a property that is available across different file formats.
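As a cross-check, the same value can be derived by hand from the PE headers. The sketch below assumes that entry_point is a virtual address, that is, ImageBase plus AddressOfEntryPoint, which the value above suggests but which I have not confirmed in miasm's source. It uses only the struct module; entry_point_va is a hypothetical helper name and the demo header contains made-up values.

```python
import struct


def entry_point_va(buf):
    """Compute ImageBase + AddressOfEntryPoint from raw PE bytes."""
    if buf[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew, at offset 0x3C of the DOS header, points to the PE signature
    (e_lfanew,) = struct.unpack_from("<I", buf, 0x3C)
    if bytes(buf[e_lfanew:e_lfanew + 4]) != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # skip the 4-byte signature and the 20-byte COFF file header
    opt = e_lfanew + 4 + 20
    (magic,) = struct.unpack_from("<H", buf, opt)
    (entry_rva,) = struct.unpack_from("<I", buf, opt + 16)
    if magic == 0x10B:    # PE32: 4-byte ImageBase at offset 28
        (image_base,) = struct.unpack_from("<I", buf, opt + 28)
    elif magic == 0x20B:  # PE32+: 8-byte ImageBase at offset 24
        (image_base,) = struct.unpack_from("<Q", buf, opt + 24)
    else:
        raise ValueError("unknown optional header magic")
    return image_base + entry_rva


# Synthetic header with made-up values, for demonstration only;
# this is not a loadable PE file.
demo = bytearray(0x100)
demo[0:2] = b"MZ"
struct.pack_into("<I", demo, 0x3C, 0x80)         # e_lfanew
demo[0x80:0x84] = b"PE\x00\x00"
struct.pack_into("<H", demo, 0x98, 0x10B)        # PE32 magic
struct.pack_into("<I", demo, 0x98 + 16, 0x1000)  # AddressOfEntryPoint
struct.pack_into("<I", demo, 0x98 + 28, 0x10000000)  # ImageBase
demo_entry = entry_point_va(demo)
```

Against a real sample, the result could be compared with pe.entry_point.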
Miasm provides low-level access to the parsed PE file. If you're familiar with the PE format, then you can fetch the raw data:
# first import descriptor, and the first symbol imported from it
dll = pe.executable.DirImport.impdesc[0]
assert dll.dlldescname.name.decode('ascii') == 'api-ms-win-core-rtlsupport-l1-1-0.dll'
imp = dll.impbynames[0]
assert imp.name.decode('ascii') == 'RtlCaptureStackBackTrace'
On the one hand, this requires more specific handling than something like IDAPython, where the following works for many file formats:
import idaapi

imports = []
for i in range(0, idaapi.get_import_module_qty()):
    dll = idaapi.get_import_module_name(i)
    if not dll:
        continue

    # called once per imported symbol; return True to keep enumerating
    def imp_cb(ea, name, ord):
        if name:
            imports.append((dll, name))
        else:
            imports.append((dll, ord))
        return True

    idaapi.enum_import_names(i, imp_cb)

assert imports[0] == ('api-ms-win-core-rtlsupport-l1-1-0.dll', 'RtlCaptureStackBackTrace')
On the other hand, miasm lets you 1) easily access the raw data, and 2) specialize your handling of edge cases. I found this to be a recurring pattern with miasm: it provides low level tools that require a bit more effort up front, but also enable more flexible analysis.
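To make the "bit more effort up front" concrete, here is a minimal sketch of walking every import descriptor rather than just the first one. It relies only on the attribute names used above (dlldescname, name, impbynames) and assumes that imports by ordinal appear as plain integers in impbynames, which I believe is how miasm represents them but have not verified here. iter_imports is a hypothetical helper, and the stand-in objects are made up so the snippet runs without miasm.

```python
from types import SimpleNamespace


def iter_imports(import_descriptors):
    """Yield (dll_name, symbol) pairs.

    `symbol` is a str for imports by name, or an int for imports by ordinal.
    """
    for desc in import_descriptors:
        dll = desc.dlldescname.name.decode("ascii").lower()
        for imp in desc.impbynames:
            if hasattr(imp, "name"):
                yield dll, imp.name.decode("ascii")
            else:
                # assumption: an entry without a name is a bare ordinal
                yield dll, imp


# made-up stand-ins that mimic the shape of the parsed structures
fake_descriptors = [
    SimpleNamespace(
        dlldescname=SimpleNamespace(name=b"EXAMPLE.dll"),
        impbynames=[SimpleNamespace(name=b"ExampleFunction"), 17],
    )
]
found = list(iter_imports(fake_descriptors))
```

With a real file, the argument would presumably be pe.executable.DirImport.impdesc.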
If you'd prefer a higher-level API, then you can use the routines found in
get_pe_dependencies fetches the DLLs upon which the PE depends:
import miasm.jitter.loader.pe

imported_dlls = miasm.jitter.loader.pe.get_pe_dependencies(pe.executable)
assert sorted(imported_dlls)[0] == 'api-ms-win-core-appcompat-l1-1-0.dll'
get_import_address_pe computes the addresses of imported symbols:
imports = miasm.jitter.loader.pe.get_import_address_pe(pe.executable)
assert sorted(imports.keys())[0] == ('api-ms-win-core-appcompat-l1-1-0.dll', 'BaseCheckAppcompatCache')
assert imports[('api-ms-win-core-appcompat-l1-1-0.dll', 'BaseCheckAppcompatCache')] == 0x6b880c48
Likewise, for exports, miasm provides the tools for you to parse the data structures semi-manually. In this case, we need to walk a couple lists, handling some edge cases (such as forwarded exports) along the way.
Here's the wrapper function I've been using:
import collections

import miasm.loader.pe

Export = collections.namedtuple("Export", ["ordinal", "names", "rva", "is_forwarded"])

def get_exports(pe):
    directory_entry_export = pe.executable.NThdr.optentries[miasm.loader.pe.DIRECTORY_ENTRY_EXPORT]
    # (start, end) RVAs of the export directory itself
    export_directory_range = (directory_entry_export.rva,
                              directory_entry_export.rva + directory_entry_export.size)

    # collect the names that refer to each slot of the address table
    # ref: https://github.com/cea-sec/miasm/blob/master/miasm/loader/pe.py#L740
    exported_names = [[] for _ in range(pe.executable.DirExport.expdesc.numberoffunctions)]
    for i, entry in enumerate(pe.executable.DirExport.f_names):
        exported_names[pe.executable.DirExport.f_nameordinals[i].ordinal].append(entry.name)

    exports = []
    for i, entry in enumerate(pe.executable.DirExport.f_address):
        if not entry.rva:
            continue
        exports.append(Export(
            ordinal=i + pe.executable.DirExport.expdesc.base,
            names=exported_names[i],
            rva=entry.rva,
            # forwarded exports can be identified by checking if
            # the target address falls within the export table
            # (and it will dereference to an ASCII string).
            # ref: https://reverseengineering.stackexchange.com/a/21110/17194
            is_forwarded=export_directory_range[0] <= entry.rva < export_directory_range[1],
        ))
    return exports
You can use it like this:
assert str(get_exports(pe)[0]) == "Export(ordinal=1, names=[<DescName=b'BaseThreadInitThunk'>], rva=131424, is_forwarded=False)"
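When is_forwarded is true, the RVA points at an ASCII forwarder string of the form DLL.Function or DLL.#ordinal rather than at code. A rough sketch of splitting such a string follows; parse_forwarder is a hypothetical helper, the sample strings are illustrative, and splitting on the last dot is my assumption rather than something taken from miasm.

```python
def parse_forwarder(raw):
    """Split a forwarder string such as b'NTDLL.RtlAllocateHeap' into (dll, target).

    `target` is a str for forwards by name, or an int for forwards by ordinal.
    """
    # the string is NUL-terminated inside the export directory
    text = raw.split(b"\x00", 1)[0].decode("ascii")
    # assumption: the last dot separates the module from the symbol
    dll, sep, target = text.rpartition(".")
    if not sep or not dll or not target:
        raise ValueError("not a forwarder string: %r" % text)
    if target.startswith("#"):
        return dll, int(target[1:])
    return dll, target


by_name = parse_forwarder(b"NTDLL.RtlAllocateHeap\x00")
by_ordinal = parse_forwarder(b"EXAMPLE.#12")
```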
One thing I think is cool is how miasm's approach (derived from their repr routine) encouraged me to prepare for multiple exported names to point to the same address (ordinals, too). See how the names field is a list of names, rather than a single, inline string? Surely I would have eventually remembered this case the hard way, so I appreciate that the framework guided me away from a potential bug.
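To illustrate why a list is the safer shape, here is a small sketch that groups names by address so that aliases become visible. It reuses the same Export tuple layout as the wrapper above; group_by_rva is a hypothetical helper, and the sample records use plain strings and made-up values rather than miasm's name objects.

```python
import collections

Export = collections.namedtuple("Export", ["ordinal", "names", "rva", "is_forwarded"])


def group_by_rva(exports):
    """Map each RVA to the sorted list of names exported at that address."""
    by_rva = collections.defaultdict(list)
    for export in exports:
        by_rva[export.rva].extend(export.names)
    return {rva: sorted(names) for rva, names in by_rva.items()}


# made-up records: two names and an unnamed ordinal share one address
sample_exports = [
    Export(ordinal=1, names=["alpha", "alpha_alias"], rva=0x1000, is_forwarded=False),
    Export(ordinal=2, names=["beta"], rva=0x2000, is_forwarded=False),
    Export(ordinal=3, names=[], rva=0x1000, is_forwarded=False),
]
aliases = group_by_rva(sample_exports)
```

Any address with more than one name, or reached by more than one ordinal, is an alias that a single-string name field would have hidden.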
I don't really like how some symbols are named inconsistently. For example, we see camel case in DirExport, snake case in f_names, smooshed words in numberoffunctions, and semi-shortened words in
I've been relying heavily on a Jupyter notebook with tab completion to understand what fields are available.
Also, we can use the high-level routines in miasm.jitter.loader.pe to fetch exports:
exports = miasm.jitter.loader.pe.get_export_name_addr_list(pe.executable)
assert exports[0] == ('AcquireSRWLockExclusive', 0x6b89b22a)
It's easy to enumerate sections using the raw parsed structures:
for section in pe.executable.SHList.shlist:
    # section names are NUL-padded; keep only the part before the first NUL
    name = section.name.partition(b'\x00')[0].decode('ascii')
    va = section.addr
    size = section.size
    print(name, hex(va), hex(size))
I haven't found a higher-level API that automatically decodes the section names, yet.
This is the first set of notes. Next, I'll share how I learned to disassemble instructions in an executable and recognize functions. Maybe we'll get to emulation (haven't attempted this yet).
Overall, I'm happy with miasm for a few reasons:
- the PE parser seems fairly thorough and exposes its internals to the user
- documentation and docstrings are sometimes present; for example, see locationdb.py
- code style is consistent and easy to understand
- no bugs encountered yet