Parsing Binary Data with `vstruct`
vstruct is a pure Python module for parsing and serializing binary data. It is a submodule of the
vivisect project for binary analysis developed by Invisig0th Kenshoto.
vstruct has been developed and tested over many years, and remains integral to a number of production systems. It's also simple and fun to learn!
vstruct is often a better choice than manually writing imperative scripts using the
struct module. Code developed using vstruct tends to be heavily declarative, which removes much of the boilerplate typically required when writing binary parsing code. Declarative code emphasizes the important aspects of binary parsing: offsets, sizes, and types. This makes maintenance of
vstruct-based parsers easier in the long run.
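To make the contrast concrete, here is a minimal sketch of the imperative style using only the standard-library struct module. The record layout (a 4-byte magic, a uint16 version, and a uint32 length) and the names HEADER_FORMAT and parse_header are hypothetical, chosen purely for illustration. Note how the offsets, sizes, and types end up buried in a format string and manual bookkeeping.

```python
import struct

# Hypothetical record layout, for illustration only:
#   offset 0: 4-byte magic
#   offset 4: uint16 version (little endian)
#   offset 6: uint32 payload length (little endian)
HEADER_FORMAT = "<4sHI"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # "<" disables alignment padding


def parse_header(buf, offset=0):
    """Imperatively unpack the hypothetical header and return a dict."""
    if len(buf) - offset < HEADER_SIZE:
        raise ValueError("buffer too short for header")
    magic, version, length = struct.unpack_from(HEADER_FORMAT, buf, offset)
    return {"magic": magic, "version": version, "length": length}


header = parse_header(b"DEMO\x01\x00\x10\x00\x00\x00")
print(header)
```

Every new field means touching the format string, the tuple unpacking, and the returned dict, which is the kind of boilerplate a declarative parser avoids.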
vstruct is a part of the
vivisect project. It is currently Python 2.7 compatible, although a development fork exists for Python 3.x. Neither
vivisect nor its subprojects are distributed with
setup.py files, so you should download the
vstruct source directory into one of your Python path directories (such as the current directory):
Declaring a dependency on
vstruct from a Python module via
setup.py is a bit tricky, so I maintain a PyPI-mirrored package named
vivisect-vstruct-wb. This allows you to use
pip to install
I’ve also updated this mirror to support both Python 2.7 and Python 3.0 interpreters, so you can probably use the
vivisect-vstruct-wb mirror for many of your future projects. Always refer to Visi’s definitive source on GitHub before reporting bugs.
Here’s a “Hello World!” script that uses
vstruct to parse a little endian 32-bit unsigned integer from a byte string:
Note how you first create an instance of the
v_uint32 type, parse a byte string using the
.vsParse() method, and then treat the result like a native Python type instance. To be extra safe, I like to explicitly convert the parsed object into a true Python type:
vstruct-specific operations are defined as methods with the
vs prefix. You can find these methods on (almost) all
vstruct-derived parsers. Although I most commonly use
.vsSetLength(), it's good to be familiar with all the operations. Here’s a short summary of each:
.vsParse()- parse instance from a byte string.
.vsParseFd()- parse instance from file-like object (must have
.vsEmit()- serialize instance into a byte string.
.vsSetValue()- set instance’s data from a native Python instance.
.vsGetValue()- get copy of instance’s data as a native Python instance.
.vsSetLength()- set length of array type, such as
True if instance is a simple “primitive” type.
.vsGetTypeName()- get string containing name of instance’s type.
.vsGetEnum()- fetch associated
v_number instance, if it exists.
At this point,
vstruct probably seems like an over-engineered clone of
struct.unpack, so let's dive into a cooler feature.
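Before moving on, and for reference, this is roughly the struct.unpack call that the hello-world example resembles: a sketch that decodes one little-endian 32-bit unsigned integer using only the standard library. The sample bytes are an assumption chosen for illustration.

```python
import struct

# Hypothetical sample input: four bytes holding a little-endian uint32.
data = b"\x01\x02\x03\x04"

# "<I" means little endian, unsigned 32-bit integer.
(value,) = struct.unpack("<I", data)

print(hex(value))  # 0x4030201
```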
vstruct parsers are typically class-based. The module provides a set of primitive datatypes (like
v_wstr for DWORD and wide strings, respectively) and a mechanism for combining them into complex datatypes (
VStructs). First, here are the primitive types:
vstruct.primitives.v_int8- signed integer.
vstruct.primitives.v_uint8- unsigned integer.
vstruct.primitives.v_bytes- sequence of raw bytes with explicit length.
vstruct.primitives.v_str- ASCII string with explicit length.
vstruct.primitives.v_wstr- wide string with explicit length.
vstruct.primitives.v_zstr- ASCII string with NULL terminator.
vstruct.primitives.v_zwstr- wide string with NULL terminator.
vstruct.primitives.v_enum- interpretation for integer types.
vstruct.primitives.v_bitmask- interpretation for integer types.
Complex parsers are developed by defining subclasses of the
vstruct.VStruct class that contain member variables that are instances of
vstruct primitives or other complex
VStruct types. Whoa! Let's digest that sentence part by part.
Complex parsers are developed by defining subclasses of the
In this example, we define the PE header of a Windows executable file using
vstruct. The name of our parser is
IMAGE_NT_HEADERS, and it inherits from the class
vstruct.VStruct. We have to explicitly invoke the super constructor in our
__init__() method; we can use either style:
…that contain member variables that are instances of
The first member variable of an
IMAGE_NT_HEADERS instance is a
v_bytes instance that holds four bytes.
v_bytes instances are commonly used for raw byte sequences that don’t get parsed further. In this example, the
.Signature member variable will hold the magic sequence “PE\x00\x00” when parsed from a valid PE.
Additional member variables can be added to the class definition to parse a sequence of fields from binary data.
VStruct classes track the order of declaration of member variables, and handle all other associated bookkeeping. Your only remaining job is to decide which types to use in which order. Easy!
When structures share common sub-structures, you can extract them into reusable
VStruct definitions that behave just like
[Complex parsers are developed by defining classes that contain] other complex
VStruct instance parses binary data and encounters a complex member variable, it recurses into the subparser. In this example, the
.FileHeader member variable is a complex type defined here. The
IMAGE_NT_HEADERS parser will first consume four bytes for the
.Signature field, and then pass parsing control to the
IMAGE_FILE_HEADER complex parser. We’d need to inspect that class’s definition to determine its size and layout.
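To see what the nested parser has to consume, the same layout can be sketched by hand with the standard-library struct module. The field list follows the IMAGE_FILE_HEADER layout from the PE format. The helper name parse_nt_headers_prefix and the sample bytes are hypothetical, and this is the imperative equivalent rather than vstruct code.

```python
import struct
from collections import namedtuple

# IMAGE_FILE_HEADER fields, in order, per the PE format (20 bytes total).
FileHeader = namedtuple(
    "FileHeader",
    "Machine NumberOfSections TimeDateStamp PointerToSymbolTable "
    "NumberOfSymbols SizeOfOptionalHeader Characteristics",
)
FILE_HEADER_FORMAT = "<HHIIIHH"
SIGNATURE_SIZE = 4


def parse_nt_headers_prefix(buf, offset=0):
    """Return (signature, FileHeader) from the start of IMAGE_NT_HEADERS."""
    signature = buf[offset:offset + SIGNATURE_SIZE]
    fields = struct.unpack_from(FILE_HEADER_FORMAT, buf, offset + SIGNATURE_SIZE)
    return signature, FileHeader(*fields)


# Hypothetical sample: "PE\0\0", then a header for an x86 image with 3 sections.
sample = b"PE\x00\x00" + struct.pack(
    FILE_HEADER_FORMAT, 0x014C, 3, 0, 0, 0, 0xE0, 0x0102
)
signature, file_header = parse_nt_headers_prefix(sample)
print(signature, hex(file_header.Machine), file_header.NumberOfSections)
```

The signature takes four bytes and the file header takes twenty, so anything that follows the file header starts at offset 24 of the structure.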
I recommend developing
VStruct classes that each describe a small segment of a file format, and combine them together using a master top level
VStruct. This makes it easier to debug, as each fragment of the parser can be independently verified. Anyways, once you’ve defined a
VStruct, you can parse data with it using the pattern described at the beginning of the document:
During command 9, we open a sample PE file and read its contents into a byte string. During command 10, we show a short hexdump of the start of the PE header. During command 11, we create an instance of the
IMAGE_NT_HEADERS class, but note that it doesn’t yet contain any parsed data. We explicitly parse a byte string containing the PE headers on command 12. During commands 13 and 14, we demonstrate accessing members of the parsed instance. Note that when we access an embedded complex
VStruct, we can continue to index into it, but when we access a primitive member, we get its native Python representation. That’s really convenient!
While debugging, we can use the
.tree() method to print a human-readable representation of the parsed data:
Advanced topics in
Because a VStruct’s layout is declared in the type’s
__init__() constructor method, it can react to parameters and optionally include members. For example, a
VStruct that behaves differently in 32- or 64-bit environments might look something like:
This is a very powerful technique, though it's a bit tricky to get right. It's important to understand when the layout is finalized, and when it is evaluated and used to parse binary data. When
__init__() is called, the instance does not have access to the data it will be parsing. Member variables only get populated with parsed data once
.vsParse() is called. Therefore, a
VStruct constructor cannot refer to the contents of a member instance to decide how to continue parsing. For example, the following DOES NOT WORK:
To properly handle dynamic parsers, we need to use
vstruct callbacks. When a
VStruct instance finishes parsing a member field, it checks to see if the class has a specially named method prefixed with
pcb_ (Parser Call Back), and invokes it. The remainder of the method name is the name of the just-parsed field; for example, once
BazDataRegion.data_size is parsed, the method named
BazDataRegion.pcb_data_size would be invoked, if it existed.
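The dispatch rule can be illustrated with a toy stand-in written in plain Python. This is a sketch of the naming convention only, not vstruct's implementation: ToyStruct, its _fields table, the flags field, and the struct format strings are all hypothetical.

```python
import struct


class ToyStruct(object):
    """Toy parser: after each field is parsed, call pcb_<field> if defined."""

    # (field name, struct format) pairs, parsed in declaration order.
    _fields = [("data_size", "<I"), ("flags", "<H")]

    def __init__(self):
        self.values = {}
        self.calls = []

    def parse(self, buf, offset=0):
        for name, fmt in self._fields:
            (self.values[name],) = struct.unpack_from(fmt, buf, offset)
            offset += struct.calcsize(fmt)
            # Look up an optional callback named after the just-parsed field.
            callback = getattr(self, "pcb_" + name, None)
            if callback is not None:
                callback()
        return offset

    def pcb_data_size(self):
        # Runs right after data_size is parsed; flags has not been parsed yet.
        self.calls.append(("data_size", dict(self.values)))


toy = ToyStruct()
consumed = toy.parse(b"\x02\x00\x00\x00\x07\x00")
print(consumed, toy.calls)
```

The snapshot recorded by the callback contains data_size but not flags, which shows that the instance is only partially populated when the callback runs.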
This is important because when the callback is invoked, the
VStruct instance is partially populated with parsed data. For example:
This means we can defer the final initialization of a class’s layout until some binary data has been parsed. Here’s the correct way of implementing a sized buffer:
During command 19, we declare a structure that has a header field (
.data_size) that describes the size of subsequent raw data (
.data_data). Since we do not have the parsed header value when
.__init__() is called, we use a callback named
.pcb_data_size() that will be invoked as soon as the
.data_size field is parsed. When the callback executes, it updates the size of the
.data_data byte array so that it consumes the correct number of bytes. During command 20 we create an instance of the parser, and on command 21 parse a byte string. Although we pass in 13 bytes, we expect only six bytes to be consumed: four by the
.data_size uint32, and two for the
.data_data byte array. The remaining bytes are not processed. During command 22 we confirm that the parser correctly interpreted the binary data.
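The byte accounting above can be checked by hand. The sketch below uses the standard-library struct module rather than vstruct, and the 13-byte sample buffer is an assumption that matches the sizes described: a uint32 length of two, two data bytes, and seven unrelated trailing bytes. The helper name parse_sized_buffer is hypothetical.

```python
import struct

# Hypothetical 13-byte input: uint32 size (2), two data bytes, seven extra bytes.
buf = b"\x02\x00\x00\x00" + b"\xAA\xBB" + b"\xCC" * 7


def parse_sized_buffer(buf, offset=0):
    """Return (data_size, data, bytes_consumed) for a length-prefixed buffer."""
    (data_size,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    data = buf[start:start + data_size]
    if len(data) != data_size:
        raise ValueError("buffer too short for declared data_size")
    return data_size, data, 4 + data_size


data_size, data, consumed = parse_sized_buffer(buf)
print(data_size, data, consumed)
```

Only six of the thirteen bytes are consumed, and the size of the second field is known only after the first has been decoded, which is exactly the ordering problem the callback solves.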
Note that during the
.pcb_data_size() callback, we accessed the
VStruct instance object named
.data_data by using square brackets. This is because we want to modify that sub-instance itself, and not fetch the concrete parsed value from that sub-instance. It takes a little practice to figure out which technique to use (
self["field0"].xyz), but usually if you want a concrete parsed value, avoid square brackets. Here, we wanted the sub-instance itself, so we used them.
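The difference between the two access styles can be mimicked with a toy container in plain Python. This is only a sketch of the behavior described above, not vstruct's code: ToyField, ToyContainer, and the length attribute are hypothetical names.

```python
class ToyField(object):
    """Toy field object holding a parsed value and a length."""

    def __init__(self, value=b"", length=0):
        self.value = value
        self.length = length


class ToyContainer(object):
    """Attribute access yields the parsed value; brackets yield the field object."""

    def __init__(self):
        # Keep fields in a private dict so __getattr__ handles member names.
        self.__dict__["_fields"] = {"data_data": ToyField(b"\xAA\xBB", 2)}

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, i.e. for field names.
        fields = self.__dict__.get("_fields", {})
        if name not in fields:
            raise AttributeError(name)
        return fields[name].value

    def __getitem__(self, name):
        return self._fields[name]


box = ToyContainer()
print(box.data_data)             # concrete parsed value
box["data_data"].length = 4      # modify the field object itself
print(box["data_data"].length)
```

Attribute access hands back a plain value that cannot be used to resize the field, while bracket access returns the field object whose settings can be changed.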
vstruct is a powerful module for developing modular and maintainable binary parsers. It removes much of the boilerplate code from the development process. I’ve enjoyed using
vstruct to parse malware C2 protocols, database indexes, and binary XML files. I recommend you review working
vstruct parsers from the following projects: