Reverse Engineering Protobuf Definitions From Compiled Binaries

2024-03-09 14:21:24

A few years ago I released protodump, a CLI for extracting full source protobuf definitions from compiled binaries (regardless of the target architecture). This can come in handy if you're trying to reverse engineer an API used by a closed source binary, for instance. In this post I'll explain how it works, but first, a demo:

Demo of protodump

How does it work?

To understand how it works, let's take a look at a small test.proto example:

syntax = "proto3";

option go_package = "./;helloworld";

message HelloWorld {
  string name = 1;
}

If we compile this with protoc to golang we'll get some golang code that defines the object type, creates getters and setters for the name field, etc. We can use it as follows:

func main() {
	obj := helloworld.HelloWorld{
		Name: "myname",
	}

	fmt.Printf("%s\n", obj.GetName())
}

However, protobuf also supports runtime reflection. Rather than invoking the getter method at compile time, we can fetch the list of fields and query them at runtime:

func main() {
	obj := helloworld.HelloWorld{
		Name: "myname",
	}

	// Walk the message descriptor instead of calling the generated getter.
	fields := obj.ProtoReflect().Descriptor().Fields()
	for i := 0; i < fields.Len(); i++ {
		field := fields.Get(i)
		value := obj.ProtoReflect().Get(field).String()
		fmt.Printf("Field %d has value '%v'\n", i, value)
	}
}

$ go run main.go
Field 0 has value 'myname'

How can the generated golang code know the field names and types at runtime like this? The protoc compiler stores a full copy of the protobuf definition in the generated output code. In the full protoc output for our HelloWorld message type, lines 72-78 store this protobuf definition:

var file_test_proto_rawDesc = []byte{
	0x0a, 0x0a, 0x74, 0x65, 0x73, 0x74, 0x2e, 0x70, 0x72, 0x6f, 0x74, 0x6f, 0x22, 0x20, 0x0a, 0x0a,
	0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x57, 0x6f, 0x72, 0x6c, 0x64, 0x12, 0x12, 0x0a, 0x04, 0x6e, 0x61,
	0x6d, 0x65, 0x18, 0x01, 0x20, 0x01, 0x28, 0x09, 0x52, 0x04, 0x6e, 0x61, 0x6d, 0x65, 0x42, 0x0f,
	0x5a, 0x0d, 0x2e, 0x2f, 0x3b, 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x62,
	0x06, 0x70, 0x72, 0x6f, 0x74, 0x6f, 0x33,
}

This byte array stores the field names and types, messages, services, enums, options, etc. It's a little meta because the format of this object is itself a protobuf object, called a FileDescriptor, and is encoded into a byte array using the protobuf wire format.

With this knowledge in hand, the strategy for extracting protobuf definitions from binaries becomes the following:

  • Iterate over the contents of a program binary
  • Find sequences of bytes that look like they might be FileDescriptors, such as the example above
  • Extract those bytes and decode them into ".proto" source definitions

Finding bytes that look like FileDescriptors

To find FileDescriptors I take the naive approach of simply searching the program binary for the ASCII string ".proto". The FileDescriptor object has a field for the file name of the proto file it was compiled from, so if engineers are naming their files with a ".proto" extension then it'll be present in the output.
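The search step itself can be sketched with the standard library alone. This is a minimal illustration rather than protodump's actual code: findAll is a hypothetical helper, and the byte string in main is a made-up stand-in for a program binary.

```go
package main

import (
	"bytes"
	"fmt"
)

// findAll returns the offset of every occurrence of needle in data.
// (Hypothetical helper for illustration; not taken from protodump.)
func findAll(data, needle []byte) []int {
	var offsets []int
	for start := 0; ; {
		i := bytes.Index(data[start:], needle)
		if i < 0 {
			return offsets
		}
		offsets = append(offsets, start+i)
		start += i + 1 // resume one byte past this hit
	}
}

func main() {
	// Made-up stand-in for a binary: padding, two file names, more padding.
	data := []byte("\x00\x00\x0a\x0atest.proto\x22\x20\x00other.proto\x00")
	for _, off := range findAll(data, []byte(".proto")) {
		fmt.Printf("found .proto at offset %d\n", off)
	}
}
```

Each offset is only a candidate: the string can just as easily appear in a log message or a path, which is why the later steps validate the surrounding bytes.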

We can imagine a program binary as a sequence of bytes laid out as follows:

Program memory

So when we find a ".proto" string, to capture the entire FileDescriptor (the entire purple section) we need to first move backward to the start of the object and then read until the end.

To determine how far back to read, it's helpful to understand the protobuf wire format. Protobuf makes heavy use of variable-length integers ("varints"), which allow encoding unsigned 64-bit integers using anywhere between 1 and 10 bytes (in little-endian order), with smaller integers using fewer bytes. When such a varint is encountered, if the most significant bit of a byte is set then this indicates that the following byte is also part of the varint:

# Value is 8:
  00001000
# ^ MSB is not set, end of varint

# Value is 150:
  10010110 00000001
# ^ MSB is set, varint continues to next byte
#          ^ MSB is not set, end of varint
# Calculate 150:
# 10010110 00000001       // Original inputs
# 0010110  0000001        // Drop continuation bits
# 0000001  0010110        // Convert to big-endian
# 00000010010110          // Concatenate
# 128 + 16 + 4 + 2 = 150  // Interpret as an unsigned 64-bit integer
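The same decoding can be written as a short loop. The sketch below is illustrative only: decodeVarint is a hypothetical name, and it does not reject a 10th byte that would overflow 64 bits. The standard library's binary.Uvarint does the same job and is printed alongside for comparison.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeVarint reads one varint from the start of buf. It returns the value
// and the number of bytes read, or n == 0 if buf is truncated or the varint
// runs past 10 bytes. (Hypothetical helper for illustration.)
func decodeVarint(buf []byte) (value uint64, n int) {
	for i, b := range buf {
		if i == 10 {
			return 0, 0 // a 64-bit varint never needs more than 10 bytes
		}
		// Drop the continuation bit and place the 7 payload bits,
		// least-significant group first.
		value |= uint64(b&0x7f) << (7 * uint(i))
		if b&0x80 == 0 { // MSB clear: this was the last byte
			return value, i + 1
		}
	}
	return 0, 0 // ran out of input before the varint ended
}

func main() {
	for _, in := range [][]byte{{0x08}, {0x96, 0x01}} {
		v, n := decodeVarint(in)
		std, _ := binary.Uvarint(in) // standard library, for comparison
		fmt.Printf("% x -> %d (%d bytes), encoding/binary says %d\n", in, v, n, std)
	}
}
```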

Protobuf messages are encoded using a "Tag-Length-Value" scheme, where a message with some fields is encoded as the following structure, repeated:

  • A varint for the index and type of the field (the "tag")
    • This is defined as the field number of a field within a message, bit-shifted left by 3 bits and OR-ed with the wire type. Protobuf defines 6 wire types, with length-delimited types such as strings having value 2
  • A varint for the byte-length of the payload (present for length-delimited fields such as strings and nested messages)
  • The payload itself

and this gets repeated for every field in the message. Using the byte array from the HelloWorld example above, we have the following structure:

Annotated file descriptor
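The same structure can be recovered in code by walking the HelloWorld byte array one tag at a time. This is a rough sketch with hypothetical names (field, topLevelFields); it handles only wire type 2, which is all this particular descriptor uses at the top level, and it does not descend into nested messages.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// The raw descriptor from the HelloWorld example above.
var rawDesc = []byte{
	0x0a, 0x0a, 0x74, 0x65, 0x73, 0x74, 0x2e, 0x70, 0x72, 0x6f, 0x74, 0x6f, 0x22, 0x20, 0x0a, 0x0a,
	0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x57, 0x6f, 0x72, 0x6c, 0x64, 0x12, 0x12, 0x0a, 0x04, 0x6e, 0x61,
	0x6d, 0x65, 0x18, 0x01, 0x20, 0x01, 0x28, 0x09, 0x52, 0x04, 0x6e, 0x61, 0x6d, 0x65, 0x42, 0x0f,
	0x5a, 0x0d, 0x2e, 0x2f, 0x3b, 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x62,
	0x06, 0x70, 0x72, 0x6f, 0x74, 0x6f, 0x33,
}

// field is one decoded tag-length-value entry (hypothetical type).
type field struct {
	Number   uint64
	WireType uint64
	Payload  []byte
}

// topLevelFields splits buf into its top-level fields without descending
// into nested messages. Only wire type 2 (length-delimited) is handled.
func topLevelFields(buf []byte) ([]field, error) {
	var out []field
	for len(buf) > 0 {
		tag, n := binary.Uvarint(buf)
		if n <= 0 {
			return nil, fmt.Errorf("bad tag varint")
		}
		buf = buf[n:]
		num, wireType := tag>>3, tag&7 // tag = field number << 3 | wire type
		if wireType != 2 {
			return nil, fmt.Errorf("unhandled wire type %d", wireType)
		}
		length, n := binary.Uvarint(buf)
		if n <= 0 || length > uint64(len(buf)-n) {
			return nil, fmt.Errorf("bad length for field %d", num)
		}
		buf = buf[n:]
		out = append(out, field{num, wireType, buf[:length]})
		buf = buf[length:]
	}
	return out, nil
}

func main() {
	fields, err := topLevelFields(rawDesc)
	if err != nil {
		panic(err)
	}
	for _, f := range fields {
		fmt.Printf("field %d, wire type %d, %d-byte payload: %q\n",
			f.Number, f.WireType, len(f.Payload), f.Payload)
	}
}
```

In descriptor.proto, field 1 is the file name, field 4 is a message type, field 8 is the file options, and field 12 is the syntax, which matches the four entries this prints for the example.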

So the search strategy is:

  • Loop over program memory looking for the ASCII string ".proto". When we find one:
    • Assume that this is the start of an encoded file descriptor object. Move back to the previous 0x0a byte (the tag for the file name field)
    • If the file name is exactly 10 bytes long, move back 1 byte further (in that case the 0x0a byte we found is actually the string length and not the tag)
    • Now that we're at the start of the FileDescriptor object, keep consuming bytes as long as they're a valid protobuf wire encoding
    • Take all the bytes we've consumed and attempt to unmarshal them into a FileDescriptor object
      • If successful, convert the FileDescriptor object to a source ".proto" file and output it
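The scanning part of that strategy might look roughly like the sketch below. It is not protodump's actual code: consumeField and candidates are hypothetical names, the 10-byte check is one possible reading of the rule above, group wire types simply end the scan, and the final unmarshal step (which needs the protobuf library) is left out. The input in main is a made-up 20-byte descriptor surrounded by padding.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// consumeField returns the encoded size of the wire-format field at the
// start of buf, or 0 if the bytes are not a valid field.
func consumeField(buf []byte) int {
	tag, n := binary.Uvarint(buf)
	if n <= 0 || tag>>3 == 0 { // field number 0 is never valid
		return 0
	}
	rest := buf[n:]
	switch tag & 7 {
	case 0: // varint
		if _, m := binary.Uvarint(rest); m > 0 {
			return n + m
		}
	case 1: // fixed 64-bit
		if len(rest) >= 8 {
			return n + 8
		}
	case 2: // length-delimited
		length, m := binary.Uvarint(rest)
		if m > 0 && length <= uint64(len(rest)-m) {
			return n + m + int(length)
		}
	case 5: // fixed 32-bit
		if len(rest) >= 4 {
			return n + 4
		}
	}
	return 0 // groups (3, 4) and unknown wire types stop the scan here
}

// candidates returns byte ranges that might hold an encoded file descriptor.
func candidates(data []byte) [][]byte {
	var out [][]byte
	needle := []byte(".proto")
	for pos := 0; ; {
		i := bytes.Index(data[pos:], needle)
		if i < 0 {
			return out
		}
		hit := pos + i
		pos = hit + 1
		// Move back to the previous 0x0a byte (tag of the file name field).
		start := bytes.LastIndexByte(data[:hit], 0x0a)
		if start < 0 {
			continue
		}
		// A 10-byte file name has length byte 0x0a, so the 0x0a we found
		// may be the length; the real tag is then one byte earlier.
		if start > 0 && data[start-1] == 0x0a && hit+len(needle)-(start+1) == 10 {
			start--
		}
		// Keep consuming fields while they are valid wire encoding.
		end := start
		for n := consumeField(data[end:]); n > 0; n = consumeField(data[end:]) {
			end += n
		}
		if end > hit {
			out = append(out, data[start:end])
		}
	}
}

func main() {
	// Padding, then file name "test.proto" (field 1) and syntax (field 12).
	data := []byte("\x7fELF\x00\x00\x0a\x0atest.proto\x62\x06proto3\x00\x00")
	for _, c := range candidates(data) {
		fmt.Printf("candidate: %d bytes, % x\n", len(c), c)
	}
}
```

Trailing bytes in a real binary can parse as valid fields by accident, so a candidate is only trusted once it unmarshals cleanly into a FileDescriptor.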

To convert the FileDescriptor object to a source ".proto" file, I couldn't find any existing code in the protoc compiler to do that, so I wrote my own implementation.

Finally, for unit testing, I wrote a small harness that takes proto files as input, executes the protoc compiler on them, takes that FileDescriptor output and reserializes it as proto, and checks that the input proto and output proto are byte-for-byte identical.

Shortcomings

There are a number of limitations to this approach. First and foremost, everything written above is specific to Google's protoc compiler; it doesn't apply to the more general protobuf specification. If someone uses a non-protoc compiler, it may have a completely different mechanism for implementing reflection.

Even when using protoc:

  • People can name their files with an extension other than ".proto"
  • They can obfuscate the file descriptor in program memory
  • Protobuf explicitly does not guarantee field ordering on the wire format, so moving the file name field to a different location other than the start of the FileDescriptor would break the scanning

Additionally, many protobuf compilers offer the option to suppress this embedding entirely (at the cost of losing runtime reflection capabilities).

Despite all these shortcomings, I've found that 99% of the binaries I examine use protoc and don't have any obfuscation, and all their protobuf definitions are extracted in full.

P.S. If you enjoy this kind of content feel free to follow me on Twitter: @arkadiyt


