Posts

Showing posts with the label decompilation

Boomerang 2

I was recently asked what I would do differently if I was writing a decompiler from scratch today. Having worked on decompilers for the past 9 years myself and worked with others who have worked on decompilers for longer still, I like to think we now know enough about it to recognise some of the mistakes we've made and identify some areas of improvement. The primary difficulty of decompilation is that most analysis requires interprocedural information. Most everyone recognises this need early on so there's an assumption that you need to keep the intermediate representation for the entire program in memory at the same time. For many years we've had a hell of a lot of memory to play with (4 gig will be standard on desktop PCs next year) and with 64 bit architectures and virtual memory you can swap to disk for as much memory as you need. So it would appear that structuring a decompiler like a compiler, with intermediate disk formats for each compilation module is unnecessa...

Manual Decompilation

Argh. It's 2006, and I still don't have a good decompiler. All is not lost. Thankfully, there are still interesting things to decompile that are both small and contain lots of stuff that makes decompilation easy (e.g., symbols, relocations). So, let's do it manually using some trustworthy old fashioned tools: a disassembler, a text editor and some string processing tools. Let's choose a target. I'm going to go with a linux kernel module because they are small, contain symbols and relocations and because there exist GPL ones that I won't get in trouble for reverse engineering publicly. Just choosing something at random from /lib/modules on my Ubuntu linux box I come across new_wlan_acl.ko from the madwifi-ng drivers. Right, now we need a disassembly. No problem. Just do objdump -d --adjust-vma=0x8000000 new_wlan_acl.ko > out.dis . That almost gives me the object as it would look mapped into the linux kernel. Slight problem though, none of the reloca...

Into The Unknown..

Image
Back when Mike and I started the Boomerang decompiler project, we had the choice to start from scratch or to use our work on the UQBT project as a base. There was pros and cons to both approaches, but in the end we decided that UQBT had quite a lot of value in it, so we went with that. Unfortunately, at the time Mike and I started working on a decompiler based on UQBT, the copyright was owned by a number of parties. We had to talk to lawyers from the University of Queensland, Sun Microsystems and the individual people who had worked on the codebase over the years. Initially the lawyers were of a single opinion: no way. The individual contributors, however, were of mixed opinions: what for? and which license? As it turned out, answering the second question also answered the first question and started to put the lawyers minds at ease. The goal of the Boomerang decompiler project was decided, way back in 2001, to be a guiding force in bringing about a market for decompilation techn...

I havn't forgotten you

Havn't been working on Boomerang much lately. I made a few changes to make the exports of DLL files appear as entrypoints in the GUI. Seems to work, will check it in soon. I also made some recent changes that handle gotos inside switches better. This means that magic like Duff's Device now decompile sensibly.

And then, some calls just don't return

A small number of library procedures are "called" but never actually return. Eventually I'd like to have a way to specify these procedures with anotations in the signature files, but for the moment they are hard coded in the frontend. That's acceptable for the moment as there is only five: _exit, exit, ExitProcess, abort and _assert . Thing is, what happens when you have a branch over one of these calls, as you often do. Typically you end up with phi statements that refer to the call or the statements before it because there's no way to show that these definitions are killed by the call but not defined by it. We could add this to the SSA form but a simpler solution is available. Whenever the decompiler determines that the destination of a call is one of these no return procedures then we simply drop the out edge of the containing basic block. Without an out edge the definitions cannot flow beyond the call. Using dataflow based type analysis and some of m...

MinGW's tricky prologue code

Continuing with my ongoing test program extract_kyra.exe from the scummvm tools I've been looking at the very first call in main . It would appear that this exe was compiled with stack checking runtime code enabled. That very first call is to a short little procedure that takes a single parameter in the eax register; the number of bytes to subtract from esp . Here's a disassembly of the procedure: push ecx mov ecx, esp add ecx, 8 loc_405336: cmp eax, 1000h jb short loc_40534D sub ecx, 1000h or dword ptr [ecx], 0 sub eax, 1000h jmp short loc_405336 loc_40534D: sub ecx, eax or dword ptr [ecx], 0 mov eax, esp mov esp, ecx mov ecx, [eax] mov eax, [eax+4] jmp eax It not only s...

Chronicle of a Decompilation

Image
I've been promising to do this for a while, so here goes. I have downloaded a small open source utility called extract_kyra which is part of the scummvm tools . It doesn't really matter what the utility does. In fact, I prefer not to know as it gives me an unfair advantage compared to, say, decompiling malware. What's important is that this tool is under the GPL, so it is permisible for me to decompile it. I will be writing a number of these posts that describe all the issues I've run into and the steps I've completed to produce the decompilation. I promise that I will not go and download the source code for this program until such time as I believe my decompilation is "complete". The first problem with decompiling this program is that Boomerang has failed to find main() . I can tell this because the load phase of the workflow does not list main as a found entrypoint, it only lists start . This is because this exe was compiled with MinGW and it...

Bi-directional Dataflow Type Inference

After reading this paper I've starting implementing a brand new type analysis in Boomerang. The first step is calculating reverse dominance frontiers. A regular dominance frontier is the set of nodes that have a predecessor that is dominated by the given node, but is not itself dominated. These are the nodes where phi statements are to be placed when transforming a program into SSA form. The reverse dominance frontier is much the same, except it is the successors that must be post -dominated. Boomerang already calculates immediate post-dominators for use by the code generation phase, but we've never before had a use for reverse dominator frontiers. The paper describes an extension to SSA form called static single information form (SSI) which introduces a new kind of assignment: the sigma function statement which are placed on the reverse dominator frontiers. The purpose of this new statement is to split definitions before uses on different code paths. I will be using...

Short Circuit Analysis

I've checked in some code that detects branches to branches and merges them if they meet some basic requirements. As such, Boomerang can now generate code like: if (a // x } else { // y } instead of generating a goto statement. I've checked in a couple of test programs that exercise this analysis. I havn't looked at how this analysis effects loops or more obscure control structures.. so there could well be bugs here.

Unfinished Assortments

I received an email of inspiration from Mike a week ago outlining how similar conjunctions and disjunctions in short circuited if statements are at the binary level. After looking at the problem myself I found it was a pretty simple problem. If one of the two out edges of a branch node is another branch node which contains only a branch statement, and the destination of that statement is shared with the first branch statement then you can merge the condition of the second branch into the condition of the first branch. Exactly what combination of fall/taken edges are shared by the two branches is what determines whether an || or a && should be used. This is a pretty easy transformation to do just before code generation (or just after decompilation) and I'm about half way through implementing it. Unfortunately I got sidetracked. My work, you know, the people who pay me, they have me doing - wait for it - Java development. I moaned about not just being a one trick pony ...

Order of Decompilation

Which order you decompile the procedures of a program in can determine how much information you have available to make good analysis. In Boomerang we've developed a system of depth first search of the call graph, which takes into account recursion, to have the most information about parameters available when needed. For example, if a program consists of two procedures, one called by the other, it makes the most sense to decompile the procedure which is the leaf in the call graph so that the procedure that calls it has the most information about what parameters it needs to pass to it. What happens if the way the leaf procedure takes a parameter that is a pointer to a compound type? By the parameter's usage the decompiler may be able to determine that it is a pointer, it might even be able to determine that it is a pointer to a compound, but unless every member in the compound is accessed in a way that restricts that member's type sufficiently, the type that the decompiler...

Binary Release Alpha 0.3

I've added a new binary release of Boomerang for win32 to the sourceforge project. The Linux binaries are also up. If you'd like to try making a release for some other platform, please let me know. What can the new release do? Well, it crashes less. It supports ELF .o files much better than previous releases. It includes some changes that make for better output in some instances. Overall, just general improvements. What can't the new release do? This is still an alpha release. That means we don't expect you to be able to do very much work with it. Running it on /bin/ls will still give you horrendous output, but try /bin/arch.

So Many Types and No (Good) Type Analysis

The type analysis in Boomerang remains a nice big mess. There have been three attempts at a type analysis system: dataflow based, constraint based and adhoc. At the moment the adhoc gives the best results, and the other two crash, a lot. Sometimes there is an absolute orgy of types in an input program, and the type analysis will assign the type void to a local. I've just added some more adhoc type analysis that will recognise when the programmer is assigning the elements of an object of unknown type to an object of known type and copy the known types for the elements to the unknown type. This is very specific but hopefully it occurs in more than just the one input program I was looking at. In C the programmer would have written something like this: struct foo *p = someFuncThatReturnsAFoo(); p->name = global.name; p->count = global.count; p->pos = global.pos; p->other = 0; If that call is to a library proc we will have the struct foo , and know that p is a pointe...

Conflation of Pointer and Array Types

A common source of confusion for new C programmers is the conflation of pointers and arrays that C does. I often think of the dynamic semantics of the language when I'm thinking deeply about passing arrays to functions. Typically, you can tell an experienced programmer that C always passes arrays by reference, never by value, and they won't go wrong. Not all languages are like this, so in Boomerang we try to represent pointers and arrays as seperate non-conflated types. In our type system an array is a type used to describe those bytes in memory that contain a finite number of objects of a particular base type. Similarly, a pointer is a type used to describe those bytes in memory that contain an address, which if followed will reveal a single object of a particular base type. As such, it is necessary to refer to somethings explicitly using the Boomerang type system that are typically implied by the C type system. For example, a C string is often written in the C type syste...

Reverse Strength Reduction

Strength reduction is a compiler optimisation that tries to replace expensive operations involving a loop variant with less expensive operations. The most common example of strength reduction is the replacement of array indexing with pointer arithmetic. Consider this program fragment: for (int i = 0; i < 25; i++) arr[i] = 9; where arr is an array of integers. An unoptimised compilation, in an RTL form, might look like this: 1 *32* r25 := 0 2 BRANCH 6 Highlevel: r25 >= 25 3 *32* m[a[arr] + r25 * 4] := 9 4 *32* r25 := r25 + 1 5 GOTO 2 which would actually be some very nice RTL for the decompiler to discover because it is easy to recognise that arr is an array with stride 4 and replace statement 3 with: 3 *32* a[r24] := 9 unfortunately, the compiler will have gotten to the RTL before we have, and most any compiler will do some strength reduction to get rid of that multiply in the middle of the left of statement 3. So what the decompiler will see is more like this: 1 *32* r...

Forcing a Signature

A while ago I added a bool to the Signature class that allowed the user to specify that the signature was already fully formed and did not need any processing. This was part of the "symbol file" hack that we used to do procedure-at-a-time decompilation using the command line tool. I noticed today that we were not honouring the forced bit anymore, for example, we were removing unused parameters and returns from the signature, so I fixed that. It occured to me that any procedure we discover via a function pointer was an excellent candidate for setting the forced bit on. The results were pretty spectacular as locals and globals were soon inheriting type information from the parameters of the signature. Unfortunately, the same could not be said of returns. In particular, consider some code like this: mystruct *proc5() 12 { *v** r24 } := malloc(343) 13 m[r24{12} + 4] := "foo" 14 RET r24{12} It's pretty clear that any local we create for r24 should be of type m...

Types, Globals and Varargs

I have a sample input program that has some code similar to this: 228 { *32* r24, *32* r28 } := CALL knownLibProc( .. arguments .. ) .. 307 *32* m[r24{228}] := 232 308 *32* m[r24{228} + 4] := 91 309 *32* m[r24{228} + 8] := "some string" where knownLibProc returns a pointer to a struct in r24. Early in the decompilation this type will be propogated into the statements in 307, 308 and 309 producing: 307 *32* m[r24{228}].size := 232 308 *32* m[r24{228}].id := 91 309 *32* m[r24{228}].name := "some string" our intermediate representation doesn't have an operator equivalent to C's -> operator, the above is more like writing (*p).size, but the backend takes care of that and will emit a -> instead. Unfortunately I was getting an assert fault before I even get to that. The problem was that the 228 instance of r24 was being assigned a local variable, and that local was not inheriting the return type of the call. So the adhoc type analysis would take a look a...

Debugging the Decompiler

One of the most useful features of the new GUI will be the ability to step through a decompilation and inspect the RTL at each step. To date I have implemented a Step button that allows the user to inspect a procedure before each major phase of the decompilation on that procedure. In the future, I intend to add more debugging points, perhaps even to the resolution of a single RTL change. I expect that some way for the user to specify the desired level of resolution will be required. Whether that is a bunch of menu options, or a spinner or even multiple Step buttons (step to next change, step to next analysis, step to next phase, step to next procedure, etc), I havn't decided. The UI already has a course form of breakpoints. At the decoding phase you can specify which procedures you want to inspect, and the decompilation will run without stopping until it gets to one of those procedures. It would be sensible to allow the user to set a breakpoint on a particular line of the RTL...

Displaying Pretty RTL

I did some fixes to the html output option in the various print() functions of Boomerang today. This is all so I can display RTL as pretty as possible. I'm thinking that hovering the mouse over a RefExp should highlight the line that is referenced by it. That's my first goal, and then I'll work on a context menu. All this is possible because I can hide pointers in the html output which refer back to the intermediate representation. Qt4 has the advantage that good html display widgets are standard parts of the toolkit. What I don't intend to do is to allow freeform manipulation of the RTL. That would require an RTL parser, which I'm simply not in the mood to write, at least this month.

Support for Linux Kernel Modules

Boomerang can now load Linux kernel modules and find the init_module and cleanup_module entrypoints. Loading ELF sections is a tricky business. I now honour the alignment information, which means kernel modules will load into Boomerang with the same section placement as they are loaded into IDA Pro. There was also some problems with the way we handle R_386_PC32 relocations. Checking the type of the symbol indicated by the relocation solves the problem. I also managed to speed it up significantly by removing an unnecessary check. Hopefully it really is unnecessary. My globals-at-decode-time code is now checked in. I await howls of disapproval on the mailing list, but hey, I do so every time I check in.