• unlimited access with print and download
    $ 37 00
  • read full document, no print or download, expires after 72 hours
    $ 4 99
More info
Unlimited access including download and printing, plus availability for reading and annotating in your in your Udini library.
  • Access to this article in your Udini library for 72 hours from purchase.
  • The article will not be available for download or print.
  • Upgrade to the full version of this document at a reduced price.
  • Your trial access payment is credited when purchasing the full version.
Buy
Continue searching

Reverse engineering of data structures from binary

ProQuest Dissertations and Theses, 2011
Dissertation
Author: Zhiqiang Lin

v TABLE OF CONTENTS Page LIST OF TABLES ................................. ix LIST OF FIGURES ................................ x ABSTRACT .................................... xii 1 INTRODUCTION ............................... 1 1.1 Dissertation Statement .......................... 1 1.2 Why Data Structure Reverse Engineering is Important .......... 2 1.3 Why Data Structure Reverse Engineering is Challenging ........ 5 1.4 Contributions .............................. 7 1.5 Scope of this Dissertation ........................ 8 1.6 Dissertation Overview .......................... 9 2 A DATA STRUCTURE REVERSE ENGINEERING FRAMEWORK .... 12 2.1 REWARDS ............................... 13 2.2 SigGraph ................................. 15 2.3 DIMSUM ................................ 18 3 REWARDS:AUTOMATICREVERSEENGINEERINGOF DATASTRUCTURE DEFINITIONS ................................. 21 3.1 Overview ................................. 21 3.1.1 Key Techniques ......................... 21 3.1.2 A Working Example ....................... 22 3.2 Detailed Design ............................. 24 3.2.1 Type Sinks ............................ 24 3.2.2 Online Type Propagation and Resolution Algorithm ....... 26 3.2.3 Off-line Type Resolution .................... 31 3.2.4 Typed Variable Abstraction ................... 31

vi Page 3.2.5 Constructing Hierarchical View of In-Memory Data ....... 32 3.3 Implementation ............................. 33 3.4 Evaluation ................................ 34 3.4.1 Effectiveness ........................... 35 3.4.2 Performance Overhead ..................... 38 3.5 Summary ................................. 39 4 SIGGRAPH:DISCOVERINGDATASTRUCTUREINSTANCES USINGGRAPH- BASED SIGNATURES ............................ 40 4.1 ProblemStatement ............................ 41 4.2 Data Structure Definition Extraction ................... 43 4.3 Signature Generation ........................... 44 4.4 Scanner Generation ........................... 52 4.5 Handling Practical Issues ........................ 54 4.6 Implementation ............................. 56 4.7 Evaluation ................................ 57 4.7.1 Signature Uniqueness ...................... 57 4.7.2 Signature Effectiveness ..................... 58 4.7.3 Multiple Signatures ....................... 65 4.7.4 Performance Overhead ..................... 66 4.8 Summary ................................. 67 5 DIMSUM:DISCOVERINGDATASTRUCTUREINSTANCES USINGPROBAL- ISTIC INFERENCE .............................. 68 5.1 DIMSUMOverview ........................... 68 5.1.1 Key Observation ......................... 68 5.1.2 Challenges ............................ 70 5.1.3 Overview ............................ 71 5.2 DIMSUMDesign ............................ 71 5.2.1 Practical Problems ........................ 73 5.2.2 Probabilistic Inference ...................... 74

vii Page 5.3 Generating Constraints .......................... 79 5.3.1 Primitive Constraints ...................... 79 5.3.2 Structural Constraints ...................... 82 5.3.3 Pointer Constraints ....................... 83 5.3.4 Same-Page Constraints ..................... 85 5.3.5 Semantic Constraints ...................... 86 5.3.6 Staged Constraints ........................ 86 5.4 Implementation ............................. 87 5.5 Evaluation ................................ 89 5.5.1 Experiment Setup ........................ 89 5.5.2 Effectiveness ........................... 91 5.5.3 Sensitivity on the Threshold ................... 98 5.5.4 Performance Overhead ..................... 98 5.6 Summary ................................. 99 6 APPLICATIONS ................................ 100 6.1 Memory Image Forensics ........................ 100 6.1.1 Typing Reachable Memory ................... 100 6.1.2 Typing Dead Memory ...................... 104 6.2 Vulnerability Fuzzing .......................... 105 6.3 Kernel Rootkit Detection ......................... 110 6.4 Kernel Version Inference ......................... 113 7 LIMITATION AND FUTURE WORK ..................... 116 7.1 REWARDS ............................... 116 7.2 SigGraph ................................. 117 7.3 DIMSUM ................................ 120 8 RELATED WORK ............................... 122 8.1 Type Inference .............................. 122 8.2 Variable Recovery ............................ 123

viii Page 8.3 ProgramUnderstanding ......................... 125 8.4 Malware Signature Derivation ...................... 125 8.5 Protocol Reverse Engineering ...................... 126 8.6 Vulnerability Discovery ......................... 127 8.7 Kernel Rootkit Detection ......................... 127 8.8 Kernel Version Inference ......................... 128 8.9 Memory Forensics ............................ 129 9 CONCLUSION ................................ 131 LIST OF REFERENCES ............................. 134 VITA ........................................ 145

ix LIST OF TABLES Table Page 3.1 An example of running the online algorithm.Variable g 1 is a global, l 1 and l 2 are locals. .................................. 30 3.2 An example of running the off-line type resolution proce dure.The execution before timestamp 12 is the same as Table 3.1.Method N reuses l 1 and l 2 .. 30 4.1 Experimental results of signature uniqueness test .............. 57 4.2 Detail statistics on our static signatures ................... 58 4.3 Summary of data structure signatures for Linux kernel 2. 6.18-1 ....... 61 4.4 Experimental results of SigGraph signatures and value i nvariant-based signa- tures ..................................... 62 5.1 Predicate definitions used throughout the paper ............... 73 5.2 Boolean constraints with probabilities .................... 76 5.3 Summary on discovering data instances of interest for us er applications in Linux.Note the two ∗ false positives can easily be pruned by looking at the absolute value of the probability. ....................... 92 6.1 Result on the unreachable memory type using type fun 0x804a708 ..... 106 6.2 Number of vulnerability suspects reported with help of RE WARDS ..... 108 6.3 Result fromour vulnerability fuzzer with help of REWARDS ........ 109 6.4 Experimental result on kernel rootkit detection ............... 110 6.5 Detailed field offsets of task struct for kernel version inference .... 114 8.1 Capability comparison with existing techniques ............... 130

x LIST OF FIGURES Figure Page 1.1 A framework for reverse engineering of data structure fr ombinary ..... 7 2.1 An overview of our data structure reverse engineering fr amework ...... 12 3.1 An example showing how REWARDS works ................ 23 3.2 Evaluation results for REWARDS accuracy and efficiency ......... 36 4.1 A working example of kernel data structures and a graph-b ased data structure signature.The triangles indicate recursive definitions ............ 40 4.2 SigGraph systemoverview .......................... 43 4.3 Examples illustrating the signature isomorphismprobl em .......... 45 4.4 If the offset of field e1 (of type struct G ) in E is changed to 16, struct A will have two possible signatures (detailed data structure definitions in Fig- ure 4.3) ................................... 50 4.5 The generated scanner for struct A ’s signature in Equation (4.9) .... 53 4.6 False positive analysis of vm area struct ................ 63 4.7 False positive analysis of dentry and sysfs dirent .......... 64 4.8 Memory scanning performance ....................... 66 5.1 Death-span of free frames froma terminated Firefox proc ess ........ 69 5.2 Overview of DIMSUM ........................... 71 5.3 Data structure definition of our working example .............. 72 5.4 Factor graph example ............................ 78 5.5 Float point representation .......................... 81 5.6 Common high bits in a time data structure .................. 81 5.7 The factor graph enhanced with a pointer constraint.Cons traints C 3 and C 6 are elided for readability.The modified part is highlighted .Constraints and variables local to a page are boxed. ..................... 85 5.8 Sample code on using Infer.NET to model C 1 - C 4 in our working example and compute p ( x 1 ) . ................................ 87

xi Figure Page 5.9 Factor graph of the example code from Infer.NET.Note eac h variable is de- noted as a circle and each factor or constraint as a square.If a variable par- ticipates in the factor or the constraint,then an edge is sho wn between the corresponding circle and square. ....................... 88 5.10 Effectiveness evaluation of DIMSUM for discovering us er login record data structure utmp . ............................... 95 5.11 An abstraction of the utmp case.The node in the middle is missing. .... 96 5.12 The threshold impact on the experimental result ............... 97 5.13 Comparison of the normalized execution time ................ 98 6.1 Part of a memory dump fromnull-httpd.String and IP addre ss are underscored.101 6.2 Comparison between the REWARDS-derived hierarchical view and source code definition ................................... 102 6.3 Memory dump for Slapper worm control program when exitin g the control interface ................................... 103 7.1 Profiling accesses to the fields of task struct .............. 118

xii ABSTRACT Lin,Zhiqiang Ph.D.,Purdue University,August 2011.Revers e Engineering of Data Struc- tures fromBinary.Major Professor:Dongyan Xu,Xiangyu Zhan g. Reversing engineering of data structures involves two aspec ts:(1) given an application binary,infers the data structure definitions;and (2) given a memory dump,infers the data structure instances.These two capabilities have a num ber of security and forensics applications that include vulnerability discovery,kerne l rootkit detection,and memory forensics. In this dissertation,we present an integrated framework fo r reverse engineering of data structures from binary.There are three key components in ou r framework:REWARDS, SigGraph and DIMSUM.REWARDS is a data structure definition rev erse engineering component that can automatically uncover both the syntax an d semantics of data structures. SigGraph and DIMSUM are two data structure instance reverse engineering components that can recognize the data structure instances in a memory d ump.In particular,SigGraph can systematically generate non-isomorphic signatures fo r data structures in an OS kernel and enable the brute force scanning of kernel memory to find th e data structure instances. SigGraph relies on memory mapping information,but DIMSUM, which leverages proba- bilistic inference techniques,can directly scan memory wi thout memory mapping informa- tion. We have developed a number of enabling techniques in our fram ework that include (1) bi-directional (i.e.,backward and forward) data flow analy sis,(2) signature graph gen- eration and comparison,and (3) belief propagation based pr obabilistic inference.We demonstrate how we integrate these techniques into our reve rse engineering framework in this dissertation.

xiii We have obtained the following preliminary experimental re sults.REWARDS achieved over 80% accuracy in revealing data structure definitions ac cessed during an execution. SigGraph recognized Linux kernel data structure instances with zero false negative and close-to-zero false positives,and had strong robustness i n the presence of malicious pointer manipulations.DIMSUM achieved over 20% accuracy improvem ent than previous non- probabilistic approaches without memory mapping informat ion.

1 1.INTRODUCTION 1.1 Dissertation Statement In computer science,a data structure is a particular way of s toring (e.g.,using array, tree,or graph) and accessing (e.g.,sequential,pre-order ,or depth-first) data in a computer so that it can be used efficiently [1].Typically,a data struc ture is composed of a number of fields,and each field has a specific type.The organization o f the data structure fields forms a layout (i.e.,the syntax of the data structure).The types,which tell the computer and the programmer information about the values and operati ons that a specific data type can handle,concern the semantics of the data structure.Almost all programs use data structures,namely,software development is essentially “ Algorithms + Data Structures = Programs” [2]. A data structure usually has two representations.One is the abstract representation, which is the definition of the data structure and this definiti on is determined by the pro- grammers and used during software development.The other is the concrete representation, which is the instance of the data structure and this instance is created at run-time during programexecution. A data structure is started from a programmer’s definitions a nd is eventually translated into a binary formwhen the software is compiled.In the rever se direction,“ can we reverse engineer the data structure definitions (i.e.,the abstract representation of data structure) from binary code? ” Also,the data structure is instantiated as data structure instances in memory at run-time.The instances are just the rawbits and by tes.Thus,“ can we recognize the specific data structure instances from raw memory images ? ” Both compiled code and raw memory images are in binary forms.T hat is why we call this process reverse engineering of data structures .In general,“Reverse engineering is the process of analyzing a subject system to create representat ions of the system at a higher

2 level of abstraction” [3].More specifically,we aim to reverse engineer the data structure definitions frombinary code ,and recognize data structure instances froma memory image . These two capabilities have many computer security and fore nsics applications. 1.2 Why Data Structure Reverse Engineering is Important Knowledge of data structure is valuable in many application s.During software devel- opment,compilers use data structure semantics to detect me aningless or probably invalid code [4].For example,a compiler can identify an expression like an integer divided by a string as invalid because,in the usual sense,one cannot div ide an integer by a string.Also, a compiler may use the static type of a value to optimize the st orage it needs as well as the operations on the value.For example,according to the IE EE specification for single- precision floating point numbers (the static types) [5],man y C compilers represent the float data type in 32 bits though theoretically it should be ( −∞ , + ∞ ) ,and uses floating-point- specific operations on those values. In the context of software debugging,programmers often nee d to knowboth the seman- tics and the syntax of the data structure to examine a specific memory cell.For example, if a programmer wants to examine a stack variable,he or she mu st know in advance the types (e.g.,integer,string,or pointer),and then use the s pecific format needed to display the value. In addition to traditional applications in software develo pment,data structure knowl- edge has a wide impact in computer security and forensics,su ch as in the following exam- ples. • Vulnerability Discovery – Knowledge about data structure layout is often used by attackers.For example,a buffer overflow attack relies on th e attacker knowing that a buffer is close to a function pointer or return address [6].S uch a data structure layout pattern can actually guide the vulnerability discovery.Fo r example,if a penetration tester or an attacker knows the layout of a stack frame or netw ork message,he or

3 she can reduce the fuzz [7–11] space and speed up the vulnerab ility discovery as demonstrated in Packet Vaccine [12] and ShieldGen [13]. • Exploit Generation – An exploit is a particular input that can trigger a vulnerabi l- ity [14–16].To compromise a remote machine,attackers ofte n construct exploits based on the program data structures,because what the attac ker can manipulate is always the input data of the program.For example,from the da ta structure syntax (e.g.,the size of a buffer),an attacker could directly know the exact distance between an exploitable buffer and a return address or a function poin ter,and thereby easily manipulate his input to hijack the control flow. • Protocol Format Reverse Engineering – Protocol format reverse engineering aims to reveal the format of incoming and outgoing network messag es [17,18].Such messages are usually composed of a number of program-defined data structures.If we can reverse engineer the data structures of a program,the n we can correlate the incoming and outgoing messages with the reverse engineered data types.A number of recent protocol reverse engineering techniques (e.g.,[ 19–22]) have followed such methodology. • Memory Forensics – Memory forensics is to identify,extract,and analyze meani ng- ful information from a piece of memory dump in a forensically sound manner [23– 25].Samples of such information are IP addresses to which th e application under investigation is talking and files that are being accessed.D ata structure definitions play a critical role in the extraction process.For instance ,without data structure information,it is challenging to decide if four consecutiv e bytes represent an IP address or are only a regular integer.As such,if there is a te chnique that can automatically extract the data structure with both the synt ax and semantics,then such a technique can directly help construct a meaningful viewof the in-memory data and benefit the memory forensics process. • Virtual Machine Introspection (VMI) – VMI is a technique that observes the state of an entire OS from the virtual machine level [26].There is o ften a semantic gap

4 between the guest OS and the host OS [27].In VMI,the layout an d semantics of the guest kernel data structure directly control the external i nterpretation of kernel events [28,29].For example,without knowing the data structure of a process descriptor,it is impossible to interpret the guest kernel memory with the r ight semantics at host side. • Kernel Rootkit Detection – A kernel rootkit is kernel level malware that hides important kernel objects such as process descriptors or ker nel modules.To hide the kernel objects,a rootkit attacker usually has knowledge of the corresponding kernel data structure definitions.For example,to hide a malicious process,the attacker could manipulate the previous and next pointer of the runnin g process list,and then hide it.As a result,a rootkit detector could use the signatu re of the corresponding data structure and scan the memory in order to recognize the h idden object [30–32]. • Malware Classification – To classify malware,anti-virus software primarily relies on the signatures in the malware code.As data structure is on e of the important aspects of a program,if the data structure has some unique pa tterns to a particular program,it is thus possible to use the data structures as mal ware signatures.In particular,as demonstrated in Laika [33],we can automatic ally derive the syntax of the data structure from malware code through machine lear ning techniques and use such syntax as malware signatures. • OS Kernel Fingerprinting – Similar to malware classification using data structures, we could also use the kernel data structures to fingerprint an OS kernel,which is particularly desirable in cloud computing.For example,a p ublic cloud computing platform usually hosts virtual machines (VMs) with various OS kernels.In order to perform VMI [26,28,34] on these guest VMs,a prerequisite is to know the specific version of a guest’s OS kernel [35–37].However,suc h information is not always available to the cloud provider.As will be shown in ou r dissertation,the unique signatures of kernel data structures can sometimes s erve as the fingerprint of a specific OS.

5 1.3 Why Data Structure Reverse Engineering is Challenging There are a number of new challenges in (1) reverse engineeri ng data structure def- initions from an application binary,and (2) recognizing da ta structure instances from a memory image. First,to reverse engineer the data structure syntax and sem antics,we have to first disassemble [38] and then analyze the binary code. • Static analysis to reveal the data structure is difficult because of the lack o f symbolic information.Also,alias analysis is particularly hard whi le it is essential to deciding data flow and hence variable semantics.For example,variabl e discovery [39] is a static analysis technique that recovers the syntactic char acteristics of variables,such as a variable’s offset in its activation record,size,and hi erarchical structure.This technique requires alias analysis and abstract interpreta tion at binary level,and it does not recover any semantic information about data struct ures. • Dynamic analysis is challenging as well because many variables are dynamical ly created and de-allocated at runtime,making it complicated to track and resolve the variable types based on memory locations,which may be re-us ed during runtime. The dynamic variable lifetime may also affect the coverage of data structures as some may be de-allocated before their types are resolved. Second,to recognize the data structure instances,the stat e-of-the-art solution is to de- rive the value-invariant data structure signatures,and us e them to scan memory.However, the following are new challenges. • The value-invariant is not always available .Many existing solutions rely on the field value invariant exhibited by a data structure (i.e.,a field with either a cons tant value or a value in a fixed range) as its signature [32,40–43]. Unfortunately,many kernel data structures cannot be covered by the value-invar iant scheme.For example, some data structures do not have fields with invariant values or value ranges espe-

6 cially for pointers.It is also possible that an invariant-v alue field may be corrupted, thereby making the corresponding data structure instance u n-recognizable. • Avoiding signature isomorphism when leveraging points-to relation. Comple- mentary to the value-invariant,we could explore the struct ural invariant (i.e.,the points-to shape of the data structure) as signatures.Howev er,it is possible that two distinct data structures may lead to isomorphic signatures that cannot be used to distinguish instances of the two data structures.Hence t here is a new challenge to identify the sufficient and necessary conditions to avoid signature isomorphism between data structures. Third,to scan memory and traverse the points-to relation be tween data structures,we have to resolve memory mapping [30,31],namely,given a virt ual address,we need to resolve its destination’s physical address.However,exis ting techniques to do so are only suitable for live data instances.As such,we have to recogni ze the dead data instances (i.e.,the data instances that has been deallocated or belon g to a dead process).The new challenges include the following in particular. • Absence of memory mapping information .Given a set of memory pages,very little information is available about which pages belong to which process,let alone the sequencing of the physical pages in the virtual address s pace of a process.Even if we can identify some pointers in a page,it is very hard to fo llow those pointers without the address mapping information. • Absence of type/symbolic information for dead memory .To map the raw bits and bytes of a memory page to meaningful data structure insta nces,type information is necessary.For example,if the content at a memory locatio n is 0,its type could be integer,floating point,or even pointer.If these bits and bytes belong to the live memory and the symbolic information is available,then they can be typed through the reference path (as in [30]).However,such information i s not available.

7 Binary Code REWARDS Reverse Engineering of Data Structure Definitions Data Structure Data Structure Definitions Definitions Data Structure Instances SigGraph Mappable Memory DIMSUM Un-mappable Memory Data Structure Instances Reverse Engineering of Data Structure Instances Fig.1.1.:A framework for reverse engineering of data struc ture frombinary 1.4 Contributions Our dissertation will address these challenges,develop ne wtechniques to automatically reverse engineer data structure definitions,and recognize data structure instances fromnot only live memory,but also dead memory. Our contributions can be summarized as follows. • We present a systematic framework for reverse engineering o f data structures froma binary.As illustrated in Figure 1.1,this framework includ es three key components: REWARDS 1 (reverse engineering data structure definitions),SigGrap h 2 ,and DIM- SUM 3 (reverse engineering data structure instances),which wil l provide a complete solution for data structure and data instance reverse engin eering. • REWARDS is a first-step,dynamic analysis based data structure reverse engineering technique.Not only can it reveal the syntax (i.e.,the layou t) of the data structure,but the semantics (i.e.,the meaningful use) of the data structu res as well. 1 REWARDS is the acronymfor Reverse Engineering Work for Auto matic Revelation of Data Structures. 2 SigGraph stands for Graph based Sig natures. 3 DIMSUMis the acronymfor Discovering InforMation with Sema ntics fromUn-mappable Memory.

8 • SigGraph proposes that points-to relations between data st ructures can be leveraged to generate graph-based structural invariant signatures.SigGraph is a technique that can systematically generate non-isomorphic signatures for data structures in an OS kernel and enable the brute-force scanning. • DIMSUMis the first technique that can uncover semantic data o f interest in memory pages without memory mapping information.It is a probabilistic inference -based approach,and is able to automatically build graphical mode ls based on boolean constraints generated fromthe data structure and the memor y page contents. • We have implemented our reverse engineering framework,and our experimental results show that this framework is highly efficient and prac tical.In particular, REWARDS achieves high accuracy in revealing the data structur es accessed during an execution;SigGraph can recognize Linux kernel data stru cture instances with zero false negative and close-to-zero false positives;and DIMSUMachieves higher effectiveness than previous non-probabilistic approache s without memory mapping information. 1.5 Scope of this Dissertation This dissertation presents a suite of new techniques for rev erse engineering of data structures froma binary,and there are a number of assumptio ns. • Architecture – We tested our techniques on X86 platform.There are some mod ifica- tions when applying our techniques to other platforms.For i nstance,when examining memory content,as X86 is little endian but PowerPC is big end ian,we have to accordingly adjust this difference when scanning memory. • Operating System – We also assume the operating system is UNIX/Linux.These are the testing operating system(OS) we used during our prot otype development. • Programming Languages – We assume the programis written in C/C++ (the system programming language).For other programs written in such a s Java (byte code),or

9 Python,or Perl,they are out of scope of this dissertation,a s they have different run- time mechanisms than that of the native binary code compiled fromC/C++. • Compilers – Correspondingly we assume the programs are compiled by stan dard compilers such as gcc .This is because different compilers will have slightly dif fer- ent ways in organizing data structure layout and passing the function parameters. • Binary Code – We also assume no obfuscation [44] against the binary code, and no layout randomization [45] against the data structure. • Memory Mapping – Our technique relies on pointer navigation.We assume the virtual address translation mechanism (e.g.,the page mapp ing) still exists for our SigGraph approach. • Unencryped Memory Pages – Though the input to our DIMSUMis a set of pages, fromwhich we can identify the data instances of interest wit h confidence,we assume such pages are not encrypted. 1.6 Dissertation Overview This dissertation presents a data reverse engineering fram ework that contains a number of new techniques (i.e.,REWARDS,SigGraph,and DIMSUM).Each technique has its own distinct goals,challenges,and input,but all are geare d towards uncovering the data structures and are naturally integrated into our framework .In particular,as illustrated in Figure 1.1,REWARDS lays the foundation of this framework as Si gGraph and DIMSUM both rely on the data structure definitions.SigGraph and DIM SUM complement each other as SigGraph focuses on mappable memory while DIMSUMfo cuses on unmappable memory. An outline of this dissertation is as follows. • Chapter 1 explains the need for our reverse engineering framework bas ed on a number of security and forensics applications,for which it would be highly useful.

Full document contains 163 pages