paint-brush
Sound Call Graph Construction for Java Object Deserialization: Seneca: Taint-Based Call Graphby@escholar

Sound Call Graph Construction for Java Object Deserialization: Seneca: Taint-Based Call Graph

tldt arrow

Too Long; Didn't Read

Object serialization and deserialization is widely used for storing and preserving objects in !les, memory, or database as well as for transporting them across machines, enabling remote interaction among processes and many more.
featured image - Sound Call Graph Construction for Java Object Deserialization: Seneca: Taint-Based Call Graph
EScholar: Electronic Academic Papers for Scholars HackerNoon profile picture

This paper is available on arxiv under CC 4.0 license.

Authors:

(1) JOANNA C. S. SANTOS, University of Notre Dame, USA;

(2) MEHDI MIRAKHORLI, University of Hawaii at Manoa, USA;

(3) ALI SHOKRI, Virginia Tech, USA.

Table of Links

3 Seneca: Taint-Based Call Graph Construction for Object Deserialization

To support serialization-related features, Seneca employs an on-the-"y iterative call graph construction technique [Grove et al. 1997], as depicted in Figure 4. It involves two major phases: 1 Iterating over a worklist of methods to create the initial call graph using an underlying pointer analysis method; 2 Refinement of the initial call graph by making a set of assumptions performed iteratively until a fixpoint is reached (i.e., when there are no more methods left in the worklist to be visited).

3.1 Phase 1: Initial Call Graph Construction

Seneca !rst takes as input a CSV file with method signatures for the program’s entrypoints , which are the methods that start the program’s execution (e.g., main()). The result of this step is a set of entrypoint methods < 2 ⇢ added to our worklist W. This worklist tracks the methods < under a context 2 that have to be traversed and analyzed, i.e., h<,2i 2 W, where a context 2 is an abstraction of the program’s state. Since the worklist W tracks methods within a context, the entrypoints methods added to W are assigned a global context, which we denote as ;. Hence, the worklist is initialized as:




These synthetic models are initially created without instructions. Their instructions are constructed during the call graph refinement phase (Phase 2). It is important to highlight that the calls to synthetic methods (models) are 1-callsite-sensitive [Sridharan et al. 2013]. We use this context-sensitiveness policy to account for the fact that one can use the same ObjectInputStream/ObjectOutputStream instance to read/write multiple objects. Thus, we want to disambiguate these paths in the call graph.



3.2 Phase 2: Call Graph Refinement

In this phase, we take as input the current call graph 68 which contains as nodes actual methods in the application and synthetic methods created by our approach in the previous phase.



Subsequently, we iterate over all non-static fields 5 from the class C and compute their points-to sets (see the foreach in line 10). If the concrete types allocated to the field contains callback methods, we add three instructions: (i) an instruction to get the instance field 5 from the object; (ii) a downcast to the field’s type; (iii) an invocation to the callback method from the field’s declaring class.


It is important to highlight the edge case scenario when the object being serialized is a java.util.Collection or a java.util.Map. In this case, Seneca tracks what objects were added to the collection in order to add invocations to their callback methods (if provided).




3.2.2 Taint-Based Object Deserialization Abstraction Starting from the deserialization points identified, Seneca computes the call graph on-the-"y by iteratively solving constraints over the


Table 1. Taint propagation rules employed by Seneca when building call graphs.




Therefore, Seneca initializes the following pointers as tainted:



Taint Propagation Rules:


As the method’s instructions are parsed, we employ the rules listed in Table 1 to compute the taint states of the program’s variables. As shown in Table 1, the rules for assignment instructions are as follows:



That is, the pointer for the left-hand side is tainted if the pointer for the right-hand side is also tainted (or the left-hand side itself was already previously tainted). This is the case for the rules Load-Static, Load-Instance, Store-Instance, Store-Static, Static-Call-Return, Return, Array-Load, Array-Store, and Checkcast.



It is worth to highlight that taint is never removed from a pointer. Although this will make the underlying call graph more imprecise, our goal is to soundly reason over all possible runtime paths.


–Side Effects to the Pointer Analysis Engine:


Method invocations and return instructions introduce side-effects to the static analysis engine state, labelled in Table 1 as Call-Side-Effect and Return-Side-Effect, respectively.


- Instance method invocations: When there is an instance method invocation >.6(...) and the object > is tainted, then Seneca computes the possible method targets for the call >.6(...) soundly. The dispatch is computed as described below:



As one can notice, this dispatch is similar to the one employed by Class Hierarchy Analysis (CHA). The main difference are in steps (4) and (5), where Seneca takes into account class visibility rules as well as whether the type is serializable.



— Context-sensitivity for Tainted Method Calls Our taint-based call graph construction algorithm is agnostic to the pointer analysis policy (e.g., 0-1-CFA). This means that a client analysis could choose to use a context insensitive analysis (e.g., 0-CFA). Since tainted pointers are likely to have a large points-to set because we use a sound analysis to compute all possibilities for method dispatches when the receiver object is tainted, we should avoid merging point-to-sets of these tainted variables. Otherwise, the resulting pointer analysis would be too imprecise to be used by downstream client analyses.


Therefore, we use 1-callsite-sensitivity for tainted method calls (even if we use an insensitive analysis for all the other pointers).


Demonstrative Example: Consider the code snippet in Listing 3. The class Main has a main method that reads an object from a file, whose path is provided as a program argument. This program contains other four classes (CacheManager, TaskExecutor, CommandTask, and Config). We demonstrate Seneca’s taint-based deserialization modeling considering that we selected 0-1- CFA as the main pointer analysis method.


Listing 3. Walk-through example to demonstrate Seneca’s approach



There are three method invocations on Main.main(): two invocations to the constructors () of FileInputStream and ObjectInputStream classes followed by a call to the readObject() method from the ObjectInputStream class. The invocation to ObjectInputStream.readObject() is replaced by Seneca with a model (synthetic) method that has the same signature, but it is initialized without any instructions. At this stage, the call graph for this program after traversing the main method looks like as shown in Listing 5. All these three call graph nodes discovered after parsing Main.main() are added to the worklist to be processed (i.e., , FileInputStream.(), ObjectInputStream.(), and ObjectInputStream.readObject().


Fig. 5. Initial call graph after parsing the Main.main() method in Listing 3


The instructions that are added to ObjectInputStream.readObject() relies on taint states to infer callback methods that might by invoked during deserialization. Thus, when re!ning a method model, Seneca considers that all serializable classes in the classpath could have its callbacks invoked. By using this strategy, there are two possible callbacks that can be invoked: one from Config and one from CacheManager. Hence, all of its instance !elds are marked as tainted per the taint introduction rules previous described (these are highlighted in red on Listing 3). Based on the taint propagation rules speci!ed on Listing 1, variables are then marked as tainted (these variables that are tainted due to propagation are highlighted in cyan on Listing 3).





[3] We point the reader to the work by Sridharan et al. [Sridharan et al. 2013] which provides a generic formulation for multiple points-to analysis policies.


[4] Visibility rules are thoroughly described in the language speci!cation https://docs.oracle.com/javase/specs/jvms/se7/html/ jvms-4.html#jvms-4.1-200-E.1