This paper is available on arxiv under CC 4.0 license.
Authors:
(1) JOANNA C. S. SANTOS, University of Notre Dame, USA;
(2) MEHDI MIRAKHORLI, University of Hawaii at Manoa, USA;
(3) ALI SHOKRI, Virginia Tech, USA.
Object serialization and deserialization is widely used for storing and preserving objects in !les, memory, or database as well as for transporting them across machines, enabling remote interaction among processes and many more. This mechanism relies on re"ection, a dynamic language that introduces serious challenges for static analyses. Current state-of-the-art call graph construction algorithms does not fully support object serialization/deserialization, i.e., they are unable to uncover the callback methods that are invoked when objects are serialized and deserialized. Since call graphs are a core data structure for multiple type of analysis (e.g., vulnerability detection), an appropriate analysis cannot be performed since the call graph does not capture hidden (vulnerable) paths that occur via callback methods. In this paper, we present Seneca, an approach for handling serialization with improved soundness in the context of call graph construction. Our approach relies on taint analysis and API modeling to construct sound call graphs. We evaluated our approach with respect to soundness, precision, performance, and usefulness in detecting untrusted object deserialization vulnerabilities. Our results show that Seneca can create sound call graphs with respect to serialization features. The resulting call graphs do not incur signi!cant overhead and was shown to be useful for performing identi!cation of vulnerable paths caused by untrusted object deserialization.
CCS Concepts: • Software and its engineering - Automated static analysis; Software verification and validation; General programming languages; • Social and professional topics - History of programming languages.
Additional Key Words and Phrases: object serialization, taint analysis, call graphs
ACM Reference Format:
Joanna C. S. Santos, Mehdi Mirakhorli, and Ali Shokri. 2024. Sound Call Graph Construction for Java Object Deserialization. Proc. ACM Program. Lang. 1, CONF, Article 1 (January 2024), 27 pages.
Static program analysis is a key component of today’s software analysis tools that bring automation into activities such as defect localization and/or !nding (e.g., [Dolby et al. 2007; Thaller et al. 2020]), vulnerability detection (e.g., [Jovanovic et al. 2006; Liu Ping et al. 2011]), information "ow analysis [Sridharan et al. 2011], code refactoring (e.g., [Khatchadourian et al. 2019]), code navigation (e.g., [Feldthaus et al. 2013]), code clone !nding (e.g., [Wyrich and Bogner 2019]), and optimization [Hines et al. 2005]. Such tools often perform multiple types of inter-procedural analysis, that leverage call graphs – data structures that indicate caller-callee relationships [Grove and Chambers 2001]. However, prior works have demonstrated that constructing a call graph for object-oriented programs is often non-trivial, expensive and/or non-feasible due to the usage of many dynamic programming language constructs. For instance, native calls, re"ection, and object serialization make it challenging to statically construct a sound call graph [Ali et al. 2019; Kummita et al. 2021; Reif et al. 2019, 2018; Smaragdakis et al. 2015; Sridharan et al. 2013].
These programming constructs are heavily used in contemporary software systems as they enable the developers to link/load new class libraries, methods, and objects and extend the programs’ functionalities [Landman et al. 2017; Reif et al. 2019]. Ignoring such constructs leads to unsound call graphs in which feasible runtime paths are missed, and call graphs cannot be used to infer the possible execution from the code [Reif et al. 2019, 2018; Sridharan et al. 2013]. To tackle this problem, previous works explored certain classes of language features, such as re"ection features [Bodden et al. 2011; Li et al. 2014, 2019; Smaragdakis et al. 2015], native (opaque) code [Smaragdakis et al. 2015], dynamic proxies [Fourtounis et al. 2018], and programs with Remote Method Invocation (RMI) [Sharp and Rountev 2006]. However, as demonstrated by Reif et al. [Reif et al. 2019, 2018], a powerful and frequently used programming construct that has been left out from the programming analysis techniques is serialization (and deserialization) of objects.
Object serialization is the process of converting (the state of) an object into an abstract representation (e.g., a byte stream or JSON, etc.). The reverse process of reconstructing objects from its abstract representation is called deserialization. This is a widely used mechanism for storing and preserving objects in !les, memory, or database as well as for transporting them across machines, enabling remote interaction among processes and many more. For example, the Android API provides a Bundle object which can be used for inter-process communication between apps as well as Android’s OS with an individual app via their serialization and deserialization [Arzt et al. 2014; Enck et al. 2014]. Moreover, object (de)serialization is also used to improve the system’s performance by saving objects for later retrieval, e.g., saving a trained machine learning model to be used later without the need to retrain the algorithm. Serializing an object has other advantages, such as being readable by applications in other languages. For instance, JavaScript running in a web browser can natively serialize and deserialize objects to and from JSON, therefore interact with other applications written non-JavaScript languages.
Although object serialization is widely used in many languages and commonly adopted by programmers, static analyzers do not fully cover analysis of programs with this construct yet [Reif et al. 2019, 2018]. This is particular important considering the spike of vulnerabilities related to untrusted object deserialization [Muñoz and Schneider 2018; Sayar et al. 2023; Schneider and Muñoz 2016] that cannot be automatically detected because call graphs are unsound. For example, Apache’s Log4j software library (versions 2.0-beta9 to 2.14.1) had an untrusted object deserialization vulnerability that allowed remote code execution. This was a critical vulnerability that affected several software systems.
As demonstrated by previous studies on the soundness of call graph construction approaches [Reif et al. 2019, 2018]— guaranteeing that all possible behaviors are modeled in a call graph — state-of-theart techniques do not support serialization-related operations. They fall shortly in having nodes and edges that represent callback methods that are invoked during the serialization or deserialization of objects. There are multiple reasons on why it is hard to handle this language construct:
— Serialization and deserialization uses several overridable callback method(s). These call back methods are invoked by the Java API using “non-trivial” re"ective calls that current techniques [Landman et al. 2017] for taming re"ection do not address. Therefore, the resulting call graph underapproximate the program’s behavior; they miss potential program paths through these call back methods.
— The invoked callbacks during deserialization methods depend on the received object, which is coming from an external stream. The values and internal !eld types are only known at runtime when the object deserialization occurs.
— The external stream may include objects whose types are not observed statically, i.e., they are available in the classpath (imported libraries, or Java built-in API) but were never actually used (instantiated) in the application scope. A typical static analysis would consider these types as unused.
Therefore, existing techniques on addressing re"ections has failed to address call-graph generation with the presence of object serialization/de-serialization [Reif et al. 2019]. As such, potential program "ows are disregarded in existing call graph construction algorithms. Since the call graph is a core data structure in performing many inter-procedural code analyses, the underlying client would suffer with the unsoundness. In use-cases such as detection of untrusted deserialization vulnerabilities, an appropriate analysis cannot be performed since the call graph does not capture hidden (vulnerable) paths that occur via callback methods. There are two algorithms that (partially) handle serialization constructs (i.e., CHA [Dean et al. 1995] and RTA [Bacon and Sweeney 1996]) but they are imprecise; they abstract program executions to consider more paths than those feasible in the program. Therefore, they introduce spurious nodes and edges, rendering large call graphs. Relying on such algorithms for downstream analyses (e.g., vulnerability detection) makes the analysis imprecise, resulting in a high amount of false positives.
A recent line of work [Santos et al. 2021, 2020], presented an approach (named Salsa) for providing support for serialization-related features. Although Salsa aids the static analyses of programs that uses Java’s serialization/deserialization API, it is not enough to find hidden (potentially) malicious paths in the program. Salsa relies on API modeling for abstracting the serialization/deserialization protocol which dictates callback methods control and data "ow. Specially, it relies on downcasts in the program to infer the callbacks invoked during deserialization. However, malicious objects often violate downcasts and are crafted in such way that it triggers the exploit during deserialization, i.e., the exploit executes before the downcast is performed [Dietrich et al. 2017a].
Therefore, we introduce in this paper Seneca, a novel approach that handles the challenge of constructing call graphs for programs that uses serialization features. Specially, we are focusing on improving the call graph’s soundness for Java programs with respect to serialization and deserialization callbacks without greatly affecting its precision. Seneca performs a novel taint-based call graph construction, which relies on the taint state of variables when computing possible dispatches for callback methods.
The contributions of this work are:
— a novel taint-based call graph construction algorithm to improve call graphs’ soundness with respect to deserialization callbacks. It is agnostic to the underlying pointer analysis method used to construct a call graph, and it is meant to complement them.
— an evaluation of the approach’s soundness, precision, and scalability. Our experiments demonstrated that our approach soundly handled all the six different callbacks that can be invoked during serialization or deserialization.
— a publicly available implementation of Seneca [1].
The rest of this paper is organized as follows: Section 2 describes the serialization and deserialization mechanism and the challenges in creating a call graph that is sound with respect to this feature. Section 3 explains our approach. Subsequently, Section 4 presents the evaluation of the approach whereas Section 5 presents the results. Section 6 contextualizes our approach within the state-of-the art. Section 7 concludes the paper and make final considerations.
[1] The scripts to reproduce the paper results are currently available on an anonymous repository SenecaRepo. The source code for Seneca will be released upon publication and submitted for artifact evaluation