This paper is available on arxiv under CC 4.0 license.
Authors:
(1) JOANNA C. S. SANTOS, University of Notre Dame, USA;
(2) MEHDI MIRAKHORLI, University of Hawaii at Manoa, USA;
(3) ALI SHOKRI, Virginia Tech, USA.
5.1 Soundness This section describes the results of the experiments measuring the soundness of the call graphs computed by Seneca.
5.1.1 Dataset #1: CATS As shown in Table 4, only our serialization-aware call graph construction (Seneca) and Salsa passed all nine test cases. Only three other algorithms provided partial support for callback methods, namely SootRTA and SootCHA (2 out of 9) and OpalRTA (5 out of 9) [Reif et al. 2019]. The remaining algorithms, i.e., Soot (VTA and Spark), Wala (RTA, 0-CFA, 1-CFA, 0-1-CFA), and Doop (context-insensitive), provided no support at all for serialization-related callback methods.
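For reference, the callbacks at issue are methods that the JVM invokes implicitly during (de)serialization rather than through explicit call sites, which is precisely why many algorithms miss them. A minimal illustration (the class and fields are ours, not from the test suite):

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Minimal class exercising serialization callbacks: none of these
// methods is invoked explicitly anywhere in application code, so a
// call graph builder must model the implicit dispatch itself.
class Person implements Serializable {
    private static final long serialVersionUID = 1L;
    private String name;

    // Invoked by ObjectOutputStream during serialization.
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
    }

    // Invoked by ObjectInputStream during deserialization.
    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
    }

    // Invoked after readObject; may replace the deserialized instance.
    private Object readResolve() {
        return this;
    }
}
```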
It is also important to highlight that the frameworks that provided partial support for serialization-related features (SootRTA, SootCHA, and OpalRTA) use imprecise call graph construction algorithms (CHA [Dean et al. 1995] or RTA [Bacon and Sweeney 1996]). Table 5 compares the call graphs' sizes in terms of nodes and edges. As we can infer from this table, the call graph construction algorithms used by Soot and Opal that provided partial support for serialization create much larger call graphs (in terms of the number of nodes and edges). Since these algorithms rely only on static types when computing the possible targets of a method invocation, they introduce spurious nodes and edges, thereby inflating the call graph's size.
Our approach enhances the underlying pointer analysis policy to strike a balance between improving soundness and not greatly degrading the call graph's precision with spurious nodes and edges. A more recent approach, Salsa, also produced call graphs with reasonable sizes, very similar to ours. However, this is because the test cases in the CATS dataset are rather simple; each consists of up to two classes that exercise one custom callback method at a time. As we discuss in the next subsection, Salsa's ability to create sound call graphs is greatly diminished when applied to the source code of real software projects.
5.1.2 Dataset #2: XCorpus Dataset Figure 7 depicts the percentage of edges in each project's runtime call graph that are missing from the static call graph computed by each approach. From this chart, we notice that Seneca outperformed Wala and Salsa. Our approach has fewer missing edges than the other approaches, i.e., it is able to soundly infer hidden paths through serialization callbacks.
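Concretely, the plotted metric can be expressed as follows (the notation is ours): with $E_{\text{dyn}}$ the edge set of the runtime call graph and $E_{\text{static}}$ the edge set of the static call graph,

$$\mathit{missing} = \frac{|E_{\text{dyn}} \setminus E_{\text{static}}|}{|E_{\text{dyn}}|} \times 100\%$$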
When inspecting the edges that Seneca missed, we observed that they were unrelated to serialization callbacks. That is, these were edges for which the underlying pointer analysis algorithm cannot soundly infer the points-to sets of variables. For example, we observed edges that were missed because instructions used reflection to invoke methods. These are constructs for which the 0-1-CFA and 1-CFA pointer analyses provided by Wala (our baseline framework) could not correctly resolve the dispatch.
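The following sketch shows the kind of reflective dispatch we mean (simplified; the identifiers are ours): the invocation target depends on a string computed at runtime, so a k-CFA-style pointer analysis cannot determine which method is dispatched.

```java
import java.lang.reflect.Method;

class ReflectiveDispatch {
    // The callee of invoke() depends on a string that is only known at
    // runtime, so 0-1-CFA/1-CFA cannot resolve the dispatch target and
    // the corresponding call graph edge is missed.
    static void call(Object target, String methodName) throws Exception {
        Method m = target.getClass().getMethod(methodName);
        m.invoke(target);
    }
}
```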
One of the reasons why Salsa performed similarly to Seneca on the CATS test suite but poorly on the XCorpus dataset has to do with its inability to compute potential method dispatches from classes in the classpath. As described in the original work [Santos et al. 2021, 2020], Salsa relies on downcasts of objects to infer which object(s) are being deserialized. When downcasts are unavailable, it falls back to computing all possible dispatches, but limited to classes in the application scope. Our approach, on the other hand, follows Java's serialization specification and includes all classes in the classpath, irrespective of their scope (i.e., extension, primordial, or application).
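The difference matters at deserialization sites like the two below (a hedged illustration; the class names are ours). Salsa can narrow the allocated type only in the first case, whereas the serialization specification permits any serializable class on the classpath in both:

```java
import java.io.ObjectInputStream;

class Deserializers {
    // Case 1: a downcast follows readObject(), so a cast-based analysis
    // like Salsa's can infer the deserialized object is an Invoice.
    static Invoice withDowncast(ObjectInputStream in) throws Exception {
        return (Invoice) in.readObject();
    }

    // Case 2: no downcast. Salsa falls back to dispatches over
    // application-scope classes only; Seneca, following the
    // serialization specification, considers every serializable class
    // on the classpath (application, extension, and primordial scopes).
    static Object withoutDowncast(ObjectInputStream in) throws Exception {
        return in.readObject();
    }
}

class Invoice implements java.io.Serializable { /* payload fields */ }
```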
5.2 Precision This section describes the evaluation results for the precision of the call graphs computed by Seneca.
5.2.1 Dataset #1: CATS Figure 8 depicts the number of edges in the static call graph that were not present in the runtime call graph for the test cases in the CATS test suite [Reif et al. 2019]. As shown in this chart, Seneca provided full support for serialization callbacks (passing all test cases, see Table 4) while maintaining reasonably sized call graphs. The call graphs derived by Soot and OPAL, in contrast, were far more imprecise: while Opal and Soot had over 800 imprecise edges, Seneca had between 95 and 343 incorrect edges.
This comparison also shows that Salsa's performance was similar to Seneca's. As explained in the previous section, however, this similarity stems from the fact that the programs in the CATS test suite are small and do not include scenarios where Salsa's unsound assumptions fall short.
5.2.2 Dataset #2: XCorpus Dataset Figure 9 plots the percentage of edges that are in the static call graph of each approach but not in the runtime call graph. As observed in this chart, unsurprisingly, increasing the soundness of the call graph also increased the number of imprecise edges (i.e., edges that did not arise at runtime). This increase in imprecise edges is comparable to Salsa's.
When we inspected the imprecise edges, we noticed that they were related to serialization nodes, i.e., cases in which our call graph included all possible objects that can be serialized. Indeed, since our test cases serialized only one object at a time, all these edges are deemed incorrect. However, because the Java API allows the deserialization of arbitrary types (i.e., any serializable type available on the classpath), Seneca's extra edges could arise at runtime if an object being read used one of the other serializable classes (other than the one from the test case).
The observed differences in running time, however, do not hinder the overall scalability of the approach, which still finishes within seconds. Moreover, when further inspecting our algorithm's worklist, we noticed that Seneca incurs between 3 and 6 extra iterations compared to Wala's worklist. These extra iterations, along with the taint analysis, are the root cause of the extra running time Seneca needs to finish.
We implemented a client analysis that attempts to find vulnerable paths caused by untrusted object deserialization in a program. We then verified how well this client analysis could detect vulnerable paths by comparing its performance using the call graph computed by Salsa and the one generated by Seneca. The results of this experiment are shown in Table 6.
As shown in this table, Salsa's call graphs were not suitable for performing vulnerability detection. The key issue lies in the unsoundness of Salsa: it relies on type casts (downcasts) to infer what object is being deserialized from a stream. However, as explained in Section 2.2.1, untrusted object deserialization vulnerabilities are caused by an attacker's ability to craft arbitrary objects using any serializable class available in the classpath. Thus, even if the program performs a downcast on the serialized object, the exploit would be executed anyway, as the vulnerability arises during deserialization, not after it.
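A minimal sketch of why the downcast offers no protection (hypothetical class names, not an actual gadget chain): the readObject callback runs during deserialization itself, so any payload executes before the cast is ever checked.

```java
import java.io.*;

// Hypothetical malicious class available on the classpath: its
// callback runs side effects during deserialization.
class Gadget implements Serializable {
    private String cmd; // attacker-controlled field in the stream
    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        Runtime.getRuntime().exec(cmd); // executes before any downcast
    }
}

class Victim {
    static Expected read(InputStream untrusted) throws Exception {
        Object o = new ObjectInputStream(untrusted).readObject();
        // If the stream contained a Gadget, its payload has already run;
        // the ClassCastException below comes too late.
        return (Expected) o;
    }
}

class Expected implements Serializable { }
```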
Unlike Salsa, our approach was able to find vulnerable paths within our allocated budget (15 minutes and up to 15 call graph nodes in a path). The identified paths included the vulnerable paths from previously disclosed gadget chains documented in the YSoSerial repository of deserialization exploits [Frohoff 2018].
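Conceptually, such a client analysis can be approximated as a bounded path search over the call graph, from serialization-callback nodes to security-sensitive sinks. The sketch below is ours, not Seneca's actual implementation; the string-keyed graph representation is an illustrative assumption, and only the 15-node path budget comes from the experiment above.

```java
import java.util.*;

// Simplified client analysis: bounded DFS from serialization-callback
// nodes (sources) to security-sensitive methods (sinks), mirroring the
// 15-node path budget used in the experiment.
class VulnPathFinder {
    static final int MAX_PATH_LENGTH = 15;

    static List<List<String>> findPaths(Map<String, List<String>> callGraph,
                                        Set<String> sources, Set<String> sinks) {
        List<List<String>> vulnerable = new ArrayList<>();
        for (String source : sources) {
            Deque<String> path = new ArrayDeque<>();
            path.push(source);
            dfs(callGraph, sinks, path, new HashSet<>(Set.of(source)), vulnerable);
        }
        return vulnerable;
    }

    private static void dfs(Map<String, List<String>> cg, Set<String> sinks,
                            Deque<String> path, Set<String> visited,
                            List<List<String>> out) {
        String current = path.peek();
        if (sinks.contains(current)) {
            List<String> found = new ArrayList<>(path);
            Collections.reverse(found); // report the path source-to-sink
            out.add(found);
            return;
        }
        if (path.size() >= MAX_PATH_LENGTH) return; // respect the budget
        for (String callee : cg.getOrDefault(current, List.of())) {
            if (visited.add(callee)) {
                path.push(callee);
                dfs(cg, sinks, path, visited, out);
                path.pop();
                visited.remove(callee);
            }
        }
    }
}
```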