This paper is available on arxiv under CC 4.0 license.
Authors:
(1) JOANNA C. S. SANTOS, University of Notre Dame, USA;
(2) MEHDI MIRAKHORLI, University of Hawaii at Manoa, USA;
(3) ALI SHOKRI, Virginia Tech, USA.
Table of Links
- Introduction
- Background
- Seneca: Taint-Based Call Graph Construction for Object Deserialization
- Evaluation
- Results
- Related Work
- Conclusions, Acknowledgment, and References
4 Evaluation
In this section, we introduce our research questions and describe our experiment setup and design to answer those.
4.1 Research Questions
This paper addresses the following research questions:
RQ1 Soundness. Does Seneca handle object deserialization soundly?
RQ2 Precision. Does an increase in the call graph’s soundness incur a significant loss in its precision?
RQ3 Scalability. Does Seneca scale well for real software systems?
RQ4 Usefulness. Is Seneca useful for a client analysis focused on vulnerability detection?
4.2 Answering RQ1: Soundness
We aim to verify whether Seneca improves a call graph’s soundness with respect to deserialization callbacks and how it compares with existing approaches [Reif et al. 2019, 2018; Santos et al. 2021, 2020]. The soundness of a call graph construction algorithm corresponds to being able to create a call graph that incorporate all possible paths (nodes and edges) that can arise at runtime [Ali et al. 2019; Kummita et al. 2021]. In this work, we are specifically looking at improving a call graph’s soundness to cover possible invocations that arise during object serialization and deserialization. Therefore, we use two datasets to answer this first research question:
• Call Graph Assessment & Test Suite (CATS) [Eichberg 2020]: This dataset was released as part of recent empirical studies [Reif et al. 2019, 2018] to investigate the soundness of the call graphs computed by existing algorithms with respect to particular programming language constructs. The CATS test suite6 was derived by an extensive analysis of real Java projects to create test cases that are representative of common ways that projects use these language constructs (e.g., lambdas, reflection, serialization, etc.). The dataset includes 9 test cases for verifying the soundness of call graphs during serialization and deserialization of objects. Each test case is a Java program with annotations that indicate the expected target for a given method call. Table 2 provides an overview of the test cases available in the CATS test suit and what aspects they aim to probe. Hence, in this !rst experiment, we run Seneca using two pointer analysis configurations: 0-1-CFA, and 1-CFA. Then, we compare it against Salsa (0-CFA, 1-CFA), a state-of-the-art tool, as well as the same algorithms used in the empirical study by Reif et al. [Reif et al. 2019], namely Soot (CHA, RTA, VTA, and Spark), Wala (RTA, 0-CFA, 1-CFA, and 0-1-CFA), Doop (context-insensitive), and Opal (RTA).
— Metric: Likewise to the prior empirical study by Reif et al. [Reif et al. 2019, 2018], we compute the number of failed and passed test cases for each approach as a way to investigate the soundness of our approach.
• XCorpus dataset: Although the CATS dataset was carefully constructed to test call graph construction algorithms with respect to programming language features, the test cases are small programs (i.e., with few serializable classes). Therefore, to enhance our analysis, we used programs available on the XCorpus dataset [Dietrich et al. 2017b]. We chose this dataset because it has been widely used in prior related works [Fourtounis et al. 2018; Santos et al. 2021, 2020] and it was manually curated to be representative of real Java projects. From this dataset, we selected a total of 10 programs from the XCorpus dataset [Dietrich et al. 2017b] (listed in Table 3). We selected these projects because they match the following criteria: (i) they perform object serialization / deserialization; (ii) they contain serializable classes that provide custom implementation for callback methods; hence, they would be suitable to verify whether our approach can properly compute a call graph that uncover hidden paths via callback methods.
For each of these 10 projects, we created a set of test cases that exercised the serialization and deserialization of objects from the classes that contained custom callback methods. Each test case serializes an object into a !le, and then deserializes it back from this !le, as shown in Listing 4. The systematic process we followed to create these test cases were as follows. For each class in the XCorpus program that had a custom callback method, we created a “simple” test case. This “simple” test case returns a single instance from the class inside the method getObject(). We read the project’s documentation to initialize the object’s !elds correctly and avoid exceptions thrown by the class’ constructor. We also created “composite” test cases in which the class instance is wrapped into a collection, i.e., an ArrayList, a HashSet, a HashMap, or an array.
By following this systematic process, we create !ve test cases (1 “simple” test case, and 4 “composite” ones) for each class with a custom callback in an XCorpus project. The list of XCorpus programs and the number of test cases for each of them is shown in Table 3.
After creating these test cases, we execute them to extract their dynamic call graph (runtime call graph). We implemented a JVMTI (Java Virtual Machine Tool Interface) agent in C to compute these runtime call graphs. This implementation has an instrumentation agent that is attached to the program’s execution. It captures every method that is invoked in the program and its caller method.
Since we aim to investigate whether our taint-based call graph algorithm handle object deserialization soundly or would unsound assumptions be able to find vulnerabilities, we compare Seneca against Salsa [Santos et al. 2021, 2020], a state-of-the-art tool that computes call graphs for object deserialization based on downcasts within the program, which yields to less sound call graphs.
Metric: Similar to prior works [Ali et al. 2019; Ali and Lhoták 2012; Kummita et al. 2021; Li et al. 2014; Smaragdakis et al. 2015], we verify our approach’s soundness based on the number of edges in the runtime call graph that are missing in the static call graph. In our comparison, we differentiate application-to-application edges, application-to-library, and library-to-library edges. That means that we disregard missing edges due to: (a) class initializers (because methods are modeled by Wala using a synthetic method that invokes all class initializers at once), (b) native code (because it cannot be statically analyzed), (c) explicitly excluded classes (i.e., classes inside our list of exclusions !le that are removed from the static call graph), and (d) library-to-library edges (i.e., edges from a built-in Java class to another built-in language class).
4.3 Answering RQ2: Precision
Although soundness is a desirable property for static analysis, in practice, however, creating a sound analysis also implies a loss of precision. Due to the undecidability of program veri!cation, it is impossible to create an analysis that is both sound and precise [Rice 1953]. Therefore, a sound analysis is an over-approximation that may include spurious results (e.g., unrealistic paths).
While our approach aims to enhance an existing call graph construction algorithm to handle serialization-related callbacks soundly, we need to verify whether our approach introduces imprecision and to what extent. Imprecision in this work refers to adding nodes and edges that will not arise at the program’s runtime during object serialization and deserialization [Ali et al. 2019]. In other terms, the precision of a call graph means that it does not contain nodes and edges that will not arise at runtime during object deserialization and serialization.
To answer this question, we use our JVMTI agent to compute the runtime call graph for each program in the CATS test suite [Eichberg 2020] and our manually constructed Test Cases derived from the XCorpus dataset [Dietrich et al. 2017b]. Subsequently, we compute the number of edges in the static call graph that did not exist in the runtime call graph.
— Metric: We calculate the number of nodes and edges that appeared in Seneca’s call graph but did not appear on the dynamic call graph. Similar to prior works [Smaragdakis et al. 2015; Smaragdakis and Kastrinis 2018], when performing this calculation, we only consider applicationto-application edges and application-to-library edges as long as these edges do not include nodes that are a class initializer, a native code method, or a method from an explicitly excluded class.
4.4 Answering RQ3: Performance
— Metric: We measure (i) the running time to compute the call graphs when using our approach, and (ii) the extra added number of iterations over the worklist of the call graph construction algorithm. We run these analyses on a machine with a 2.9 GHz Intel Core i7 processor and 32 Gb of RAM memory.
4.5 Answering RQ4: Efficiency
One of the premises of this work is that a taint-based call graph construction enable the computation of sound call graphs with respect to (de)serialization, which can be useful for client analyses, such as vulnerability detection. In this question, we aim to verify whether Seneca can help a static analysis technique in finding potential vulnerable paths in the program.
To answer this question, we obtained 3 open-source projects with known disclosed deserialization vulnerabilities. We selected these projects because their exploits have been widely discussed by practitioners and are available on the YSoSerial GitHub repository [Froho# 2018]. That is, these projects have well-known “gadget chains” which were previously disclosed in vulnerability reports.
To answer this RQ, we used Seneca and Salsa to compute the call graphs of these projects (con!gured to use 0-1-CFA and 1-CFA). Subsequently, we use these call graphs to extract vulnerable paths which are paths from ObjectInputStream.readObject() to sinks, i.e., method invocations to security-sensitive operations.
To identify sinks, we manually curated a list of security-sensitive method signatures. To do so, we extracted the list of sink methods from a prior published work [Thomé et al. 2017]. Moreover, we parsed the manifest !le from the Juliet Test Suite [Software 2023]. This test suite is a dataset from NIST (National Institute of Standards and Technology) which has a collection of synthetic C/C++ and Java code samples with di#erent software weaknesses (CWEs). Their manifest !le indicates all the !les for a test case, the kind of weakness it contains, and its location in the code. Thus, we parsed the manifest to extract the lines that are "agged as vulnerable, !ltered out the lines that are not method invocations, grouped them by signature, and manually identi!ed the ones that are sinks. After performing these two complementary curation steps, we obtained a total of 101 methods signatures for sinks.
— Metric: We measured how many vulnerable paths each approach was able to identify.
[7] https://github.com/wala/WALA/blob/master/com.ibm.wala.core/src/main/resources/Java60RegressionExclusions.txt