Using SQL Hotspots in a Prioritization Heuristic for Detecting All Types of Web Application Vulnerabilities

|-
|}
...


== 3. Related Work ==


== 4. Methodology ==
{| class="wikitable" style="text-align: left; width: 100%;"
|+ Table 1. Results per Project
!
! WordPress
! WikkaWiki
|-
|Releases Analysed
|Nine
|Six
|-
| Security issue reports analyzed
| 97
| 61
|-
| Vulnerable files (over project's history)
| 26% (85 / 326)
| 29% (44 / 209)
|-
| Average number of hotspots (over project's history)
| 255
| 92
|-
| Average percent of files having at least one hotspot
| 14.2%
| 8.42%
|-
|colspan="3" style="background: #eeeeee" | '''Hypotheses† about files'''
|-
| '''H1.''' The more hotspots a file contains per line of code, the more likely it is that the file contains any web application vulnerability.
| True (Logistic Regression, p<0.05)
| True (Logistic Regression, p<0.05)
|-
| '''H2.''' The more hotspots a file contains, the more times that file was changed due to any kind of vulnerability (not just input validation vulnerabilities).
| True (Simple Linear Regression, p<0.0001, Adjusted R<sup>2</sup> = 0.4208)
| True (Simple Linear Regression, p<0.0001, Adjusted R<sup>2</sup> = 0.3802)
|-
|colspan="3" style="background: #eeeeee" | '''Hypotheses about issue reports'''
|-
| '''H3'''. Input validation vulnerabilities result in a higher average number of repository revisions than any other type of vulnerability*.
| True (MWW, p<0.05)
| True (MWW, p<0.05)
|-
|colspan="3" style="background: #eeeeee" | '''Hypotheses about prediction'''
|-
| '''H4.''' Hotspots can be used to predict files that will contain any type of web application vulnerability in the current release.
| True (Predictive Modeling, see Table 2)
| True (Predictive Modeling, see Table 3)
|-
| '''H5.''' The more hotspots a file contains, the more likely that file will be vulnerable in the next release.
| True (Positive Coefficient on Predictive Models)
| True (Positive Coefficient on Predictive Models)
|-
|colspan="3" style="background: #eeeeee" | '''Hypotheses comparing projects'''
|-
| '''H6.''' The average number of hotspots per file is more variable in WordPress than in WikkaWiki.
| colspan=2 | True (F-test, p<0.000001)
|-
| '''H7.''' WordPress suffered a higher proportion of input validation vulnerabilities than WikkaWiki.
| colspan=2 | True (Chi-Squared, p=0.0692)
|-
| '''H8.''' In WordPress, more of the lines of code that were changed due to security issues were hotspots.
| colspan=2 | True (Chi-Squared, p<0.00001)
|-
| colspan=3 style="border-style: solid; border-width: 0 1px 1px 0" | *This finding is consistent with the report from SANS (see Section 1) that indicates that the most popular types of web application attacks are input validation vulnerabilities.
&dagger;Please note that we use the term "hypothesis" in this table with respect to scientific hypotheses and not statistical hypotheses.
|}


We conducted two case studies to empirically investigate eight hypotheses related to hotspot source code locations and vulnerabilities reported in the systems' bug tracking systems.  We present these hypotheses, as well as their results, in Table 1. We explain the results further in Section 5.  Our hypotheses point to the research objective: to improve the prioritization of security fortification efforts by investigating whether SQL hotspots can serve as the basis for a heuristic that predicts all vulnerability types.  We also include lines of code in our analysis as a way of improving the accuracy and predictive power of our heuristic alongside SQL hotspots.  Specifically, we look at the relationship between hotspots and files (H1-H2), the amount of code change as related to the vulnerability type (H3), the predictive ability of hotspots for any vulnerability type (H4-H5), and the effect that collocating hotspots can have on the number and types of vulnerabilities in a given system (H6-H8).
The entire process of analyzing Trac reports and tracing these vulnerabilities to files was achieved using a script that we created. The script stored the following information as extracted from the project Trac website into the <code>tracs</code> dataset: the unique Trac report identification number, the date the report was created, the number of repository revisions due to the report, the number of files changed, and the number of lines of code changed due to the report.


The script determined the revision number using the date stored in the <code>tracs</code> dataset and, following the logic described in this section, stored the release number the issue report belonged to into the <code>tracs</code> dataset.  Issue reports can only refer to files that are in the repository at the time the report is submitted.  As such, before gathering either the hotspot information or the information about which files were vulnerable, we downloaded the repository revision corresponding to the release the report referred to.  For example, in WordPress, we know that the file <code>wp-includes/script-loader.php</code> exists in WordPress v2.1 but not in WordPress v2.0.  We also know that <code>wp-includes/script-loader.php</code> was changed due to the cross-site scripting vulnerability reported in WordPress issue number 3937<sup>19</sup>.  Thus, by downloading the repository revision associated with v2.1 (number 4782), we can include the <code>wp-includes/script-loader.php</code> file in our analysis and properly record the file as vulnerable due to issue 3937.
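The date-to-release mapping step can be sketched as follows. The release dates and the revision numbers for v2.0 and v2.2 in this sketch are illustrative placeholders; only the v2.1/revision 4782 pairing is stated in the text above.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical release history: (release date, release label, repository revision).
# Only the v2.1 / r4782 pair is taken from the text; the rest is illustrative.
RELEASES = [
    (date(2005, 12, 31), "2.0", 3348),
    (date(2007, 1, 22), "2.1", 4782),
    (date(2007, 5, 16), "2.2", 5480),
]

def release_for_report(report_date):
    """Return the (label, revision) of the newest release predating the report."""
    dates = [d for d, _, _ in RELEASES]
    i = bisect_right(dates, report_date) - 1
    if i < 0:
        raise ValueError("report predates the first analysed release")
    _, label, revision = RELEASES[i]
    return label, revision

# A report filed between the v2.1 and v2.2 releases maps to v2.1 / r4782.
print(release_for_report(date(2007, 3, 1)))  # → ('2.1', 4782)
```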


=== 4.4. Classifying Vulnerabilities ===


In WordPress, issues that developers marked as security were not always security problems.  For example, issue number 2041<sup>21</sup> reports a problem when updating to the latest version: the reporter's instance of WordPress was throwing an error when trying to display images.  The thread discussion eventually resolves the issue, and a patch is committed to the repository.  However, this issue is incorrectly categorized as a security problem by the developers because it concerns correctly updating the WordPress instance. We encountered nine reports like number 2041 that we determined were not security-related, and we excluded these reports from our WordPress analysis.  Since we manually identified security issues in WikkaWiki, we classified issues as we identified them.  As such, there were no non-security issue reports in WikkaWiki to exclude.
We manually edited the <code>tracs</code> dataset to add the CWE classifier to each issue report that was security-related. We were also interested in comparing the proportion of input validation vulnerabilities in each project as a part of our research hypotheses (H7), so we additionally added a variable to the <code>tracs</code> dataset that indicated a "yes/no" as to whether the Trac report in question was due to an input validation vulnerability.  CWE classifies several vulnerability types as input validation vulnerabilities<sup>22</sup>, and we followed this classification in our analysis.  We used the input validation vulnerability variable ''only'' for evaluating H7; we tested all other hypotheses and conducted the predictive modeling using the full dataset, irrespective of a reported vulnerability's classification as input validation or non-input validation.
=== 4.5. Detecting Changed Hotspots ===
We were also interested in comparing the amount of change due to problems with SQL hotspots in each project.  To measure the amount of this change, we calculated the proportion of lines changed in each project that contained SQL hotspots.  Using the procedure described in Section 4.3, we used our script to automatically examine each line of code that developers committed to fix security issues. We combined this technique with the script described in Section 4.2 to identify lines of code that developers committed to fix security issues that contained hotspots.  We call the total number of lines of code that developers changed due to security issues Y, and the total number of lines of code within that subset that are also hotspots X.  We calculated the proportion of lines of code changed due to security issues that were also hotspots as X divided by Y.
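The X/Y proportion can be sketched as below. The <code>looks_like_hotspot</code> predicate is a toy keyword match standing in for the actual hotspot-detection script, which is not reproduced here.

```python
def hotspot_change_proportion(changed_lines, is_hotspot):
    """changed_lines: lines committed to fix security issues (Y in the text);
    is_hotspot: predicate marking a line as a SQL hotspot (matches count toward X)."""
    y = len(changed_lines)
    x = sum(1 for line in changed_lines if is_hotspot(line))
    return x / y if y else 0.0

# Toy predicate: treat a line as a hotspot if it builds or runs a SQL query.
# The study's real detection logic lived in the authors' script.
def looks_like_hotspot(line):
    return "mysql_query" in line or "$wpdb->query" in line

changed = [
    '$wpdb->query("SELECT * FROM wp_posts");',
    'echo esc_html($title);',
    '$result = mysql_query($sql);',
    'return $result;',
]
print(hotspot_change_proportion(changed, looks_like_hotspot))  # → 0.5
```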
=== 4.6. Statistical Analysis ===
We used the R project for statistical computing to perform our statistical analysis of the data in this case study<sup>23</sup>. We used the statistical tests provided by R to determine whether any differences we observed in any two samples occurred by chance or were statistically significant. 
We used the '''Mann-Whitney-Wilcoxon (MWW)''' statistic to perform any population-based comparison between two independent samples, such as between vulnerable and neutral files, or between files that contain hotspots and files that do not.  The MWW test is a non-parametric determination of whether two independent samples of observations have equally large values.  We used a non-parametric statistical test because we cannot assume that the outcomes in our data set are normally distributed.  We also used the '''Chi-Squared Test''' to determine whether there was a statistically significant difference in the proportion of positive outcomes in two population groups, and the '''F Test''' to measure the difference in variance between two sample groups.
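The three tests can be sketched in Python with SciPy (the study itself used R); the sample sizes, distributions, and contingency counts below are synthetic and purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic per-file hotspot counts for two samples; the real study used
# counts extracted from WordPress and WikkaWiki.
vulnerable = rng.poisson(4, size=60)   # files later reported vulnerable
neutral = rng.poisson(2, size=60)      # files with no reported vulnerability

# Mann-Whitney-Wilcoxon: do the two samples have equally large values?
u, p_mww = stats.mannwhitneyu(vulnerable, neutral, alternative="two-sided")

# Chi-squared: compare the proportion of a positive outcome in two groups,
# e.g. input-validation vs. other vulnerabilities per project (H7).
table = np.array([[28, 69], [11, 50]])  # illustrative counts
chi2, p_chi, dof, expected = stats.chi2_contingency(table)

# F test for equality of variances (H6): ratio of sample variances,
# two-sided p-value from the F distribution.
f = vulnerable.var(ddof=1) / neutral.var(ddof=1)
dfn, dfd = len(vulnerable) - 1, len(neutral) - 1
p_f = 2 * min(stats.f.sf(f, dfn, dfd), stats.f.cdf(f, dfn, dfd))

print(f"MWW p={p_mww:.3g}, Chi2 p={p_chi:.3g}, F p={p_f:.3g}")
```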
=== 4.7. Predictive Modeling ===
We built logistic regression models to evaluate the number of hotspots as a predictor of whether or not a file contains any type of vulnerability in each project.  We considered many alternatives for our modeling technique, and compared the precision and recall across all releases for each of the models using Weka<sup>24</sup>. Based on these scores, logistic regression models consistently outperformed the other techniques available in Weka's modeling toolkit. Our model included a term for lines of code (LOC) because, intuitively, the larger a file is, the more likely a code change will occur in that file.  Indeed, the model performed better with the LOC term than without.
Our logistic regression model included only hotspots and lines of code as the independent variables.  Using Weka, we trained the model for each project using the information on vulnerable files from releases 1 to N, and then tested the model on release N+1.  We repeated this process for each of the 15 releases of WordPress and WikkaWiki that had vulnerability information available for training the model, for a total of eight comparisons in WordPress and five comparisons in WikkaWiki<sup>25</sup>.
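The train-on-releases-1..N, test-on-N+1 protocol can be sketched as follows, using scikit-learn rather than Weka; the per-file data (hotspot counts, LOC, vulnerability labels) is synthetic and generated only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(1)

def make_release(n_files):
    """Generate one synthetic release: rows of (hotspots, LOC) plus labels.
    Files with more hotspots are made more likely to be vulnerable."""
    hotspots = rng.poisson(1.5, size=n_files)
    loc = rng.integers(50, 2000, size=n_files)
    p = 1 / (1 + np.exp(-(0.8 * hotspots + 0.001 * loc - 3)))
    vulnerable = rng.random(n_files) < p
    return np.column_stack([hotspots, loc]), vulnerable

releases = [make_release(300) for _ in range(4)]

# Train on releases 1..N, test on release N+1, as in the paper's protocol.
for n in range(1, len(releases)):
    X_train = np.vstack([X for X, _ in releases[:n]])
    y_train = np.concatenate([y for _, y in releases[:n]])
    X_test, y_test = releases[n]
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    pred = model.predict(X_test)
    prec = precision_score(y_test, pred, zero_division=0)
    rec = recall_score(y_test, pred, zero_division=0)
    print(f"test release {n + 1}: precision={prec:.2f} recall={rec:.2f}")
```

Because the synthetic labels are generated with hotspots as a real signal, the fitted hotspot coefficient comes out positive, mirroring the finding reported for H5.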
To evaluate our model, as well as the ability of hotspots to predict whether a file will be vulnerable in the next release, we do not look at the precision and recall of the model by themselves, since these measures give us no idea of how "difficult" the prediction is to make.  A model that has a precision of 80% may seem imperfect, but it would be far more useful than a model trained on the same data set with a precision of 10%.  The same goes for the model's recall.  In light of this fact, we compared our model's precision and recall with a model that randomly assigned files as being vulnerable or neutral.  To do better than random, our model must have better precision and recall than this random guess<sup>[1]</sup>.  In many projects, the percentage of vulnerable files is far lower than the percentages in our projects<sup>[17]</sup>. We designed our random guess model to assign files as being neutral or vulnerable according to the vulnerability distribution discovered empirically in each project.  That is, we did not assign a "coin toss" guess of vulnerabilities, where the probability of being vulnerable is p=0.50.  Instead, since the percentage of vulnerable files for WordPress is 26.1%, we created a model which gave the probability of a file being vulnerable as p=0.261. Similarly, since the percentage of vulnerable files in WikkaWiki is 29%, we assigned our model a probability of p=0.29.  We ran the random guess for 10 trials.  The results reported in this paper are the best precision and recall the random guess achieved in our trials.
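The random-guess baseline can be sketched as below. The 26.1% rate comes from the WordPress figure quoted above; the file count and the tie-breaking choice of "best" trial (best precision, then recall) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_guess(n_files, p_vulnerable, y_true):
    """Label each file vulnerable with probability p, then score the guess."""
    guess = rng.random(n_files) < p_vulnerable
    tp = np.sum(guess & y_true)
    precision = tp / guess.sum() if guess.sum() else 0.0
    recall = tp / y_true.sum() if y_true.sum() else 0.0
    return precision, recall

# Illustrative ground truth with ~26.1% vulnerable files, matching the
# empirical WordPress rate; the baseline guesses at that same rate.
y_true = rng.random(500) < 0.261

# Ten trials, keeping the best-scoring one, as the paper reports its
# baseline's best precision and recall.
best = max(random_guess(500, 0.261, y_true) for _ in range(10))
print(best)
```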


== 5. Results ==
This section presents the results of our analysis.
{| class="wikitable"
|+Table 2. WordPress Model Performance Hotspots versus Random Guess
! Release
! Hotspot Model Precision
! Hotspot Model Recall
! Random Guess Precision
! Random Guess Recall
|-
| style="background: #eeeeee" | 2.0
| 0.50
| 0.10
| 0.14
| 0.10
|-
| style="background: #eeeeee" |  2.1
| 0.38
| 0.13
| 0.20
| 0.17
|-
| style="background: #eeeeee" |  2.2
| 0.43
| 0.32
| 0.23
| 0.26
|-
| style="background: #eeeeee" |  2.3
| 0.28
| 0.21
| 0.11
| 0.17
|-
| style="background: #eeeeee" |  2.5
| 0.19
| 0.18
| 0.04
| 0.05
|-
| style="background: #eeeeee" |  2.6
| 0.12
| 0.40
| 0.00
| 0.00
|-
| style="background: #eeeeee" |  2.7
| 0.31
| 0.40
| 0.09
| 0.07
|-
| style="background: #eeeeee" |  2.8
| 0.02
| 0.17
| 0.00
| 0.00
|}
=== 5.1. Statistical Results and Predictive Modeling ===
For both projects, we performed the statistical tests as described in Section 4.6 to analyze the research hypotheses (H1-H8) described at the beginning of Section 4.  We summarize the results in Table 1. In both projects, we found that the more hotspots a file contains, the more likely that file will be vulnerable (H1), and the more changes developers will make to that file due to any type of vulnerability (H2).  We found that issue reports related to input validation vulnerabilities result in a higher average number of repository revisions, meaning that input validation vulnerabilities tend to require multiple fixes before the development team considers them fixed (H3).
{| class="wikitable"
|+Table 3. WikkaWiki Model Performance Hotspots versus Random Guess
! Release
! Hotspot Model Precision
! Hotspot Model Recall
! Random Guess Precision
! Random Guess Recall
|-
| style="background: #eeeeee" | 1.1.6.1
| 1.00
| 0.15
| 0.13
| 0.07
|-
| style="background: #eeeeee" |  1.1.6.2
| 1.00
| 0.22
| 0.10
| 0.11
|-
| style="background: #eeeeee" |  1.1.6.3
| 1.00
| 0.09
| 0.08
| 0.11
|-
| style="background: #eeeeee" |  1.1.6.4
| 0.08
| 1.00
| 0.00
| 0.00
|-
| style="background: #eeeeee" |  1.1.6.5
| 0.04
| 0.50
| 0.00
| 0.00
|}
We built logistic regression models to evaluate the number of hotspots as a predictor of whether or not a file is vulnerable (H4).  In WordPress, our model had precision between 0.02 and 0.50, and the random guess had precision between 0.0 and 0.23.  Our model had recall between 0.10 and 0.40, and the random guess had recall between 0.0 and 0.26. Our model had better precision than the random guess in five out of eight cases, and better recall in seven out of eight cases (see Table 2). In WikkaWiki, our model had precision between 0.04 and 1.0, and the random guess had precision between 0.0 and 0.13.  Our model had recall between 0.09 and 1.0, and the random guess had recall between 0.0 and 0.11. Our model had better precision than the random guess in three out of five cases, and better recall in four out of five cases (see Table 3). The values for precision and recall vary because the model's performance changed on each of the 15 versions of the projects we analyzed.  As the model sees more vulnerable files, the model misses fewer vulnerabilities (higher recall), but also reports more false positives (lower precision) as it relaxes its criteria for choosing a vulnerable file.
The logistic regression model that Weka generated to predict whether files were vulnerable or not, based on number of hotspots and lines of code, consistently contained a positive coefficient for the term representing the number of hotspots for both projects. The positive coefficient indicates that a greater number of hotspots in a file in the current release results in a higher probability of that file containing any type of web application vulnerability (H5).
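The effect of a positive hotspot coefficient can be illustrated with the logistic function. The coefficient values below are hypothetical: the text reports only the sign of the fitted hotspot coefficient, not its magnitude.

```python
import math

# Hypothetical coefficients for illustration only; the actual Weka models'
# fitted values are not reported in the text, only the sign of b_hotspots.
b0, b_hotspots, b_loc = -3.0, 0.6, 0.001

def p_vulnerable(hotspots, loc):
    """Logistic model: probability a file is vulnerable given hotspots and LOC."""
    z = b0 + b_hotspots * hotspots + b_loc * loc
    return 1 / (1 + math.exp(-z))

# With b_hotspots > 0, the predicted probability rises with the hotspot count
# at a fixed file size, which is the content of H5.
print(p_vulnerable(0, 500))  # fewer hotspots: lower probability
print(p_vulnerable(5, 500))  # more hotspots: higher probability
```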
Based upon these research results, our prioritization heuristic is as follows: ''More SQL and non-SQL vulnerabilities will be found in files that contain more hotspots per line of code. ''
=== 5.2. Comparing the Projects ===
In WordPress, the ability to interact with the database is exposed directly to whichever page needs access through the <code>$wpdb->query</code> function (and similar functions), which takes the SQL query in question as a parameter and filters user input into the query where required.
In WikkaWiki, however, the developers made a high-level design decision to separate the database concern into a specific source code location.  Specifically, database access goes through a class, called <code>Wakka</code>, which encapsulates the various queries into functions that perform database operations on behalf of the client page.  SQL hotspots are more spread out among the files in WordPress than among the files in WikkaWiki (H6). The developers of WikkaWiki have chosen to centralize the functionality related to database interaction into a single set of classes that contain all the SQL hotspots.
This set of classes abstracts interactions with the database into a set of semantic methods that hide the direct interaction with the database from the programmer.  These methods provide the necessary implementation required for input validation techniques, such as sanitizing a URL to protect the system from URL manipulation attacks, and stripping any HTML returned to the user for any potentially dangerous tags. Centralizing database interaction also provides the convenience of not having to make changes throughout the code due to a single security problem.  When there is a problem with database interaction in WikkaWiki, developers usually know immediately where to look.  In WordPress, some time and energy may be consumed looking around for the source of the problem.
Our data shows that this high-level design decision coincides with a key difference in the security posture of the two studied projects: WikkaWiki has fewer input validation vulnerabilities than WordPress.  First, in the predictive models described in Section 5.1, the number of SQL queries in a given file was a better predictor of the vulnerability of that file in WordPress than in WikkaWiki.  This evidence is further supported by the fact that 29% of the total reported vulnerabilities in WordPress were input validation vulnerabilities, a higher proportion than the 18% of total reported vulnerabilities that were input validation vulnerabilities in WikkaWiki (H7).  Finally, we examined the proportion of changed lines of code due to security vulnerabilities that were hotspots.  In WordPress, this proportion was higher at 3% than in WikkaWiki, where the proportion was only 0.7% (H8).  The implication of H8 for the development team is that hotspots in a project that has utilized a design like the one in WikkaWiki are going to be changed less often because they are easier to maintain.  ''In short, designing a web application with all database interaction and input validation located in a single component can help decrease that application's proportion of reported input validation vulnerabilities.''


== 6. Limitations ==
We can never find or know all vulnerabilities or faults in a given software system.  As such, both WikkaWiki and WordPress will continue to have latent security vulnerabilities that are not included in this analysis. The way that hotspots were defined within the script may also have been incomplete and missed some instances of hotspots that did not resemble the chosen form.  Similarly, the collection of vulnerability reports was analyzed by hand, but the changes due to those reports may have been incorrectly assigned due to the way the issue reports were interpreted by the collection script.  The release number was chosen as a level of granularity, but some files may have only been present for part of a release, and our analysis would miss a vulnerability in these files. There may be some error in our classification of the issue reports. WordPress and WikkaWiki were chosen due to the availability of their issue reports, but there may be some unknown selection bias in our study due to the fact that both projects have such well-documented vulnerability histories and solid contributing developer communities.  Also, our analysis and data gathering are limited to only the two projects we study in this paper.  These results may not be repeatable in other systems with other architectures.  Finally, compared to industrial products, WordPress and WikkaWiki are relatively small in terms of code size.


== 7. Conclusion ==
Hotspots appear to be key in protecting a web application against attacks because we can use prediction based upon hotspots’ locations to target code inspection and testing.  Developers and testers of web applications can use models based upon hotspots to predict where all types of web application vulnerabilities will be in the next release of the system.  Also, testers and V&V teams can prioritize security fortification efforts to place files that these models indicate as likely vulnerable first, thus resulting in a web application with a better security posture. Our prioritization heuristic is as follows: ''More SQL and non-SQL vulnerabilities will be found in files that contain more hotspots per line of code. ''
Input validation vulnerabilities continue to be a prominent problem with no single solution.  However, we have found empirical evidence that separating the concern of database interaction appears to improve the security of an application with respect to the proportion of reported input validation vulnerabilities. Isolating database interaction into a single class has resulted in a lower proportion of input validation vulnerabilities reported for WikkaWiki, and fewer hotspots changed due to security issues. Future work should compare design choices like this to further investigate the effect these choices have on the security posture of web applications.


== 8. Acknowledgements ==
We would like to thank Andy Meneely for his guidance on the empirical data collection as well as the statistical analysis for this paper. We would also like to thank Yonghee Shin for introducing the notion of using SQL hotspots as an internal metric.  This work is supported by the National Science Foundation under CAREER Grant No. 0346903. Any opinions expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


== 9. References ==
# http://wush.net/trac/wikka/ticket/293
# http://core.trac.wordpress.org/ticket/2041
# CWE-20, CWE-79, CWE-89, CWE-77, CWE-22, CWE-74, CWE-226, CWE-138, CWE-212, CWE-150
# http://www.r-project.org/, Version 2.9.2.
# http://www.cs.waikato.ac.nz/ml/weka/
# With two (N) datasets, a researcher can only make one (N-1) comparison.
[[Category:Conference Papers]]