Using SQL Hotspots in a Prioritization Heuristic for Detecting All Types of Web Application Vulnerabilities



== 4. Methodology ==
{| class="wikitable" style="text-align: left; width: 100%;"
|+ Table 1. Results per Project
!
! WordPress
! WikkaWiki
|-
| Releases analyzed
|Nine
|Six
|-
| Security issue reports analyzed
| 97
| 61
|-
| Vulnerable files (over project's history)
| 26% (85 / 326)
| 29% (44 / 209)
|-
| Average number of hotspots (over project's history)
| 255
| 92
|-
| Average percent of files having at least one hotspot
| 14.2%
| 8.42%
|-
|colspan="3" style="background: #eeeeee" | '''Hypotheses† about files'''
|-
| '''H1.''' The more hotspots a file contains per line of code, the more likely it is that the file contains any web application vulnerability.
| True (Logistic Regression, p<0.05)
| True (Logistic Regression, p<0.05)
|-
| '''H2.''' The more hotspots a file contains, the more times that file was changed due to any kind of vulnerability (not just input validation vulnerabilities).
| True (Simple Linear Regression, p<0.0001, Adjusted R<sup>2</sup> = 0.4208)
| True (Simple Linear Regression, p<0.0001, Adjusted R<sup>2</sup> = 0.3802)
|-
|colspan="3" style="background: #eeeeee" | '''Hypotheses about issue reports'''
|-
| '''H3.''' Input validation vulnerabilities result in a higher average number of repository revisions than any other type of vulnerability*.
| True (MWW, p<0.05)
| True (MWW, p<0.05)
|-
|colspan="3" style="background: #eeeeee" | '''Hypotheses about prediction'''
|-
| '''H4.''' Hotspots can be used to predict files that will contain any type of web application vulnerability in the current release.
| True (Predictive Modeling, see Table 2)
| True (Predictive Modeling, see Table 3)
|-
| '''H5.''' The more hotspots a file contains, the more likely that file will be vulnerable in the next release.
| True (Positive Coefficient on Predictive Models)
| True (Positive Coefficient on Predictive Models)
|-
|colspan="3" style="background: #eeeeee" | '''Hypotheses comparing projects'''
|-
| '''H6.''' The average number of hotspots per file is more variable in WordPress than in WikkaWiki.
| colspan=2 | True (F-test, p<0.000001)
|-
| '''H7.''' WordPress suffered a higher proportion of input validation vulnerabilities than WikkaWiki.
| colspan=2 | True (Chi-Squared, p=0.0692)
|-
| '''H8.''' In WordPress, more of the lines of code that were changed due to security issues were hotspots.
| colspan=2 | True (Chi-Squared, p<0.00001)
|-
| colspan=3 style="border-style: solid; border-width: 0 1px 1px 0" | *This finding is consistent with the report from SANS (see Section 1) that indicates that the most popular types of web application attacks are input validation vulnerabilities.
&dagger;Please note that we use the term "hypothesis" in this table with respect to scientific hypotheses and not statistical hypotheses.
|}


We conducted two case studies to empirically investigate eight hypotheses related to hotspot source code locations and vulnerabilities reported in the systems' bug tracking systems.  We present these hypotheses, as well as their results, in Table 1 and explain the results further in Section 5.  Our hypotheses point to the research objective: to improve the prioritization of security fortification efforts by investigating whether SQL hotspots can serve as the basis for a heuristic that predicts all vulnerability types.  Alongside SQL hotspots, we also include lines of code in our analysis to improve the accuracy and predictive power of our heuristic.  Specifically, we look at the relationship between hotspots and files (H1-H2), the amount of code change as related to the vulnerability type (H3), the predictive ability of hotspots for any vulnerability type (H4-H5), and the effect that collocating hotspots can have on the number and types of vulnerabilities in a given system (H6-H8).


This section presents the results of our analysis.
{| class="wikitable"
|+ Table 2. WordPress Model Performance: Hotspots versus Random Guess
! Release
! Hotspot Model Precision
! Hotspot Model Recall
! Random Guess Precision
! Random Guess Recall
|-
| style="background: #eeeeee" | 2.0
| 0.50
| 0.10
| 0.14
| 0.10
|-
| style="background: #eeeeee" |  2.1
| 0.38
| 0.13
| 0.20
| 0.17
|-
| style="background: #eeeeee" |  2.2
| 0.43
| 0.32
| 0.23
| 0.26
|-
| style="background: #eeeeee" |  2.3
| 0.28
| 0.21
| 0.11
| 0.17
|-
| style="background: #eeeeee" |  2.5
| 0.19
| 0.18
| 0.04
| 0.05
|-
| style="background: #eeeeee" |  2.6
| 0.12
| 0.40
| 0.00
| 0.00
|-
| style="background: #eeeeee" |  2.7
| 0.31
| 0.40
| 0.09
| 0.07
|-
| style="background: #eeeeee" |  2.8
| 0.02
| 0.17
| 0.00
| 0.00
|}


=== 5.1. Statistical Results and Predictive Modeling ===


For both projects, we performed the statistical tests described in Section 4.6 to analyze the research hypotheses (H1-H8) presented at the beginning of Section 4.  We summarize the results in Table 1. In both projects, we found that the more hotspots a file contains, the more likely that file is to be vulnerable (H1) and the more changes developers will make to that file due to any type of vulnerability (H2).  We also found that issue reports related to input validation vulnerabilities result in a higher average number of repository revisions, meaning that input validation vulnerabilities tend to require multiple fixes before the development team considers them fixed (H3).
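
As an illustration of the rank-based comparison behind H3, the following minimal sketch computes only the Mann-Whitney-Wilcoxon (MWW) U statistic for two groups of issue reports. It is not the analysis used in this study: the function name and the revision counts are hypothetical, and no p-value is derived.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return the U statistic for sample_a.

    U counts the pairs (a, b) in which a exceeds b; a tied pair counts 0.5.
    """
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical numbers of repository revisions per issue report.
input_validation_revisions = [3, 4, 2, 5]
other_revisions = [1, 2, 1, 3]

u_input = mann_whitney_u(input_validation_revisions, other_revisions)
u_other = mann_whitney_u(other_revisions, input_validation_revisions)
print(u_input, u_other)
```

The two U values always sum to the number of pairs (here 4 × 4 = 16), so a value of U close to that maximum indicates that one group tends to have more revisions than the other.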
{| class="wikitable"
|+ Table 3. WikkaWiki Model Performance: Hotspots versus Random Guess
! Release
! Hotspot Model Precision
! Hotspot Model Recall
! Random Guess Precision
! Random Guess Recall
|-
| style="background: #eeeeee" | 1.1.6.1
| 1.00
| 0.15
| 0.13
| 0.07
|-
| style="background: #eeeeee" |  1.1.6.2
| 1.00
| 0.22
| 0.10
| 0.11
|-
| style="background: #eeeeee" |  1.1.6.3
| 1.00
| 0.09
| 0.08
| 0.11
|-
| style="background: #eeeeee" |  1.1.6.4
| 0.08
| 1.00
| 0.00
| 0.00
|-
| style="background: #eeeeee" |  1.1.6.5
| 0.04
| 0.50
| 0.00
| 0.00
|}


We built logistic regression models to evaluate the number of hotspots as a predictor of whether or not a file is vulnerable (H4).  In WordPress, our model had precision between 0.02 and 0.50, and the random guess had precision between 0.0 and 0.23.  Our model had recall between 0.10 and 0.40, and the random guess had recall between 0.0 and 0.26. Our model had better precision than the random guess in five out of eight cases, and had better recall than the random guess in seven out of eight cases (see Table 2). In WikkaWiki, our model had precision between 0.04 and 1.0, and the random guess had precision between 0.0 and 0.13.  Our model had recall between 0.09 and 1.0, and the random guess had recall between 0.0 and 0.11. Our model had better precision than the random guess in three out of five cases, and had better recall than the random guess in four out of five cases (see Table 3). The values for precision and recall vary because the model's performance changed on each of the 15 versions of the projects we analyzed.  As the model sees more vulnerable files, it misses fewer vulnerabilities (higher recall), but it also reports more false positives (lower precision) as it relaxes its criteria for choosing a vulnerable file.
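
The precision and recall figures above follow the standard definitions, which can be sketched as below. This is a minimal illustration and not the tooling used in the study; the file names and the `precision_recall` helper are hypothetical.

```python
def precision_recall(flagged, vulnerable):
    """Return (precision, recall) for two sets of file names.

    precision = correctly flagged files / all flagged files
    recall    = correctly flagged files / all vulnerable files
    """
    flagged, vulnerable = set(flagged), set(vulnerable)
    true_positives = len(flagged & vulnerable)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(vulnerable) if vulnerable else 0.0
    return precision, recall


# Hypothetical release with ten files, four of which are vulnerable.
all_files = {f"file{i}.php" for i in range(10)}
vulnerable = {"file0.php", "file1.php", "file2.php", "file3.php"}

# A strict model flags three files and gets two of them right.
strict_p, strict_r = precision_recall({"file0.php", "file1.php", "file9.php"}, vulnerable)

# A fully relaxed model flags every file.
relaxed_p, relaxed_r = precision_recall(all_files, vulnerable)

print(strict_p, strict_r, relaxed_p, relaxed_r)
```

In this toy case the relaxed model misses nothing (recall 1.0) but most of its reports are false positives (precision 0.4), which is the trade-off described in the paragraph above.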


== 7. Conclusion ==
Hotspots appear to be key in protecting a web application against attacks because we can use predictions based upon hotspots’ locations to target code inspection and testing.  Developers and testers of web applications can use models based upon hotspots to predict where all types of web application vulnerabilities will be in the next release of the system.  Testers and V&V teams can also prioritize security fortification efforts by placing first the files that these models indicate as likely vulnerable, resulting in a web application with a better security posture. Our prioritization heuristic is as follows: ''More SQL and non-SQL vulnerabilities will be found in files that contain more hotspots per line of code.''
Input validation vulnerabilities continue to be a prominent problem with no single solution.  However, we have found empirical evidence that separating the concern of database interaction appears to improve the security of an application with respect to the proportion of reported input validation vulnerabilities. Isolating database interaction into a single class has resulted in a lower proportion of input validation vulnerabilities reported on WikkaWiki, and fewer hotspots changed on WikkaWiki due to security issues. Future work should compare design choices like this to further investigate the effect these choices have on the security posture of web applications.
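
The prioritization heuristic stated above can be sketched as a simple ranking by hotspot density. This is a minimal illustration that assumes hotspot and line counts have already been collected for each file; the `FileMetrics` type, the file paths, and the counts are hypothetical.

```python
from typing import NamedTuple


class FileMetrics(NamedTuple):
    path: str
    hotspots: int        # number of SQL hotspots found in the file
    lines_of_code: int   # size of the file in lines of code


def hotspot_density(metrics):
    """Hotspots per line of code; empty files get a density of 0."""
    if metrics.lines_of_code == 0:
        return 0.0
    return metrics.hotspots / metrics.lines_of_code


def prioritize(files):
    """Order files so that the highest hotspot density comes first."""
    return sorted(files, key=hotspot_density, reverse=True)


# Hypothetical measurements for three files.
files = [
    FileMetrics("includes/db.php", hotspots=12, lines_of_code=300),
    FileMetrics("templates/header.php", hotspots=0, lines_of_code=80),
    FileMetrics("admin/users.php", hotspots=9, lines_of_code=150),
]

ranked = [m.path for m in prioritize(files)]
print(ranked)
```

Here `admin/users.php` is inspected first even though it has fewer hotspots than `includes/db.php`, because its hotspots are denser relative to its size.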


== 8. Acknowledgements ==
We would like to thank Andy Meneely for his guidance on the empirical data collection as well as the statistical analysis for this paper. We would also like to thank Yonghee Shin for introducing the notion of using SQL hotspots as an internal metric.  This work is supported by the National Science Foundation under CAREER Grant No. 0346903. Any opinions expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


== 9. References ==
# http://www.cs.waikato.ac.nz/ml/weka/
# With two (N) datasets, a researcher can only make one (N-1) comparison.
[[Category:Conference Papers]]