Using SQL Hotspots in a Prioritization Heuristic for Detecting All Types of Web Application Vulnerabilities

{| class="wikitable"
|+Table 2. WordPress Model Performance Hotspots versus Random Guess
! Release
! Hotspot
Model
Precision
! Hotspot
Model
Recall
! Random
Guess
Precision
! Random
Guess
Recall
|-
| style="background: #eeeeee" | 2.0
| 0.50
| 0.10
|
| 0.10
|-
| style="background: #eeeeee" | 2.1
| 0.38
| 0.13
|
| 0.17
|-
| style="background: #eeeeee" | 2.2
| 0.43
| 0.32
|
| 0.26
|-
| style="background: #eeeeee" | 2.3
| 0.28
| 0.21
|
| 0.17
|-
| style="background: #eeeeee" | 2.5
| 0.19
| 0.18
|
| 0.05
|-
| style="background: #eeeeee" | 2.6
| 0.12
| 0.40
|
| 0.00
|-
| style="background: #eeeeee" | 2.7
| 0.31
| 0.40
|
| 0.07
|-
| style="background: #eeeeee" | 2.8
| 0.02
| 0.17
|
|
|}


For both projects, we performed the statistical tests described in Section 4.6 to analyze the research hypotheses (H1-H8) stated at the beginning of Section 4. We summarize the results in Table 1. In both projects, we found that the more hotspots a file contains, the more likely that file is to be vulnerable (H1), and the more changes developers will make to that file due to any type of vulnerability (H2). We also found that issue reports related to input validation vulnerabilities result in a higher average number of repository revisions, meaning that input validation vulnerabilities tend to require multiple fixes before the development team considers them fixed (H3).
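The specific tests are defined in Section 4.6 of the paper; as a generic, self-contained illustration of checking an H1-style association (more hotspots in a file, more likely that file is vulnerable), the sketch below computes Spearman's rank correlation on made-up per-file data. Both the data and the choice of Spearman's rho are assumptions for illustration, not the paper's actual procedure or results.

```python
# Hypothetical per-file data (illustrative values, not from the paper):
# number of SQL hotspots in each file, and whether it turned out vulnerable.
hotspots   = [0, 0, 1, 2, 3, 5, 8, 0, 1, 4]
vulnerable = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]

def ranks(xs):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

print(f"rho = {spearman(hotspots, vulnerable):.2f}")
```

With these made-up values the correlation is strongly positive (rho ≈ 0.88); in a real analysis one would also report a p-value before accepting the hypothesis.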
{| class="wikitable"
|+Table 3. WikkaWiki Model Performance Hotspots versus Random Guess
! Release
! Hotspot
Model
Precision
! Hotspot
Model
Recall
! Random
Guess
Precision
! Random
Guess
Recall
|-
| style="background: #eeeeee" | 1.1.6.1
| 1.00
| 0.15
| 0.13
| 0.07
|-
| style="background: #eeeeee" |  1.1.6.2
| 1.00
| 0.22
| 0.10
| 0.11
|-
| style="background: #eeeeee" |  1.1.6.3
| 1.00
| 0.09
| 0.08
| 0.11
|-
| style="background: #eeeeee" |  1.1.6.4
| 0.08
| 1.00
| 0.00
| 0.00
|-
| style="background: #eeeeee" |  1.1.6.5
| 0.04
| 0.50
| 0.00
| 0.00
|}


We built logistic regression models to evaluate the number of hotspots as a predictor of whether a file is vulnerable (H4). In WordPress, our model had precision between 0.02 and 0.50, while the random guess had precision between 0.00 and 0.23. Our model had recall between 0.10 and 0.40, while the random guess had recall between 0.00 and 0.26. Our model had better precision than the random guess in five out of eight cases, and better recall in seven out of eight cases (see Table 2). In WikkaWiki, our model had precision between 0.04 and 1.00, while the random guess had precision between 0.00 and 0.13. Our model had recall between 0.09 and 1.00, while the random guess had recall between 0.00 and 0.11. Our model had better precision than the random guess in three out of five cases, and better recall in four out of five cases (see Table 3). The values for precision and recall vary because the model's performance changed on each of the 15 versions of the projects we analyzed. As the model sees more vulnerable files, it misses fewer vulnerabilities (higher recall) but also reports more false positives (lower precision), because it relaxes its criteria for classifying a file as vulnerable.
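The precision/recall trade-off described above can be sketched concretely. The paper fits a logistic regression; thresholding its predicted probability behaves the same way as thresholding the raw hotspot count, so for brevity this sketch (on made-up data, not the study's) classifies a file as vulnerable when its hotspot count reaches a threshold, then measures precision and recall as the threshold is relaxed.

```python
# Illustrative, made-up data: (hotspot_count, actually_vulnerable) per file.
files = [
    (0, 0), (0, 0), (1, 0), (1, 1), (2, 0),
    (3, 1), (4, 0), (5, 1), (7, 1), (9, 1),
]

def precision_recall(threshold):
    """Predict 'vulnerable' when hotspots >= threshold; score the prediction."""
    predicted = [h >= threshold for h, _ in files]
    actual = [bool(v) for _, v in files]
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

# Relaxing the criterion (lower threshold) trades precision for recall.
for t in (5, 3, 1):
    p, r = precision_recall(t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

On this toy data, lowering the threshold from 5 to 1 raises recall from 0.60 to 1.00 while precision falls from 1.00 to 0.62, mirroring the pattern reported for the per-release models.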
# http://www.cs.waikato.ac.nz/ml/weka/
# With two (N) datasets, a researcher can only make one (N-1) comparison.
[[Category:Conference Papers]]