A Machine Learning Approach to Phishing Detection and Defense (2015)
Chapter 5. Implementation and Result
Abstract
This chapter discusses the implementation process and the results obtained. First, the objective is recapped and the process of attaining it is described in brief, together with a flowchart that gives an overview of the investigation and introduces the scope of the chapter. Second, we give a detailed experimental setup for the training and testing model, including the key parameters used, the performance of each classifier under the same performance metrics, and the design of our ensemble classifier, including the voting method used to select the best ensemble. Because majority voting is used, each ensemble must contain an odd number of classifiers; we therefore selected three classifiers at a time from the pool of four, giving a total of four ensembles. The results of each classifier are tabulated and illustrated with graphs. Third, the two techniques are compared on the basis of their accuracies. The chapter concludes with a summary.
Keywords
classifier
ensemble, detection
accuracy
performance metrics
false alarm
voting
5.1. Introduction
The process of dataset processing, feature selection, and dataset division was presented in Chapter 4. This chapter addresses the problem of selecting the best classification technique for website phishing detection, since a poor choice degrades detection accuracy and raises the false alarm rate. The main objective of this chapter is to train and test the individual reference classifiers (C5.0, LR, KNN, and SVM) with the same dataset, design an ensemble of the reference classifiers, and compare the performance of the ensemble with that of the best single classifier, choosing the better of the two to overcome the low classification rate in website phishing detection. One of the major contributing factors to low overall accuracy is the selection of weakly weighted features for classification. The situation worsens when a lazy algorithm is trained and tested with a large dataset. Therefore, the research methodology used in this project may not perform well if the wrong classifier is trained and tested with a dataset larger than the classifier’s capacity.
The solution and research activities discussed in this chapter are identified in Phase 2 and Phase 3 of the overall research plan in Chapter 3. The chapter begins with an overview of the investigation, followed by details of the training and testing model for the reference classifiers. The operational procedure and algorithm for the investigated models are then provided. Performance metrics are presented in terms of detection accuracy, precision, recall, and F-score. Finally, an overall discussion of the results and a summary conclude the chapter.
5.2. An overview of the investigation
The investigation in this chapter can be divided into three main parts, namely training and testing, design ensemble, and finally comparative solution. Figure 5.1 shows an overview of the investigations conducted in this chapter.
FIG. 5.1 An overview of the investigation towards selecting the best classification technique for website phishing detection.
Training and testing model refers to the procedure of training the algorithms on part of the dataset and testing how well they classify the dataset. Design ensemble refers to the process of combining the outputs of the individual classifiers into an ensemble. Although there are different ways to combine classifiers, majority voting was used for this process. This is based on the assumption that the error rate of each classifier is less than 0.5 and that the errors made by the classifiers are uncorrelated. Under that assumption, the probability that the ensemble classifier makes a wrong prediction is considerably low. Finally, the results obtained for the individual classifiers and for the ensemble design were compared.
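The majority-voting assumption can be made concrete with a short calculation. The following is an illustrative sketch rather than part of the implementation described in this chapter: it assumes that every classifier has the same error rate and that the classifiers err independently, and the function name `majority_vote_error` is hypothetical.

```python
from math import comb


def majority_vote_error(n_classifiers: int, error_rate: float) -> float:
    """Probability that a majority of independent classifiers are wrong.

    Assumes (for illustration only) that all classifiers share the same
    error rate and that their errors are uncorrelated.
    """
    if n_classifiers % 2 == 0:
        raise ValueError("use an odd number of classifiers to avoid ties")
    majority = n_classifiers // 2 + 1
    # Sum the binomial probabilities of 'majority' or more classifiers erring.
    return sum(
        comb(n_classifiers, k)
        * error_rate ** k
        * (1 - error_rate) ** (n_classifiers - k)
        for k in range(majority, n_classifiers + 1)
    )


# Compare the single-classifier error with the three-member ensemble error.
for rate in (0.01, 0.1, 0.3, 0.5):
    print(rate, majority_vote_error(3, rate))
```

With three classifiers that each err 10% of the time, the ensemble errs roughly 2.8% of the time. At an error rate of 0.5 the ensemble is no better than a single classifier, which is why the assumption requires each error rate to be below 0.5.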
5.2.1. Experimental Setup
Experiments were performed with 5249 instances (1638 non-phishing = 31% and 3611 phishing = 69%), all of which were manually classified as phishing or non-phishing. After selecting all the attributes (nine regular attributes, one binominal class label), the last column of data represents whether the URL is considered phishing (1) or not (0). The attributes indicate whether there are any irregular patterns in the classification results obtained during implementation of the algorithms. The statistical measures and definitions of the attributes are as follows:
1. Continuous real [0, 1] attributes of type attributes
2. Two continuous real [0, −1] attributes of type attributes
3. One binominal (0, 1) class attribute of type phish = denotes whether the URL was considered phishing (1) or not (0).
4. Three datasets are used. These datasets are named Set A, Set B, and Set C, as discussed in the dataset division section in Chapter 4.
5.3. Training and testing model (baseline model)
The training and testing model is also termed the baseline model in this book. This model serves as a baseline for selecting the best ensemble classifier discussed in Section 5.4. Furthermore, the output of the baseline model serves as one of the inputs to the process discussed in Section 5.5. Figure 5.2 shows the procedure of the training and testing model.
FIG. 5.2 Procedure for training and testing model (baseline).
In this design, the “retrieve dataset” process retrieves one of the three datasets at a time and passes it to the “training and validation” process, where x-validation is used and the model is applied for training. The most important components of this model are the reference classifiers, one of which is used on each loop from “performance metric” back to “training and validation.” In addition, “performance metric” loops back to “retrieve dataset” after every complete round of obtaining performance metrics, until all three datasets have been passed through the model.
In order to successfully carry out the training and testing process, some parameters are used to achieve the best result. These parameters are defined in Table 5.1.
Table 5.1
Key Parameter Values Used in Training and Testing Process
Parameter | Value/Quantity | Description
K | 1 | Finding the k training examples that are closest to the unseen example is the first step of the k-NN algorithm
Sampling type | Stratified sampling | Builds random subsets and ensures that the class distribution in the subsets is the same as in the whole reference dataset
No. of validations | 10 | Size of testing set used
Performance (binominal classification) | Main criterion (accuracy, precision, recall, F-measure) | This operator is used for statistical performance evaluation of binominal classification tasks
The K nearest neighbor algorithm was studied for different numbers of neighbors, and the results are shown in Table 5.2. A key observation for this classifier is that errors tend to increase as the number of neighbors increases. Since the number of samples in each class is not balanced, decreasing the number of neighbors may improve the result. In addition, Table 5.3 shows the resulting confusion matrices for K Nearest Neighbor, which were used to select the number of neighbors. K-NN1 shows the best result and is therefore used in the implementation described in the following sections.
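To illustrate why a larger number of neighbors can hurt when classes are unbalanced, the sketch below implements a plain k-NN vote. It is a minimal illustration, not the tool used for the experiments in this chapter; the function name `knn_predict`, the Euclidean distance, and the toy data are all assumptions made for the example.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, query, k=1):
    """Label `query` by majority vote among its k nearest training points."""
    # Rank training examples by Euclidean distance to the unseen example.
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], query))
    nearest_labels = [train_y[i] for i in order[:k]]
    # Return the most frequent label among the k nearest neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two features scaled to [0, 1];
# 1 = phishing, 0 = non-phishing. Class 0 is deliberately over-represented.
train_X = [(0.9, 0.8), (0.8, 0.9), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15)]
train_y = [1, 1, 0, 0, 0]

query = (0.85, 0.85)  # lies next to the two phishing examples
for k in (1, 3, 5):
    print(k, knn_predict(train_X, train_y, query, k))
```

In this toy set the query is labeled as phishing for k = 1 and k = 3, but with k = 5 every training point votes and the larger class wins regardless of distance.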
Table 5.2
Output Result of Different Number of Neighbors
K-NN using 10 x-Validation
Metrics | K-NN1 | K-NN2 | K-NN3 | K-NN4 | K-NN5 | K-NN6 | K-NN7
Accuracy | 99.37% | 99.16% | 99.20% | 98.69% | 98.57% | 95.86% | 98.97%
Precision | 99.76% | 99.76% | 99.43% | 99.43% | 99.27% | 99.67% | 99.59%
Recall | 99.35% | 99.18% | 99.43% | 98.69% | 98.69% | 98.69% | 98.94%
F Score | 99.55% | 99.47% | 99.43% | 99.06% | 98.98% | 99.18% | 99.26%
Table 5.3
Confusion Matrix Resulted from K Nearest Neighbor
Real Classes | K = 1: Phishing | K = 1: Non-phishing | K = 2: Phishing | K = 2: Non-phishing | K = 3: Phishing | K = 3: Non-phishing
Phishing | 99.43% | 0.22% | 99.43% | 1.92% | 98.67% | 1.33%
Non-phishing | 0.55% | 99.35% | 0.55% | 99.18% | 1.28% | 99.43%
Another key parameter used in training and testing process is the “sampling type.” In this implementation, stratified sampling was chosen because the variable type of the dataset used is set to binomial. Figure 5.3 shows the pseudocode for stratified sampling type.
FIG. 5.3 Pseudocode for Stratified sampling type.
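As a rough illustration of what stratified sampling does, the sketch below deals the indices of each class across the folds in turn, so that every fold keeps approximately the class distribution of the whole dataset. It is one common way to stratify and is not necessarily the procedure given in Figure 5.3; the function name `stratified_folds` and the 69/31 toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Split example indices into folds that preserve the class distribution."""
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # random subsets within each class
        # Deal indices round-robin, continuing where the previous class
        # stopped so that fold sizes stay balanced as well.
        for position, index in enumerate(indices):
            folds[(offset + position) % n_folds].append(index)
        offset += len(indices)
    return folds


# Hypothetical labels with roughly the 69% / 31% split reported above.
labels = [1] * 69 + [0] * 31
for fold in stratified_folds(labels, n_folds=10):
    phishing = sum(labels[i] for i in fold)
    print(len(fold), phishing)
```

Each fold can then serve once as the test set while the remaining folds are used for training, which is how the validation number in Table 5.1 is applied.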
During this phase, different parameter values were used to train and test the classifiers in order to justify the parameters finally chosen. One parameter that was varied several times during the initial process is the number of validations listed in Table 5.1.
After using nine different validation numbers from 10 to 90, such that x = [10, 20, 30, ..., 90], and examining the standard deviation of the results, it was concluded that, because the standard deviation is insignificant, any of the results can be used. Tables 5.4–5.7 show the accuracy, precision, recall, and f-measure, respectively, of the reference classifiers, together with the average and standard deviation over all nine validation numbers, whereas Figures 5.4–5.7 show the plots of average against standard deviation for accuracy, precision, recall, and f-measure, respectively.
Table 5.4
Accuracy Results for Validation Numbers Used Respectively
CV | C4.5 | LR | K-NN1 | K-NN2 | SVM
10 | 99.09% | 99.03% | 99.37% | 99.26% | 99.03%
20 | 99.08% | 99.03% | 99.37% | 99.26% | 97.88%
30 | 98.97% | 99.03% | 99.37% | 99.26% | 99.03%
40 | 98.97% | 99.03% | 99.37% | 99.26% | 99.03%
50 | 99.03% | 99.03% | 99.37% | 99.26% | 99.03%
60 | 98.98% | 99.03% | 99.37% | 99.26% | 98.80%
70 | 99.09% | 99.03% | 99.37% | 99.26% | 98.63%
80 | 98.97% | 99.03% | 99.43% | 99.32% | 99.03%
90 | 99.03% | 99.03% | 99.37% | 99.25% | 98.62%
AVG | 99.02% | 99.03% | 99.38% | 99.27% | 98.79%
STD | 0.0005011 | 2.2204E-16 | 0.00018856 | 0.000195 | 0.00360648
Table 5.5
Precision Results for Validation Numbers Used Respectively
CV | C4.5 | LR | K-NN1 | K-NN2 | SVM
10 | 99.75% | 99.92% | 99.76% | 99.76% | 99.92%
20 | 99.76% | 99.92% | 99.76% | 99.76% | 99.83%
30 | 99.68% | 99.92% | 99.76% | 99.76% | 99.92%
40 | 99.68% | 99.92% | 99.76% | 99.76% | 99.92%
50 | 99.76% | 99.92% | 99.76% | 99.76% | 99.92%
60 | 99.68% | 99.92% | 99.76% | 99.76% | 99.92%
70 | 99.92% | 99.52% | 99.76% | 99.76% | 99.92%
80 | 99.77% | 99.92% | 99.77% | 99.77% | 99.92%
90 | 99.77% | 99.93% | 99.77% | 99.77% | 99.93%
AVG | 99.75% | 99.88% | 99.76% | 99.76% | 99.91%
STD | 0.00070361 | 0.00126139 | 4.1574E-05 | 4.1574E-05 | 0.00028846
Table 5.6
Recall Results for Validation Numbers Used Respectively
CV | C4.5 | LR | K-NN1 | K-NN2 | SVM
10 | 98.94% | 98.69% | 99.35% | 99.18% | 98.69%
20 | 98.94% | 98.70% | 99.35% | 99.18% | 97.14%
30 | 98.85% | 98.69% | 99.35% | 99.18% | 98.69%
40 | 98.86% | 98.70% | 99.35% | 99.19% | 98.70%
50 | 98.86% | 98.69% | 99.35% | 99.19% | 98.69%
60 | 98.87% | 98.70% | 99.36% | 99.19% | 98.38%
70 | 98.76% | 98.67% | 99.33% | 99.17% | 98.08%
80 | 98.77% | 98.69% | 99.43% | 99.27% | 98.69%
90 | 98.85% | 98.68% | 99.34% | 99.18% | 98.08%
AVG | 98.86% | 98.69% | 99.36% | 99.19% | 98.35%
STD | 0.0005871 | 9.4281E-05 | 0.0002708 | 0.00028197 | 0.00493929
Table 5.7
F-Measure Results for Validation Numbers Used Respectively
CV | C4.5 | LR | K-NN1 | K-NN2 | SVM
10 | 99.34% | 99.30% | 99.55% | 99.47% | 99.30%
20 | 99.34% | 99.30% | 99.55% | 99.47% | 98.32%
30 | 99.25% | 99.30% | 99.55% | 99.46% | 99.30%
40 | 99.26% | 99.29% | 99.55% | 99.46% | 99.29%
50 | 99.30% | 99.29% | 99.55% | 99.47% | 99.29%
60 | 99.25% | 99.29% | 99.55% | 99.46% | 99.12%
70 | 99.31% | 99.30% | 99.53% | 99.45% | 98.90%
80 | 99.24% | 99.28% | 99.58% | 99.50% | 99.28%
90 | 99.28% | 99.28% | 99.54% | 99.47% | 98.87%
AVG | 99.29% | 99.29% | 99.55% | 99.47% | 99.07%
STD | 0.00036549 | 7.8567E-05 | 0.00012472 | 0.00013147 | 0.00312769
FIG. 5.4 Overall average and standard deviation for accuracy.
FIG. 5.5 Overall average and standard deviation for precision.
FIG. 5.6 Overall average and standard deviation for recall.
FIG. 5.7 Overall average and standard deviation for F-measure.
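The AVG and STD rows of Tables 5.4–5.7 can be recomputed from the per-validation values. The sketch below does this for the SVM accuracy column of Table 5.4. It assumes that the STD row is the population standard deviation taken over the values expressed as fractions rather than percentages; that convention is an inference from the magnitudes in the tables, not something stated in the text.

```python
from statistics import mean, pstdev

# SVM accuracy column of Table 5.4, for validation numbers 10, 20, ..., 90.
svm_accuracy_percent = [99.03, 97.88, 99.03, 99.03, 99.03,
                        98.80, 98.63, 99.03, 98.62]

# Assumption: the tables report STD on fractions (0..1), not on percentages.
fractions = [value / 100 for value in svm_accuracy_percent]

average = mean(fractions)
spread = pstdev(fractions)  # population standard deviation (divides by n)

print(f"AVG = {average:.2%}")
print(f"STD = {spread:.6f}")
```

A spread of this size, well under one percentage point, is what the text relies on when it treats the choice of validation number as immaterial.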
Comparing the accuracy of K-NN1 and K-NN2 in Table 5.4, K-NN1 clearly performs better than K-NN2, and K-NN1 is therefore chosen over K-NN2 in the implementation phases discussed later in this chapter. Based on the justification given for the number of validations, each of the reference algorithms was trained and tested across the three datasets, and the resulting output is shown in Tables 5.8–5.11. Corresponding charts of the results are shown in Figures 5.8–5.11.
Table 5.8
Accuracy of Individual Classifier in Varying Dataset
Individual Technique Accuracy
SET | C4.5 | LR | KNN | SVM
A | 99.14% | 99.03% | 99.37% | 99.03%
B | 99.31% | 99.31% | 99.31% | 99.31%
C | 99.26% | 99.26% | 98.80% | 99.26%
Table 5.9
Precision of Individual Classifier in Varying Dataset
Individual Technique Precision
SET | C4.5 | LR | KNN | SVM
A | 99.92% | 99.92% | 99.76% | 99.92%
B | 99.88% | 99.88% | 99.66% | 99.88%
C | 98.51% | 98.51% | 98.66% | 98.51%
Table 5.10
Recall of Individual Classifier in Varying Dataset
Individual Technique Recall
SET | C4.5 | LR | KNN | SVM
A | 98.86% | 98.69% | 99.35% | 98.69%
B | 98.74% | 98.74% | 98.97% | 98.74%
C | 99.05% | 99.05% | 97.34% | 99.05%
Table 5.11
F-Score of Individual Classifier in Varying Dataset
Individual Technique F-score
SET | C4.5 | LR | KNN | SVM
A | 99.38% | 99.30% | 99.55% | 99.30%
B | 99.31% | 99.31% | 99.31% | 99.31%
C | 98.76% | 98.76% | 97.98% | 99.31%
FIG. 5.8 Plot of accuracy across varying dataset.
FIG. 5.9 Plot of precision across varying dataset.
FIG. 5.10 Plot of recall across varying dataset.
FIG. 5.11 Plot of f-measure across varying dataset.
Scrutinizing the results of the individual classifiers across the varying datasets, it was observed that K-NN performs best with Set A based on accuracy and f-measure. Considering precision and recall separately may give a confusing interpretation of the results, so the f-measure, which is the harmonic mean of precision and recall, is used instead. Investigating the f-measure of the individual classifiers across the varying datasets, as shown in Table 5.11, the K-NN f-measure is the highest at 99.55%. Hence, K-NN is chosen as the best performing of the reference classifiers (Table 5.12).
Table 5.12
Best Performed Individual Classifier
SET A | K-NN
Accuracy | 99.37%
Precision | 99.76%
Recall | 99.35%
F Score | 99.55%
It can be observed in Table 5.13 that the false alarm rate of K-NN is small, and K-NN is therefore considered a good classifier: the algorithm correctly classified most of the instances, with a resulting false negative value of 0.800.
Table 5.13
False Alarm Rate of K-NN
False Negative: 0.800 +/− 0.600
| True 1 | True 0 | Class Precision
Pred. 1 | 522 | 8 | 98.49%
Pred. 0 | 3 | 1217 | 99.75%
Class recall | 99.43% | 99.35% |
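The class precision and class recall values in Table 5.13 follow directly from the four counts in the confusion matrix. The sketch below recomputes them as a check. The function name `per_class_metrics` and the nested-dictionary layout are assumptions made for the example, and the f-score uses the standard harmonic mean of precision and recall.

```python
def per_class_metrics(matrix):
    """Compute accuracy and per-class (precision, recall, f-score).

    `matrix[predicted][true]` holds the number of instances with that
    predicted label and that true label.
    """
    classes = list(matrix)
    total = sum(matrix[p][t] for p in classes for t in classes)
    correct = sum(matrix[c][c] for c in classes)

    report = {}
    for c in classes:
        predicted_as_c = sum(matrix[c][t] for t in classes)  # row total
        truly_c = sum(matrix[p][c] for p in classes)         # column total
        precision = matrix[c][c] / predicted_as_c
        recall = matrix[c][c] / truly_c
        f_score = 2 * precision * recall / (precision + recall)
        report[c] = (precision, recall, f_score)
    return correct / total, report


# Counts from Table 5.13: rows are predictions, columns are true classes.
table_5_13 = {1: {1: 522, 0: 8}, 0: {1: 3, 0: 1217}}

accuracy, report = per_class_metrics(table_5_13)
print(f"accuracy = {accuracy:.2%}")
for label, (precision, recall, f_score) in report.items():
    print(f"class {label}: precision = {precision:.2%}, "
          f"recall = {recall:.2%}, f-score = {f_score:.2%}")
```

The accuracy and the class 0 f-score computed from these counts agree with the 99.37% and 99.55% reported for K-NN in Table 5.12.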
Figure 5.12 shows the ROC curve for K-NN, which plots the true positive rate against the false positive rate and provides a principled mechanism for exploring operating point tradeoffs. The ROC value obtained is 0.500.
FIG. 5.12 ROC curve for K-NN.
5.4. Ensemble design and voting scheme
Experiments using the varying datasets were conducted in Section 5.3, and the committee of ensembles was designed based on their output. The ensemble algorithm chosen was simple majority voting, and for this reason an odd number of constituent classifiers was required. From the pool of four classifiers, every subset of three classifiers was used as an ensemble, giving a total of four classifier ensembles. Their components are summarized in Table 5.14. These ensembles were evaluated using the same metrics as the individual techniques in Section 5.3. Tables 5.15–5.17 show the results obtained for the four ensembles, which are further illustrated in the charts shown in Figures 5.13–5.15. These results are compared with those of Section 5.3 in a later section.
Table 5.14
Ensemble Components
Ensemble | Alg1 | Alg2 | Alg3
Ensemble 1 | KNN | C4.5 | LR
Ensemble 2 | KNN | C4.5 | SVM
Ensemble 3 | KNN | LR | SVM
Ensemble 4 | C4.5 | LR | SVM
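The four ensembles in Table 5.14 are simply all three-member subsets of the four reference classifiers, each combined by simple majority voting. The sketch below enumerates the subsets and applies the vote to per-instance predictions. It is a minimal illustration: the helper names and the prediction lists in `outputs` are hypothetical and are not results from the experiments.

```python
from collections import Counter
from itertools import combinations

classifiers = ["KNN", "C4.5", "LR", "SVM"]

# Every subset of size three from the pool of four: C(4, 3) = 4 ensembles.
ensembles = list(combinations(classifiers, 3))


def majority_vote(votes):
    """Return the label chosen by most voters (an odd count avoids ties)."""
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(members, outputs):
    """Combine per-instance predictions of the member classifiers."""
    member_outputs = [outputs[name] for name in members]
    return [majority_vote(votes) for votes in zip(*member_outputs)]


# Hypothetical predictions for four URLs (1 = phishing, 0 = non-phishing).
outputs = {
    "KNN":  [1, 0, 1, 1],
    "C4.5": [1, 0, 0, 1],
    "LR":   [0, 0, 1, 1],
    "SVM":  [1, 1, 0, 0],
}

for number, members in enumerate(ensembles, start=1):
    print(f"Ensemble {number}", members, ensemble_predict(members, outputs))
```

Because each ensemble has three members and the class label is binary, at least two members always agree, so no tie-breaking rule is needed.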
Table 5.15
Ensemble Result Using the SET A Dataset
SET A | ENS1 | ENS2 | ENS3 | ENS4
Accuracy | 99.20% | 99.20% | 99.03% | 99.03%
Precision | 99.92% | 99.92% | 99.92% | 99.92%
Recall | 98.94% | 98.94% | 98.69% | 98.69%
F Score | 99.42% | 99.42% | 99.30% | 99.30%
Table 5.16
Ensemble Result Using the SET B Dataset
SET B | ENS1 | ENS2 | ENS3 | ENS4
Accuracy | 99.31% | 99.31% | 99.31% | 99.31%
Precision | 99.88% | 99.88% | 99.88% | 99.88%
Recall | 98.74% | 98.74% | 98.74% | 98.74%
F Score | 99.31% | 99.31% | 99.31% | 99.31%
Table 5.17
Ensemble Result Using the SET C Dataset
SET C | ENS1 | ENS2 | ENS3 | ENS4
Accuracy | 99.26% | 99.26% | 99.26% | 99.26%
Precision | 98.51% | 98.51% | 98.51% | 98.51%
Recall | 99.05% | 99.05% | 99.05% | 99.05%
F Score | 98.76% | 98.76% | 98.76% | 98.76%
FIG. 5.13 Plot of performance metric and ensembles across SET A.
FIG. 5.14 Plot of performance metric and ensembles across SET B.
FIG. 5.15 Plot of performance metric and ensembles across SET C.
Based on the results obtained, all the ensembles performed equally on Set B, and the accuracy obtained there is the best of the three datasets. This suggests that the ensembles perform best when the dataset is equally divided between phishing and non-phishing. Since all the ensembles give the same result on Set B, it can be concluded that any of them can be used. Table 5.18 shows the accuracy obtained from the ensembles with the varying datasets. A plot of the accuracy of the ensembles across the varying datasets is shown in Figure 5.16.
Table 5.18
Accuracy of Individual Ensemble Across Varying Dataset
Dataset | ENS1 | ENS2 | ENS3 | ENS4
SET A | 99.20% | 99.20% | 99.03% | 99.03%
SET B | 99.31% | 99.31% | 99.31% | 99.31%
SET C | 99.26% | 99.26% | 99.26% | 99.26%
From the graph shown in Figure 5.16, it can be seen that in both Set B and Set C the accuracy of the ensembles is the same, whereas in Set A the last two ensembles (ENS3 and ENS4) have lower accuracy than the first two (ENS1 and ENS2). This drop in accuracy is due to the weaker performance of LR and SVM compared with C4.5 and K-NN, as discussed in the previous section. Meanwhile, the results obtained using Set B show the same value for all the ensembles, so any of the ensembles on Set B can be selected as the best performing ensemble. ENS1 is selected; Table 5.19 shows the values obtained from this ensemble, and Table 5.20 shows the false alarm rate of the ensemble classifier.
FIG. 5.16 Plot of accuracy of the ensembles across the varying dataset.
Table 5.19
Selected Ensemble Classifier
SET B | ENS1
Accuracy | 99.31%
Precision | 99.88%
Recall | 98.74%
F Score | 99.31%
Table 5.20
False Alarm Rate of ENS1
False Negative: 1.100 +/− 1.136
| True 1 | True 0 | Class Precision
Pred. 1 | 874 | 11 | 98.76%
Pred. 0 | 1 | 864 | 99.88%
Class recall | 99.89% | 98.74% |
The false alarm rate of ENS1 shown in Table 5.20 indicates that most of the predictions were classified correctly and only a few were classified wrongly. Considering the margin of error given by the false alarm value of 1.10, it can be concluded that this ensemble is very accurate in its classification. The results shown in Table 5.19 are used for comparison in the next phase of this project, which is discussed in the next section. A ROC value of 0.697 is achieved, as shown in Figure 5.17.
FIG. 5.17 ROC of the best performed ensemble classifier.
5.5. Comparative study
This section addresses the third objective of the research framework discussed in Chapter 3. It presents a comparative study between the best performing individual classifier and the best performing ensemble. Table 5.21 shows the results of both the best performing individual algorithm and the best performing ensemble algorithm, and Figure 5.18 plots the two against each other.
Table 5.21
Resulting Best Individual and Ensemble Algorithm
Metrics | K-NN | ENS1
Accuracy | 99.37% | 99.31%
Precision | 99.76% | 99.88%
Recall | 99.35% | 98.74%
F Score | 99.55% | 99.31%
FIG. 5.18 Plot of best individual against best ensemble algorithm.
Figure 5.18 shows the trend of the different performance metrics for the K-NN and ENS1 algorithms. Even though the best individual algorithm performs slightly better in accuracy than the best ensemble algorithm, the precision of ENS1 is higher than that of K-NN, which suggests that an ensemble can be used to improve the performance of individual base algorithms. However, the performance of an ensemble can decrease if it includes component classifiers whose performance is very low compared with the rest. This can be guarded against by checking the accuracy of the base classifiers and by checking that their error rates vary.
5.6. Summary
This chapter presented the implementation and results of Phase 2 and Phase 3 of the research methodology, which investigate the more suitable method for increasing the detection rate and determine the composition of classifiers that gives a good detection rate for phishing websites. The individual classifiers, namely C4.5, LR, K-NN, and SVM, were trained and tested with the reference performance metrics, namely accuracy, precision, recall, and f-measure. The best performing classifier scored 99.37% on accuracy and 0.800 on false alarm rate. The chapter also described the algorithms used in the ensemble design, which are the same as those used in the individual classifier technique. The committee of ensembles was combined through majority voting. The proposed ensemble design scored 99.31% on overall accuracy and 1.10 on false alarm rate. The problem of high false alarm rate has been partially addressed. Results from the validation tests give evidence that the overall performance of the chosen ensemble classifier is almost the same as that of the best individual classifier model. The next chapter provides the conclusion of the study.