A Machine Learning Approach to Phishing Detection and Defense (2015)
Chapter 3. Research Methodology
Abstract
This chapter discusses the research methodology and the classifiers used in the study. First, the introduction presents and describes the classifiers and states the aim of the chapter. Second, an overview of the research framework is given, followed by an operational flow diagram and a breakdown of the framework showing how the processes are linked and which outputs are expected from each stage of the research. The research framework is a diagrammatic illustration of our objectives and can be seen as a blueprint of the study. Third, a descriptive table presents the formulas used to calculate the accuracy of our implementation, that is, the performance metrics. Finally, the dataset and its source are described, followed by a summary of the chapter.
Keywords
false alarm rates
accuracy
voting
ensemble
Phishtank
true positive
true negative
algorithm
3.1. Introduction
Research methodology is a systematic body of work, carried out as a series of steps, that serves as a guideline throughout a research project in order to accomplish its objectives. This study compares an ensemble of classifiers (C5.0, SVM, LR, KNN) with the individual classifiers in website phishing detection. The aim is to investigate the effectiveness of each algorithm in terms of detection accuracy and false alarm rate. This chapter therefore provides a clear guideline on how the research’s goals and objectives will be achieved. It also discusses the dataset used in this study.
3.2. Research framework
The research framework sets out the steps taken throughout the research. It serves as a guide that keeps researchers focused on the scope of their study. Figure 3.1 shows the operational framework followed in this study.
FIG. 3.1 Research framework.
Overview of Research Framework
The study is divided into three phases, and the output of each phase is the input to the next. Phase 1 covers dataset processing and feature extraction. Phase 2 evaluates the individual reference classifiers, which involves training and testing, using precision, recall, accuracy, and F1-score. Phase 3a evaluates the ensemble of the classifiers using the same four metrics. Phase 3b compares the results of the two techniques (individual and ensemble) to identify the better technique for phishing website detection based on those metrics. These phases are depicted in Figure 3.1.
3.3. Research design
The research will be conducted through three main phases. The following subsections will describe each phase briefly.
3.3.1. Phase 1: Dataset Processing and Feature Extraction
The collected datasets were processed to refine them to the requirements of the study. Processing involves several stages, including feature extraction, normalization, dataset division, and attribute weighting. These stages are necessary to ensure that the classifiers can interpret the dataset and properly classify its instances into the reference classes. The output of this phase is passed directly to Phase 2 for evaluating the reference classifiers.
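Two of the processing stages named above, normalization and dataset division, can be illustrated with a minimal sketch. The chapter does not state which normalization scheme or split ratio was used, so min-max scaling, the 75/25 split, the seeded shuffle, and the function names below are all assumptions made for illustration only.

```python
import random


def min_max_normalize(column):
    """Rescale a list of numeric feature values to the range [0, 1].

    Min-max scaling is an assumed choice; the chapter does not name one.
    """
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no information; map it to 0.0.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


def split_dataset(rows, train_fraction=0.75, seed=0):
    """Shuffle the rows reproducibly and divide them into train and test parts.

    The 75/25 ratio and the fixed seed are assumptions for this sketch.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Made-up values standing in for one extracted feature (e.g. URL length).
url_lengths = [2, 4, 6]
scaled = min_max_normalize(url_lengths)

# Made-up instance identifiers standing in for processed dataset rows.
train_rows, test_rows = split_dataset(list(range(8)))
```

Shuffling before the split matters when the phishing and non-phishing instances are stored one after the other, since an unshuffled cut could leave one class out of the test set.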
3.3.2. Phase 2: Evaluation of Individual Classifier
Evaluation of classifiers is required in this research to measure the performance achieved by a learning algorithm. To do this, a test set consisting of data with known labels is used. Each classifier is trained on a training set and applied to the test set, and its performance is then measured by comparing the predicted labels with the true labels (which were not available to the training algorithm) (Elkan, 2008). The classifiers are therefore evaluated by training and testing with the dataset obtained from Phase 1 using the following performance metrics: precision, recall, F1-score, and accuracy. The formulas used are shown in Table 3.1.
Table 3.1
Formula Used to Calculate the Performance
Performance Measure | Description | Formula
Percentage (%) classification | |
Accuracy | Accuracy is the overall correctness of the model, calculated as the number of correct classifications divided by the total number of classifications. | (TP + TN) / (TP + TN + FP + FN)
Precision | Precision is the accuracy of the model given that a specific class has been predicted. | TP / (TP + FP)
Recall/true positive rate (TPR)/detection rate (DR) | The frequency with which the classifier correctly detects patterns of the positive class. | TP / (TP + FN)
F1 score | F1 score (also F-score or F-measure) is a measure of a test’s accuracy. It is the harmonic mean of precision and recall, reaching its best value at 1 and its worst at 0. | 2 × Precision × Recall / (Precision + Recall)
Error percentage (%) | |
False positive rate (FPR), also known as false alarm rate (FAR) | The proportion of normal patterns wrongly classified as malicious patterns. | FP / (FP + TN)
False negative rate (FNR) | The proportion of malicious patterns mistakenly classified as normal patterns. | FN / (FN + TP)
Elkan, 2008.
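The measures in Table 3.1 can be computed directly from the four confusion-matrix counts. The following is a minimal sketch using the standard definitions of these measures in terms of TP, TN, FP, and FN; the function name and the example counts are illustrative assumptions, not results from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the Table 3.1 measures from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Return 0.0 rather than raising when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(fp, fp + tn),  # false alarm rate
        "fnr": ratio(fn, fn + tp),
    }


# Illustrative counts only; they are not taken from the study.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

With these made-up counts the accuracy is 170/200 = 0.85 and the false alarm rate is 20/100 = 0.2.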
3.3.2.1. Classification Background
In order to properly understand the classification notations used in Table 3.1, a brief explanation of the notations will be discussed in this section with the aid of Table 3.2 that shows the relationship between the actual class and the expected class.
Table 3.2
Classification Context
 | Actual class (observation) |
Expected class (expectation) | TP (True positive): correct result | FP (False positive): unexpected result
 | FN (False negative): missing result | TN (True negative): correct absence of result
Based on the notation in Table 3.2:
1. Let TP represent the number of phishing websites correctly classified as phishing websites.
2. Let TN represent the number of legitimate websites correctly classified as legitimate websites.
3. Let FP represent the number of legitimate websites classified as phishing websites.
4. Let FN represent the number of websites classified as legitimate websites when they were actually phishing websites.
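The four counts can be tallied by comparing the actual and predicted labels instance by instance. The sketch below is a minimal illustration; the function name, the label strings, and the choice of phishing as the positive class in the example are assumptions, and the positive class is passed in as a parameter so that either convention can be used.

```python
def confusion_counts(actual, predicted, positive):
    """Tally (TP, TN, FP, FN) for the given positive class label."""
    tp = tn = fp = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            if truth == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if truth == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


# Made-up labels for five websites; not data from the study.
actual = ["phishing", "phishing", "legitimate", "legitimate", "phishing"]
predicted = ["phishing", "legitimate", "legitimate", "phishing", "phishing"]

# Assumption for this example: phishing is treated as the positive class.
counts = confusion_counts(actual, predicted, positive="phishing")
```

For these labels the tally is two true positives and one each of true negative, false positive, and false negative.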
3.3.2.2. Classifier Performance
In this section, the process of measuring the performance of each classifier is discussed. All classifiers are evaluated on the metrics (precision, recall, F1 score, and accuracy) already presented in Table 3.1. Each classifier is also introduced briefly in this section.
3.3.2.2.1. C5.0 Algorithm
C5.0 is a decision tree algorithm that uses entropy to measure the disorder in a collection of instances and information gain to measure the effectiveness of an attribute. The operation of C5.0 on the dataset can be summarized in two steps:
1. Calculating the entropy value of the data using the equation below:
$$E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \qquad (3.1)$$
where E(S) is the entropy of a collection S of instances, c is the number of classes in the system, and p_i is the proportion of instances that belong to class i.
2. Calculating the information gain for an attribute C in a collection S, where E(S) is the entropy of the whole collection and S_w is the set of instances that have value w for attribute C:
$$\mathrm{Gain}(S, C) = E(S) - \sum_{w \in \mathrm{Values}(C)} \frac{|S_w|}{|S|} E(S_w) \qquad (3.2)$$
Figure 3.2 shows the structure of a decision tree that partitions the dataset on the given attribute in order to calculate the information gain.
FIG. 3.2 Decision tree structure.
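The entropy and information gain calculations described for C5.0 can be sketched in a few lines. This is a minimal illustration of those two quantities only, not an implementation of the full C5.0 algorithm; the function names, the attribute names `has_ip` and `long_url`, and the four example rows are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a collection of class labels, in bits."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, labels, attribute):
    """Entropy of the whole collection minus the weighted entropy
    of the subsets obtained by partitioning on one attribute."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        # Group the labels by the value this row has for the attribute.
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(
        (len(subset) / total) * entropy(subset) for subset in subsets.values()
    )
    return entropy(labels) - remainder


# Hypothetical binary features for four websites; not data from the study.
rows = [
    {"has_ip": 1, "long_url": 1},
    {"has_ip": 1, "long_url": 0},
    {"has_ip": 0, "long_url": 1},
    {"has_ip": 0, "long_url": 0},
]
labels = ["phishing", "phishing", "legitimate", "legitimate"]

gain_has_ip = information_gain(rows, labels, "has_ip")
gain_long_url = information_gain(rows, labels, "long_url")
```

In this toy data `has_ip` separates the two classes perfectly, so its gain equals the full entropy of 1 bit, while `long_url` leaves each subset evenly mixed and has a gain of 0. A decision tree would therefore split on `has_ip` first.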
3.3.2.2.2. K-Nearest Neighbour
KNN classifies instances using the Euclidean distance. It is based on the premise that every instance in the dataset can be represented as a point in N-dimensional space. KNN uses a value K to set the number of nearest instances to consult; the majority class among those instances is assigned to the new instance. Figure 3.3 shows the structure of the K-Nearest Neighbour algorithm. Equation (3.3) shows the distance formula used in the algorithm, where x and y are two instances and x_i and y_i are their values for the i-th attribute:
$$d(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2} \qquad (3.3)$$
FIG. 3.3 KNN structure.
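The KNN procedure described above, Euclidean distance followed by a majority vote among the K nearest instances, can be sketched as follows. The function names, the two-dimensional points, and the default K = 3 are assumptions chosen for illustration.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Assign the query the majority class of its k nearest training points."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean_distance(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up two-dimensional instances; not data from the study.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["legitimate"] * 3 + ["phishing"] * 3

near_origin = knn_predict(train_points, train_labels, (0.5, 0.5))
near_far_cluster = knn_predict(train_points, train_labels, (5.5, 5.5))
```

An odd K avoids tied votes when there are only two classes.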
3.3.2.2.3. Support Vector Machine (SVM)
SVM is primarily suited to binary classification. It is based on a principle similar to KNN in that it represents the training set as points in an N-dimensional space; it then attempts to construct a hyperplane that divides the space between the class labels with the widest possible margin. Figure 3.4 shows the structure of the Support Vector Machine.
FIG. 3.4 SVM structure.
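Once a separating hyperplane is available, classification reduces to checking on which side of it a point falls. The sketch below illustrates only that decision step for a linear SVM; the weights and bias are hypothetical values rather than the result of training, and the function names are assumptions.

```python
import math


def decision_value(weights, bias, point):
    """Signed score w . x + b; its sign gives the side of the hyperplane."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def svm_classify(weights, bias, point):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(weights, bias, point) >= 0 else -1


def margin_width(weights):
    """Distance between the two margin boundaries (w . x + b = +1 and -1)."""
    return 2.0 / math.sqrt(sum(w * w for w in weights))


# Hypothetical hyperplane x1 + x2 - 3 = 0; not learned from any data.
weights, bias = (1.0, 1.0), -3.0

side_a = svm_classify(weights, bias, (1.0, 1.0))  # score -1, on the margin
side_b = svm_classify(weights, bias, (2.0, 2.0))  # score +1, on the margin
width = margin_width(weights)
```

Training an SVM amounts to choosing the weights and bias that make this margin as wide as possible while keeping the training points on the correct sides.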
3.3.2.2.4. Linear Regression
Linear regression uses a formula to generate a real-valued output from the attributes. A discrete class prediction is then obtained by setting a threshold T on the predicted real value. Equation (3.4) shows the formula used by linear regression:
(3.4)
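The idea of thresholding a real-valued prediction can be illustrated with a single-feature least-squares fit. This is a minimal sketch under stated assumptions: it uses the ordinary closed-form solution for one attribute, a threshold T = 0.5, and made-up data, none of which are specified in the chapter.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares intercept and slope for a single attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_class(intercept, slope, x, threshold=0.5):
    """Turn the real-valued prediction into class 1 or 0 using threshold T."""
    return 1 if intercept + slope * x >= threshold else 0


# Made-up attribute values with 0/1 class labels; not data from the study.
xs = [0, 1, 2, 3]
ys = [0, 0, 1, 1]

intercept, slope = fit_simple_linear_regression(xs, ys)
predicted = [predict_class(intercept, slope, x) for x in xs]
```

For this toy data the fitted line has slope 0.4 and intercept -0.1, so the real-valued outputs are roughly -0.1, 0.3, 0.7, and 1.1, and the threshold maps them back to the original labels.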
3.3.3. Phase 3a: Evaluation of Classifier Ensemble
Classifier ensembles were proposed to improve on the classification performance of a single classifier (Kittler et al., 1998). The classifiers trained and tested in Phase 2 are used in this phase to determine the ensemble design. Phase 3 is divided into two subphases, Phase 3a and Phase 3b.
Simple majority voting is used to combine the classifiers into an ensemble and determine detection accuracy. This is an iterative phase in which a threshold (the acceptable detection accuracy) is set and checked against the evaluation results until an optimum result is achieved. Equation (3.5) shows the formula for calculating detection accuracy.
(3.5)
where a + b + c = 1 and a, b, and c are variables in the range [0, 1].
Phase 3a is divided into two parts, namely design and decision. In the design part, four algorithms are considered for the ensemble, and a committee of three algorithms is used to form each ensemble, since majority voting requires an odd number of participants. On the basis of the output of Phase 2, all the individual algorithms are evaluated with the same metrics used in Phase 2 and then voted on. The decision part of Phase 3a relies on the output of the design part to decide which ensemble performs best; that ensemble is then passed to Phase 3b for comparison with the best of the four algorithms evaluated in Phase 2.
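The design and decision parts can be sketched as enumerating every committee of three out of the four classifiers, combining each committee by simple majority vote, and keeping the committee with the highest accuracy. The predictions below are made up for illustration, the function names are assumptions, and the sketch does not implement Equation (3.5).

```python
from collections import Counter
from itertools import combinations


def majority_vote(votes):
    """Return the label predicted by most committee members."""
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(member_outputs):
    """Combine per-classifier prediction lists instance by instance."""
    return [majority_vote(votes) for votes in zip(*member_outputs)]


def accuracy(actual, predicted):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Made-up test labels and classifier outputs; not results from the study.
actual = [1, 0, 1, 1, 0]
outputs = {
    "C5.0": [1, 0, 1, 0, 0],
    "KNN": [1, 1, 1, 1, 0],
    "SVM": [0, 0, 1, 1, 0],
    "LR": [1, 1, 0, 0, 1],
}

# Design: every committee of three drawn from the four classifiers.
committees = list(combinations(outputs, 3))
scores = {
    committee: accuracy(actual, ensemble_predict([outputs[n] for n in committee]))
    for committee in committees
}

# Decision: keep the committee with the highest accuracy.
best_committee = max(scores, key=scores.get)
```

Four classifiers yield four possible committees of three. In this toy data the errors of C5.0, KNN, and SVM fall on different instances, so their majority vote corrects all of them, which is the effect an ensemble is meant to exploit.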
3.3.4. Phase 3b: Comparison of Individual versus Ensemble Technique
In this phase, the two techniques discussed in Phase 2 and Phase 3a are compared. The results obtained from the previous phases are used as input and are compared using tables and graphs.
3.4. Dataset
The dataset is divided into two parts, namely the phishing and non-phishing datasets. The phishing dataset was collected from Phishtank, whereas the non-phishing dataset was collected manually using the Google search engine. The dataset from Phishtank is discussed below:
3.4.1. Phishtank
Phishtank is a repository of phishing websites that is available to users for free (open source). Because Phishtank is public and suspected phishing websites are submitted frequently, its database is updated hourly. A total of 7,612 phishing websites, accumulated over a period of 4 years since January 2008, were obtained from it. However, as several researchers have observed, most phishing websites are only in use temporarily, and some of these websites were reported offline. A filtration process was therefore needed to ensure the freshness of the dataset. After filtration, 3,611 phishing websites were confirmed online. This fresh dataset is used as the phishing website dataset for this study. Table 3.3 shows the statistics collected from Phishtank at the time of data collection.
Table 3.3
Phishtank Statistics
Online, valid phishes | Total submissions | Total votes
13,054 | 1,615,087 | 6,362,861

 | Phishes verified as valid | Suspected phishes submitted
Total | 988,254 | 1,615,088
Online | 13,054 | 13,463
Offline | 975,200 | 1,599,398
3.5. Summary
This chapter presented the methodology used in the study. Section 3.3.1 described the dataset processing and feature extraction, and Section 3.3.2 described how the individual classifiers are evaluated using different metrics (precision, accuracy, recall, and F1-score). Section 3.3.3 described how the ensemble design is used to choose the best ensemble classifier through a decision-making process. Section 3.3.4 described the comparison process for selecting the better of the two techniques for website phishing detection. Finally, Section 3.4 described the dataset used in this study.