A Machine Learning Approach to Phishing Detection and Defense (2015)
Abstract
Phishing is a kind of cyber-attack in which perpetrators use spoofed emails and fraudulent websites to lure unsuspecting online users into giving up personal information. This project looks at the phishing problem holistically by examining existing research and its countermeasures, and by investigating how detection can be improved. It comprises three studies. The first study focuses on dataset gathering, pre-processing, feature extraction and dataset division, in order to make the dataset suitable for the classification process. The second study evaluates a set of classifiers (C4.5, SVM, K-NN and LR) using the accuracy, precision, recall and f-measure metrics; its output is used to choose the best-performing individual classifier. The final study is divided into two parts. The first part focuses on increasing the detection rate for phishing websites by choosing a suitable design for a classifier ensemble and selecting the best ensemble classifier, which is then compared with the best individual classifier. The second part selects the better of the two approaches. The results show that the individual classifier method performed better, with an accuracy of 99.37%, while the chosen ensemble had an accuracy of 99.31%. This result can be attributed to the small size of the dataset used: prior research has shown that K-NN performs better as dataset size decreases, whereas classifiers such as SVM and C4.5 perform better as dataset size increases.
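The four evaluation metrics named above can all be derived from the counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). The following is a minimal sketch of those standard definitions, not the project's own evaluation code. The names `ConfusionCounts` and `evaluate` are hypothetical, the counts in the usage example are invented purely for illustration, and "positive" is assumed to mean "phishing".

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Confusion-matrix counts, assuming 'positive' means 'phishing'."""
    tp: int  # phishing sites correctly flagged as phishing
    fp: int  # legitimate sites wrongly flagged as phishing
    tn: int  # legitimate sites correctly passed
    fn: int  # phishing sites wrongly passed as legitimate


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(c: ConfusionCounts) -> dict:
    """Compute accuracy, precision, recall and f-measure from the counts."""
    total = c.tp + c.fp + c.tn + c.fn
    accuracy = _ratio(c.tp + c.tn, total)
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)  # also called TPR or detection rate
    # F-measure is the harmonic mean of precision and recall.
    f_measure = _ratio(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


if __name__ == "__main__":
    # Illustrative counts only; these are not results from the study.
    example = ConfusionCounts(tp=90, fp=10, tn=880, fn=20)
    for name, value in evaluate(example).items():
        print(f"{name}: {value:.4f}")
```

Reporting precision and recall alongside accuracy matters for phishing data because accuracy alone can look high when one class dominates the dataset.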
List of Abbreviations
ANN
Artificial Neural Network
APWG
Anti-Phishing Working Group
BART
Bayesian Additive Regression Trees
C4.5
Decision Tree
CA
Certificate Authority
DNS
Domain Name System
DR
Detection Rate
ENS
Ensemble
FAR
False Alarm Rate
FP
False Positive
FN
False Negative
FNR
False Negative Rate
FPR
False Positive Rate
HTML
Hypertext Markup Language
HTTP
Hypertext Transfer Protocol
HTTPS
Hypertext Transfer Protocol Secure
IP
Internet Protocol
K-NN
K-Nearest Neighbor
LR
Linear Regression
MLP
Multilayer Perceptron
NB
Naïve Bayesian
Pred.
Prediction
ROC
Receiver Operating Characteristic
SQL
Structured Query Language
SSL
Secure Sockets Layer
SVM
Support Vector Machine
TP
True Positive
TPR
True Positive Rate
TN
True Negative
TTL
Time to Live
URL
Uniform Resource Locator
URI
Uniform Resource Identifier