Abstract - A Machine Learning Approach to Phishing Detection and Defense (2015)

Abstract

Phishing is a kind of cyber-attack in which perpetrators use spoofed emails and fraudulent web sites to lure unsuspecting online users into giving up personal information. This project looks at the phishing problem holistically by examining existing research and the countermeasures it proposes, and by investigating how to increase detection. It comprises three studies. The first study focused on dataset gathering, pre-processing, feature extraction and dataset division in order to make the dataset suitable for the classification process. The second study focused on evaluating a set of classifiers (C4.5, SVM, K-NN and LR) using the accuracy, precision, recall and f-measure metrics. The output of this individual classifier study was used to choose the best-performing individual classifier. The final study is divided into two parts. The first part focused on increasing the detection rate for phishing websites by choosing a suitable design for the classifier ensemble and selecting the best ensemble classifier, which was then compared with the best individual classifier. The second part focused on choosing the better of the two methods, the individual classifier or the ensemble. The results show that the individual classifier method performed better, with an accuracy of 99.37%, while the chosen ensemble had an accuracy of 99.31%. This result can be attributed to the small size of the dataset used, as past research has shown that K-NN performs better as the dataset size decreases, while classifiers like SVM and C4.5 perform better as the dataset size increases.
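The four metrics named above are conventionally derived from confusion-matrix counts (TP, FP, TN, FN). The following is a minimal sketch of those standard definitions, not the evaluation code used in the study. The class name `ConfusionCounts`, the function `evaluate`, and the example counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical container for binary confusion-matrix counts."""
    tp: int  # phishing sites correctly flagged as phishing
    fp: int  # legitimate sites wrongly flagged as phishing
    tn: int  # legitimate sites correctly passed
    fn: int  # phishing sites that were missed


def evaluate(c: ConfusionCounts) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F-measure (standard definitions)."""
    total = c.tp + c.fp + c.tn + c.fn
    accuracy = (c.tp + c.tn) / total if total else 0.0
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    # F-measure is the harmonic mean of precision and recall.
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


if __name__ == "__main__":
    # Illustrative counts only; these are NOT results from the study.
    example = ConfusionCounts(tp=90, fp=10, tn=880, fn=20)
    for name, value in evaluate(example).items():
        print(f"{name}: {value:.4f}")
```

Reporting precision and recall alongside accuracy matters for phishing data because a classifier can reach high accuracy on an imbalanced dataset while still missing many phishing sites.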

List of Abbreviations

ANN     Artificial Neural Network
APWG    Anti-Phishing Working Group
BART    Bayesian Additive Regression Trees
C4.5    Decision Tree
CA      Certificate Authority
DNS     Domain Name System
DR      Detection Rate
ENS     Ensemble
FAR     False Alarm Rate
FP      False Positive
FN      False Negative
FNR     False Negative Rate
FPR     False Positive Rate
HTML    Hypertext Markup Language
HTTP    Hypertext Transfer Protocol
HTTPS   Hypertext Transfer Protocol Secure
IP      Internet Protocol
K-NN    K-Nearest Neighbor
LR      Linear Regression
MLP     Multilayer Perceptron
NB      Naïve Bayesian
Pred.   Prediction
ROC     Receiver Operating Characteristic
SQL     Structured Query Language
SSL     Secure Sockets Layer
SVM     Support Vector Machine
TP      True Positive
TPR     True Positive Rate
TN      True Negative
TTL     Time to Live
URL     Uniform Resource Locator
URI     Uniform Resource Identifier