A Machine Learning Approach to Phishing Detection and Defense (2015)

Chapter 1. Introduction


This chapter discusses some of the security threats facing online users and then focuses on phishing: its detection, existing solutions, and our proposed work. The chapter is organized as follows. First, we give a general introduction to website phishing and to the approaches that previous researchers have used to mitigate the spread of this online threat. Second, the problem background section explains how phishing became a problem and how it grew into a threat serious enough to become a research area; it also discusses how the threat affects e-commerce and the different approaches that phishers (the perpetrators) have invented over the years to carry out website phishing. Third, we explain the problem statement and the research questions addressed in this study. Fourth, we discuss the purpose of the study, giving an overview of the methodology used and our expected result. Finally, we list the objectives that serve as a roadmap for the study, present the significance of the study by explaining why our research is important, and conclude with the organization of the report to prepare the reader for the remaining content of the book.








1.1. Introduction

Cybercrime refers to crimes that target a computer or network, where the computer may or may not have been fully instrumental in the commission of the crime (Martin et al., 2011). Computer crime covers a broad range of potentially criminal activities, but it can be divided into two categories (Martin et al., 2011):

1. Crimes that directly target computers, networks, or devices, and

2. Crimes aided by computers, networks, or devices, whose main target is not the computer, network, or device itself.

Some examples of cybercrimes include spam, cyber terrorism, fraud, and phishing.

Phishing is a form of online identity theft in which an attacker uses fraudulent e-mails and bogus websites to trick gullible customers into disclosing confidential information such as bank account information, website login information, and so forth (Topkara et al., 2005). It is a typical example of fraud carried out through online electronic communication: an internet scam in which an attacker uses an email or website to illegally obtain private information (Martin et al., 2011). It is a semantic attack, which aims at harming the user rather than the computer. In general, phishing is a relatively new internet crime. The ease of cloning a legitimate bank website to convince unsuspecting users has made phishing difficult to curtail. Typically, an email containing a link to a redirecting website is sent to the user, asking the user to update confidential information such as credit card details, website login information, and bank account information belonging to the legitimate owner. As explained by Aburrous et al. (2008), phishing websites are complex to understand and analyze because they involve both technical and social problems. The main effect of a phishing website is the abuse of information through the compromise of user data, which may harm victims through the loss of money or valuables. Compared with other internet threats such as hacking and viruses, phishing is a fast-growing internet crime. Because the internet is widely used as a major form of communication, phishing can be carried out in different ways, such as the following (Alnajim and Munro, 2009):

1. Email-to-email: someone receives an email requesting that sensitive information be sent to the sender.

2. Email-to-website: someone receives an email in which a phishing web address is embedded.

3. Website-to-website: someone reaches a phishing website by clicking a link in a search engine result or an online advert.

4. Browser-to-website: someone misspells a legitimate web address in a browser and is then taken to a phishing website whose address has a semantic similarity to the legitimate web address.
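To make the browser-to-website case concrete, the sketch below flags a web address that is close to, but not identical to, a legitimate one. It is only an illustration: it assumes plain edit (Levenshtein) distance as the similarity measure, and the function names, the threshold of 2, and the example domains are hypothetical and do not come from the cited studies.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def looks_like(candidate: str, legitimate: str, max_distance: int = 2) -> bool:
    """True when candidate differs from legitimate by 1..max_distance edits.

    The threshold is an assumption chosen for illustration only.
    """
    distance = edit_distance(candidate.lower(), legitimate.lower())
    return 0 < distance <= max_distance


# Hypothetical domains: "rn" imitates "m" in the look-alike address.
print(looks_like("exarnplebank.com", "examplebank.com"))
```

An exact match returns False because a distance of zero means the user typed the legitimate address correctly.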

Different types of anti-phishing measures are being used to prevent phishing. For example, the Anti-Phishing Working Group is an industry group that compiles phishing reports from different online incident resources and makes them available to its paying members (RSA, 2006). Meanwhile, anti-phishing measures have been implemented as extensions or toolbars for browsers, as features embedded in browsers, and as part of the website login operation. Many of these toolbars have been used in the detection of phishing. SpoofGuard (Chou et al., 2004), as described by Garera et al. (2007), warns users about phishing websites. This tool uses the URL, images, domain name, and links to evaluate the likelihood that a page is a spoof.

Lucent Personalized Web Assistant (LPWA) is a tool that guards against identity theft by protecting the user’s personal information (Gabber et al., 1999; Kristol et al., 1998). It uses a function to define user variables such as email address, username, and password for each server visited by the user. Ross et al. (2005) proposed a similar approach in PwdHash.
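The idea of deriving a different credential for each server from a single secret can be sketched as follows. This is not the actual LPWA or PwdHash construction: it assumes HMAC-SHA256 as the deriving function, and `site_credential`, its parameters, and the example domains are hypothetical names used only for illustration.

```python
import hashlib
import hmac


def site_credential(master_secret: str, domain: str, label: str,
                    length: int = 16) -> str:
    """Derive a per-site value (for example a password) from one secret.

    Assumption: HMAC-SHA256 stands in for the tool-specific function.
    The same inputs always give the same output, while a different domain
    gives an unrelated output, so a value typed on a spoofed domain is of
    no use on the legitimate one.
    """
    message = f"{label}:{domain.lower()}".encode("utf-8")
    digest = hmac.new(master_secret.encode("utf-8"), message,
                      hashlib.sha256).hexdigest()
    return digest[:length]


# Hypothetical usage: the look-alike domain yields a different password.
real = site_credential("correct horse", "examplebank.com", "password")
fake = site_credential("correct horse", "exarnplebank.com", "password")
print(real != fake)
```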

Dhamija and Tygar (2005b) proposed Dynamic Security Skins, another browser-based anti-phishing tool. This solution builds on their previous work on Human Interactive Proofs (Dhamija and Tygar, 2005a), which relies on humans distinguishing legitimate websites from spoofed ones. Dynamic Security Skins lets a remote server prove its identity in a way that is easy for humans to verify but hard for attackers to spoof (Dhamija and Tygar, 2005b). The tool uses a client-side password entered in the browser window together with the Secure Remote Password (SRP) protocol, a verifier-based authentication protocol. In addition, an image that is shared as a secret between the browser and the user provides better security against spoofing. This secret image is either chosen by the user or assigned at random; during each transaction, the image is regenerated by the server and used to create the browser skin. To verify the server, the user has to visually check the authenticity of the image. In exceptional cases, when the user logs in from an untrusted computer, the tool cannot guarantee security. Furthermore, it does not guard against malware, and it trusts the browser’s security during SRP authentication.

Herzberg and Gbara (2004) introduced TrustBar, a third-party certification solution against phishing. The authors proposed creating a Trusted Credentials Area (TCA). The TCA controls a significant area at the top of every browser window, large enough to contain highly visible logos and other graphical icons for credentials identifying a legitimate page. Although their solution does not rely on complex security factors, it does not protect against spoofing attacks. Specifically, since the logos of websites do not change, an attacker can use them to create a look-alike TCA in an untrusted web page.

Because new phishing websites spring up every day, it is becoming increasingly difficult to track and block them, as attackers continually come up with innovative methods to entice unsuspecting users into divulging their personal information (Garera et al., 2007).

1.2. Problem background

As a new type of cyber security threat, phishing websites have appeared frequently in recent years and have caused great harm to online financial services and data security (Zhuang et al., 2012). It has been suggested that vulnerabilities in web servers account for the growth of most phishing websites: phishers exploit a weakness in a web server to host a counterfeit website without the knowledge of the owner. It is also possible for the phisher to set up a new web server, independent of any legitimate web server, for phishing activities. Zhang et al. (2012) claimed that the methods used to carry out phishing can differ across regions. They further deduced that phishers in America and China take different approaches, which they categorized by region:

1. The Chinese phishers prefer to register a new domain to deploy the phishing website.

2. The American phishers would rather deploy the phishing website using a hacked website.

Most researchers have worked on increasing the accuracy of website phishing detection through multiple techniques. Several classifiers, such as Linear Regression, K-Nearest Neighbor, C5.0, Naïve Bayes, Support Vector Machine (SVM), and Artificial Neural Network, among others, have been trained on datasets to identify phishing websites. These classifiers fall into two groups of techniques: probabilistic and machine learning. Using these algorithms, different researchers have solved several problems in phishing website detection. Some of these algorithms were evaluated using four metrics: precision, recall, F1-score, and accuracy.
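For reference, the four metrics can be computed from the counts of a binary confusion matrix, as in the sketch below. It assumes that "phishing" is the positive class; the function name and the example counts are hypothetical and are not results from any cited study.

```python
def evaluation_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, recall, F1-score, and accuracy from confusion-matrix counts.

    tp: phishing sites flagged as phishing
    fp: legitimate sites flagged as phishing (false alarms)
    tn: legitimate sites passed as legitimate
    fn: phishing sites passed as legitimate (misses)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Made-up counts, for illustration only.
print(evaluation_metrics(tp=40, fp=10, tn=45, fn=5))
```

Precision penalizes false alarms, recall penalizes missed phishing sites, and the F1-score is their harmonic mean, so a classifier cannot score well on F1 by optimizing only one of the two.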

Some studies have applied K-Nearest Neighbor (KNN) to phishing website classification. The KNN classifier is a nonparametric classification algorithm. One characteristic of this classifier is that it generalizes only when it is required to classify an instance. This ensures that no information is lost, as can happen with eager learning techniques (Toolan and Carthy, 2009). In addition, previous research has shown that KNN can achieve accurate results, sometimes more accurate than those of symbolic classifiers. A study carried out by Kim and Huh (2011) showed that the KNN classifier achieved a 99% detection rate. This result was better than those obtained from LDA, Naïve Bayes (NB), and Support Vector Machine (SVM). Also, since the performance of KNN is primarily determined by the choice of K, they searched for the best K by varying it from 1 to 5 and found that KNN performs best when K = 1. This also contributed to the high accuracy of KNN compared with the other classifiers.
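The way KNN defers all work to classification time, and the role of K, can be seen in a minimal sketch. It assumes Euclidean distance over numeric feature vectors and a simple majority among the K nearest training instances; the toy data and names are hypothetical and do not reproduce the setup of Kim and Huh (2011).

```python
import math
from collections import Counter


def knn_predict(training, query, k=1):
    """Label a query point by majority vote of its k nearest training points.

    training: list of (feature_vector, label) pairs; nothing is fitted in
    advance, the stored instances are consulted only when a query arrives.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training set.
training = [
    ((0.0, 0.0), "legitimate"), ((0.1, 0.2), "legitimate"),
    ((0.2, 0.1), "legitimate"), ((1.0, 1.0), "phishing"),
    ((0.9, 0.8), "phishing"), ((0.8, 0.9), "phishing"),
]

# Vary K from 1 to 5, mirroring the range explored in the text.
for k in range(1, 6):
    print(k, knn_predict(training, (0.85, 0.9), k))
```

With K = 1 the prediction depends on a single nearest instance, which fits the training data closely but can be sensitive to noisy labels; larger K smooths the decision at the cost of blurring class boundaries.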

Meanwhile, the Artificial Neural Network (ANN) is another popular machine learning technique. It consists of a collection of highly interconnected processing elements that transform a set of inputs into a set of desired outputs. Its major disadvantage is the time required for parameter selection and network learning. On the other hand, previous research has shown that ANN can achieve very accurate results compared with other learning classification techniques. Research carried out by Basnet et al. (2008) showed that an Artificial Neural Network achieved an accuracy of 97.99%.

Related research (Aburrous et al., 2008; Alnajim and Munro, 2009; Kim and Huh, 2011; Miyamoto et al., 2007; Topkara et al., 2005; Zhang et al., 2012) has achieved good detection accuracy using different learning algorithms, but website phishing detection remains very much open for research, because phishing websites are deployed, and the methods behind them change, faster than researchers can propose solutions. Most of the recent studies were conducted on small experimental data sets, so the robustness and effectiveness of these algorithms on real large-scale data sets cannot be guaranteed. Furthermore, because the number of phishing sites grows very fast, the question of how to identify phishing websites among the mass of legitimate websites in real time must also be addressed. As proposed by Miyamoto et al. (2007), developing intelligent anti-phishing algorithms is paramount in order to deal with the ever-increasing phishing attacks. Fundamentally, the detection algorithms are grouped into two distinct methods: URL filtering and URL whitelist-based detection (Miyamoto et al., 2005).

The variation in the performance of the different algorithms used in website phishing detection has led to the use of ensembles. A classifier ensemble is a method of combining various classifiers to improve on the predictive performance of the individual component algorithms (Rokach, 2010). Toolan and Carthy (2009) presented an approach to separating phishing emails from nonphishing emails that combines an algorithm known to achieve very high precision with an ensemble of other classifiers (K-Nearest Neighbor, Support Vector Machine, Naïve Bayes, and Linear Regression) that achieve high recall. Their experiment obtained a success rate of more than 99%. The work of Toolan and Carthy will therefore be used as a baseline for this research, since their success rate using the ensemble method in email phishing detection is very impressive. It suggests that the performance of website phishing detection can also be improved using an ensemble method.
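One simple way to combine classifiers is majority voting, sketched below. This is only one possible combination rule and is not claimed to be the scheme of Toolan and Carthy (2009); the three rule-based "classifiers", their thresholds, and the feature names are hypothetical stand-ins for trained models.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(classifiers, sample):
    """Ask every component classifier for a label, then take the majority."""
    return majority_vote([classify(sample) for classify in classifiers])


# Hypothetical component classifiers, each a function from sample to label.
def ip_rule(sample):
    return "phishing" if sample["host_is_ip"] else "legitimate"


def https_rule(sample):
    return "legitimate" if sample["uses_https"] else "phishing"


def length_rule(sample):
    return "phishing" if sample["url_length"] > 75 else "legitimate"


sample = {"host_is_ip": True, "uses_https": True, "url_length": 90}
print(ensemble_predict([ip_rule, https_rule, length_rule], sample))
```

Using an odd number of components avoids ties in the two-class case; the ensemble is wrong only when a majority of its members are wrong on the same sample, which is why components with different error patterns are preferred.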

1.3. Problem statement

Phishing detection techniques suffer from low detection accuracy and a high false alarm rate, especially when novel phishing approaches are introduced. Besides, the most common technique, the blacklist-based method, is inefficient in responding to emerging phishing attacks: since registering a new domain has become easier, no comprehensive blacklist can guarantee a perfectly up-to-date database. Some strategies therefore use page content inspection to overcome the false negative problem and to compensate for the weaknesses of stale lists. However, page content inspection algorithms each take a different approach to phishing website detection, with varying degrees of accuracy. An ensemble can therefore be seen as a better solution, since it can combine selected algorithms that have similar accuracy but different error-detection properties. This study will accordingly address the following research questions:

1. How can the raw dataset be processed for phishing detection?

2. How can the detection rate of phishing website detection algorithms be increased?

3. How can the false negative rate of phishing website detection algorithms be reduced?

4. What is the best composition of classifiers for achieving a good detection rate on phishing websites?

1.4. Purpose of study

In this research, the performance of individual classifiers, as well as of an ensemble of classifiers that uses different learning paradigms and a voting scheme, will be compared in terms of detection accuracy and false negative rate. At the end of this comparison, the algorithm that shows the better performance, that is, high detection accuracy and a low false negative rate, will be highlighted.
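The comparison described above can be sketched as follows. The false negative rate is the fraction of phishing sites that a model lets through. How the two criteria are traded off is an assumption made here for illustration: models are ranked by accuracy first, with the lower false negative rate breaking ties. The model names and figures are invented and are not results of this study.

```python
def false_negative_rate(fn: int, tp: int) -> float:
    """Share of actual phishing sites that were classified as legitimate."""
    positives = fn + tp
    return fn / positives if positives else 0.0


def best_model(results: dict) -> str:
    """Pick the model with the highest accuracy, then the lowest FNR.

    results maps a model name to an (accuracy, false_negative_rate) pair.
    The ordering of the two criteria is an assumption of this sketch.
    """
    return max(results, key=lambda name: (results[name][0], -results[name][1]))


# Invented figures, for illustration only.
results = {
    "model_a": (0.95, 0.10),
    "model_b": (0.95, 0.04),
    "model_c": (0.90, 0.01),
}
print(best_model(results))
```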

1.5. Project objectives

There are four objectives for this project. They are:

1. To carry out dataset processing and feature extraction.

2. To evaluate the performance of individual classifiers on varying datasets.

3. To determine the best ensemble design and select the best ensemble classifier.

4. To compare the results obtained from the ensemble with the results obtained from the individual algorithms.
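As an illustration of the feature extraction named in the first objective, the sketch below derives a few numeric and boolean features from a URL. The specific features are hypothetical examples of URL-based indicators and are not the feature set used in this study; the example URLs are invented.

```python
import ipaddress
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Extract a handful of illustrative features from a URL string."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    try:
        ipaddress.ip_address(host)   # raises ValueError for non-IP hosts
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "url_length": len(url),
        "host_is_ip": host_is_ip,            # raw IP used instead of a name
        "has_at_symbol": "@" in url,
        "dot_count": host.count("."),        # rough proxy for subdomain depth
        "uses_https": parsed.scheme == "https",
    }


# Invented example URLs.
print(url_features("https://www.example.com/login"))
print(url_features("http://192.0.2.10/secure@update"))
```

Each URL becomes one row of a feature table, which is the form a classifier expects as input.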

1.6. Scope of study

The scope of this research is as follows:

1. The phishing dataset is obtained from PhishTank, whereas the legitimate websites are obtained manually using Google web crawlers.

2. The dataset is first divided into three sets, which are then used to train and test the algorithms: Decision Tree (C4.5), Support Vector Machine (SVM), Linear Regression (LR), and K-Nearest Neighbor (KNN).

3. The performance of the reference algorithms is compared on the basis of precision, recall, F1-score, and accuracy.

4. The website features are categorized based on five criteria: URL and Domain Identity, Security and Encryption, Source Code and JavaScript, Page Style and Content, and Web Address.

5. The experimental implementation of this project is done using RapidMiner.
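The division of the dataset into three sets mentioned in the scope can be sketched as below. The text does not state how the division is made, so this sketch assumes a seeded random shuffle followed by three near-equal partitions; the function name and the record identifiers are hypothetical, and the study itself performs this step in RapidMiner rather than in code.

```python
import random


def split_into_three(records, seed=0):
    """Shuffle the records reproducibly and deal them into three sets."""
    shuffled = list(records)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable split
    return [shuffled[i::3] for i in range(3)]


# Ten hypothetical record identifiers.
parts = split_into_three(range(10))
print([len(part) for part in parts])
```

Every record lands in exactly one set, so no instance used for training in one set can leak into the test portion of the same set.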

1.7. The significance of study

Nowadays, there is an increasing need to detect phishing websites because of the adverse effect they can have on their victims. Much work has been done on website phishing detection, using several techniques to achieve the same goal. This study evaluates the performance of the ensemble method and of individual algorithms (C5.0, SVM, and LR) with regard to detection accuracy and false alarms, studying each of them individually and in ensemble mode to show which is more suitable for phishing detection.

1.8. Organization of report

The book consists of six chapters. Chapter 1 presents the introduction, the background of the study, the research objectives and questions, and the scope of the study. Chapter 2 reviews available and related literature on website phishing detection. Chapter 3 describes the study methodology along with the appropriate framework for the study. Chapter 4 describes the dataset collection, the preprocessing technique used, feature extraction, and dataset division. Chapter 5 discusses the implementation, results, and analysis based on the research framework. Finally, Chapter 6 concludes the book with closing remarks, a recap of the objectives, the contribution, and future work.