Feature Extraction - A Machine Learning Approach to Phishing Detection and Defense (2015)


Chapter 4. Feature Extraction


This chapter discusses the dataset preprocessing techniques and how the features are extracted from the dataset. It opens with an introduction that also gives a structural overview of the sections in this chapter. The next section discusses the processing techniques applied to the dataset, followed by the feature extraction process, including a description of the extracted features. Verification of the data used is discussed in the section that follows, leading to the normalization technique used in data processing. The normalization process prepares the dataset before it is presented to the classifiers for classification. Normalization may seem trivial, but this is not the case for the dataset used here, because value ranges vary widely across the dataset; normalization is therefore essential for the classifiers to carry out an unbiased classification. Furthermore, the division of the dataset into three parts for training and testing, to investigate the accuracy of the classifiers across different dataset compositions, is also discussed. The concluding section gives a brief summary of the chapter's objective.








4.1. Introduction

This chapter covers the dataset's mode of collection and preparation, including the extraction of all of the features intended for use in this study. The Phishtank repository is used as the only source of phishing data, whereas the non-phishing data is collected manually using the Google search engine. The output of feature extraction will be used as input in evaluating the individual classifiers, as discussed in the phases of Chapter 3. The remaining sections of this chapter discuss the feature extraction process as follows: Section 4.2 discusses the data processing procedures, including the dataset statistics; its subsections discuss the feature extraction process, data verification, and data normalization, including the method and criteria used for normalization. Section 4.3 discusses the dataset division, in terms of dataset grouping and the percentages of phishing and non-phishing data used, with justification, in order to improve the classifier training process and thus the accuracy of the results. Finally, Section 4.4 summarizes the chapter and discusses its accomplishments with respect to the objectives of this project.

4.2. Dataset processing

In order to realize a dataset suitable for the purpose of this project, the phishing data collected from Phishtank was reorganized and some derived features were added. The dataset downloaded from the Phishtank repository also had to be converted from CSV format (.csv) to SQL database format (.sql) in order to make it usable with PHP. Because of the open-source nature of Phishtank, most of the features needed for this project were not included, and so most features were extracted manually using PHP code. From the dataset collected from Phishtank, features such as "phish_detail_url," "submission_time," "verified," "verification_time," and "online" were excluded, and new features were extracted and added to the dataset. After feature extraction, the dataset is normalized using RapidMiner (Akthar and Hahne, 2012) for easy computation. Table 4.1 shows the statistics of the dataset used. Both the phishing and non-phishing websites have been tested for alive status, ensuring that the data to be used is correct and available for the next step, feature extraction. Figure 4.1 shows a pie chart of the percentage of phishing to non-phishing data used in the study.
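The CSV-to-SQL conversion described above can be sketched in a few lines. The authors did this with PHP, so the following Python version, with a hypothetical table name `phish_urls` and made-up column names, is only an illustration of the idea:

```python
import csv
import io

def csv_to_sql_inserts(csv_text, table="phish_urls"):
    """Turn a CSV export into SQL INSERT statements (single quotes doubled)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    statements = []
    for row in reader:
        cols = ", ".join(row.keys())
        vals = ", ".join("'" + v.replace("'", "''") + "'" for v in row.values())
        statements.append(f"INSERT INTO {table} ({cols}) VALUES ({vals});")
    return statements

# Illustrative two-column export; the real Phishtank schema has more fields.
sample = "phish_id,url\n1,http://phish.example/login\n"
print(csv_to_sql_inserts(sample)[0])
# INSERT INTO phish_urls (phish_id, url) VALUES ('1', 'http://phish.example/login');
```

The resulting statements can be loaded through phpMyAdmin or any SQL client; proper parameterized inserts would be preferable in production, but a statement dump matches the one-off conversion described here.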

Table 4.1

Dataset Statistic

|                 | Phishing websites | Non-phishing websites |
|-----------------|-------------------|-----------------------|
| Total collected |                   |                       |
| Tested alive    |                   |                       |

FIG. 4.1 Percentage phishing to non-phishing.

4.2.1. Feature Extraction

This section focuses on the effective minimal set of features that can be utilized in detecting phishing websites (Figure 4.2). As summarized in Section 4.2, the features were manually extracted from the dataset using PHP code, and they are discussed in this section. First, the phishing dataset is downloaded from Phishtank (OpenDNS) and tested to confirm each site is online, and then the features are extracted from each website. For non-phishing websites, a web crawler is used to extract the dataset from Google, manual extraction is also done using the Google search engine, and the source code is then extracted using PHP code on a phpMyAdmin web server (Anewalt and Ackermann, 2005). The features extracted for each of the two scenarios (phishing and non-phishing) were carefully selected from previous research work based on their individual weights. A combination of the features used in the work of Garera et al. (2007) and Zhang et al. (2007) guided the selection of the features to be extracted. These features have proven very efficient in the detection of phishing websites (Huang et al., 2012). Furthermore, the features are labeled from 1 to 10 and termed f1, f2, f3, …, f10. The 10th column, labeled f10, is the classification label of the dataset as phishing or non-phishing, as shown in equation (4.1).


FIG. 4.2 Overview of feature extraction.

Each of these features is explained briefly, in relation to previous research, to support the claim of its importance in website phishing detection.

4.2.2. Extracted Features

1. Long URL: Long URLs can be used to hide the suspicious part of an address in the address bar. Although there is no scientifically reliable method of predicting the range of lengths that identifies a website as phishing or non-phishing, URL length is a useful criterion when combined with other features in detecting suspicious sites. Basnet et al. (2011) proposed a threshold length of ≤75 but gave no justification for that value. In this project a URL length of >127 characters flags a website as phishing and ≤127 characters as non-phishing. This value was chosen by manually comparing the lengths of the longest non-phishing and phishing websites in the collected dataset.

2. Dots: A secure web-page link contains at most 5 dots. If there are more than 5 dots in a web-page link, it may be recognized as a phishing link. For example: http://www.website1.com.my/www.phish.com/index.php

3. IP-address: Some websites are hosted at an IP address instead of a fully qualified domain name. This is very suspicious, since most legitimate websites no longer use this method, for security reasons. Also, since most phishing websites stay online for a limited time, this feature can be considered one of the most relevant phishing detection features.

4. SSL connection: It is necessary for a payment or e-commerce site to be secured such that the data transmitted to and from the website is encrypted. An SSL certificate, which includes specific information regarding the website, can also be used to confirm the website's identity. A sample from one of the non-phishing websites collected is shown in Figure 4.3.

5. At “@” symbol: a phishing URL may include the “@” symbol somewhere within the address because a web browser, when reading an internet address, ignores everything to the left of the @ symbol; therefore, the address ebay.com@fake-auction.com would actually be “fake-auction.com.”

6. Hexadecimal: Particular to phishing are hex-encoded URLs. In the interest of compatibility, most mail user agents, web browsers, and HTTP servers understand basic hex-encoded character equivalents, so that http://210.219.241.125/images/paypal/cgi-bin/webscrcmd_login.php and http://%32%31%30.%32%31%39%2e%32%34%31%2e%31%32%35/%69%6d%61%67%65%73/paypal/cgi-bin/webscrcmd_login.php are functionally equivalent. The main illicit purpose of this encoding is to evade blacklist-based anti-spam filters that do not process hex character encoding (effectively, another insertion attack). It also evades protection mechanisms that prohibit IP addresses as URL destinations, on the assumption that “normal” http links will use more familiar DNS names.

7. Frame: Frames are a popular method of hiding attack content because of their uniform browser support and easy coding style. The example in Figure 4.4 describes a scenario in which the attacker defines two frames. The first frame contains the legitimate site's URL information, whereas the second frame, occupying 0% of the browser interface, runs malicious code. The page linked within the hidden frame can be used to deliver additional content, retrieve confidential information such as session IDs, or do something more advanced, such as executing screen-grabbing and key-logging while the user is exchanging confidential information over the Internet. The output of Figure 4.4 is shown in Figure 4.5.


FIG. 4.3 URL with secure socket layer.


FIG. 4.4 Source code for frame embedded website.


FIG. 4.5 Output of frame embedded website scenario.

8. Redirect: A web application accepts a user-controlled input that specifies a link to an external site and uses that link in a redirect, which simplifies phishing attacks. For example, “www.facebook.com/l/53201;phish.com” redirects the page to phish.com, using “facebook” as the redirecting site. Since Facebook is a well-known social network, a user seeing a link that begins with “facebook” assumes before clicking that it will end up on the Facebook website; had the link instead begun with “www.phish-facebook.com,” the user would become suspicious.

9. Submit: Phishing websites often include a “submit” button in their source code which, more often than not, posts to the phisher's email address or a database. When the user, thinking the website is legitimate, enters private information and clicks the submit button, either a page-not-found or some other error message appears on the screen, and the user typically assumes it is a network problem of some sort. This is common in most duplicate-website phishing scenarios.
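The URL-level checks above (long URL, excessive dots, IP-address host, “@” symbol, hex encoding) are simple string predicates. The study implemented them in PHP, so the following Python sketch is an illustration only, reusing the chapter's thresholds (length > 127 characters, more than 5 dots):

```python
import re

def long_url(url):          # f1: length above the 127-character threshold
    return len(url) > 127

def excessive_dots(url):    # f2: more than 5 dots in the link
    return url.count(".") > 5

def uses_ip_address(url):   # f3: host is a raw IPv4 address, not a domain name
    host = re.sub(r"^https?://", "", url).split("/")[0]
    return re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None

def has_at_symbol(url):     # f5: browsers ignore everything left of "@"
    return "@" in url

def hex_encoded(url):       # f6: %xx escapes masking the real destination
    return re.search(r"%[0-9a-fA-F]{2}", url) is not None

print(uses_ip_address("http://210.219.241.125/images/paypal/"))               # True
print(excessive_dots("http://www.website1.com.my/www.phish.com/index.php"))   # True
```

Each predicate maps directly onto one feature column of the dataset; the page-content checks (SSL, frame, redirect, submit) would additionally need the fetched source code rather than the URL string alone.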

Figure 4.6 shows the flowchart of the feature extraction process indicating each feature as a variable and the conditions met for individual feature classification.


FIG. 4.6 Feature extraction process flow-chart.

In Figure 4.6, each Fi represents a step. Assuming each step is labeled Si, where i corresponds across F and S, the range of i is 1 ≤ i ≤ 9, and for every i a feature Fi is extracted and checked to confirm whether it satisfies the criterion in the decision box. If steps Si (i = 1, …, 9) fail to recognize a phishing threat, then the URL is saved in the non-phishing database for further data mining.
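The control flow of Figure 4.6, applying the checks S1 to S9 in order and saving the URL to the non-phishing database only when no step fires, can be sketched as follows (the two check functions shown are hypothetical stand-ins, not the full set of nine):

```python
def classify(url, checks, non_phish_db):
    """Apply the feature checks S1..S9 in sequence; any hit flags the URL
    as phishing, otherwise it is saved for further data mining."""
    for i, check in enumerate(checks, start=1):
        if check(url):
            return f"phishing (flagged at step S{i})"
    non_phish_db.append(url)
    return "non-phishing"

# Illustrative stand-ins for two of the nine feature checks.
checks = [lambda u: "@" in u, lambda u: u.count(".") > 5]
db = []
print(classify("http://ebay.com@fake-auction.com", checks, db))  # flagged at S1
print(classify("http://www.example.com/", checks, db))           # non-phishing
```

Note that this sequential short-circuit mirrors the flowchart only; the classifiers of Chapter 3 instead receive the full normalized feature vector, so in practice every feature is extracted for every URL.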

The dataset is sorted and saved in a database which is later exported for further analysis. Table 4.2 shows the features and the terms used: the terms are used for simplicity in describing and referring to each of the features.

Table 4.2

Dataset Features

| Term | Feature        | Description                                                         |
|------|----------------|---------------------------------------------------------------------|
| f1   | Long URL       | Bogus URL address                                                   |
| f2   | Dots           | Excessive dots in URL address                                       |
| f3   | IP address     | Using IP-address instead of a registered domain                     |
| f4   | SSL connection | Secure (encryption) protocol for communication with the web server  |
| f5   | At symbol      | Presence of “@” sign in the URL address                             |
| f6   | Hexadecimal    | The URL is masked with hexadecimal code                             |
| f7   | Frame          | Frame component of HTML code                                        |
| f8   | Redirect       | Webpage with redirect link                                          |
| f9   | Submit         | Webpage including submit button                                     |
| f10  | Class          | Value assigned to classes                                           |

Source: Garera et al., 2007; Zhang et al., 2007.

Based on the case studies conducted, the 9 extracted features were categorized under 5 indicators. These indicators show the class of each feature with respect to the composition of a typical website. Table 4.3 shows the features as categorized under each indicator, where N represents the number of features classified under each criterion.

Table 4.3

Feature Categorization under Phishing Indicators

| Phishing indicator         | N | Features                                                                                                                          |
|----------------------------|---|-----------------------------------------------------------------------------------------------------------------------------------|
| URL and domain identity    | 1 | Using IP address                                                                                                                    |
| Security and encryption    | 1 | Using SSL certificate                                                                                                               |
| Source code and JavaScript | 1 | Redirect pages                                                                                                                      |
| Page style and content     | 2 | Empty page reference of other browser; using forms with submit button                                                               |
| Web address bar            | 4 | Long URL addresses; excessive dots in a URL address; website links containing “@” symbol; IP address masking with hexadecimal code  |
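As a quick consistency check on Table 4.3, the indicator-to-feature mapping can be written as a small data structure and N derived by counting (feature names abbreviated):

```python
# Mapping of phishing indicators to their member features (Table 4.3).
indicators = {
    "URL and domain identity": ["IP address host"],
    "Security and encryption": ["SSL certificate"],
    "Source code and JavaScript": ["Redirect pages"],
    "Page style and content": ["Hidden/empty frame", "Form with submit button"],
    "Web address bar": ["Long URL", "Excessive dots", '"@" symbol in link',
                        "Hex-encoded IP address"],
}

n_per_indicator = {name: len(feats) for name, feats in indicators.items()}
print(n_per_indicator["Web address bar"])          # 4
print(sum(n_per_indicator.values()))               # 9 features in total
```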

4.2.3. Data Verification

Data collected manually needs to be verified in order to ascertain its alive status, especially in the case of phishing, since phishing websites mostly last for a limited period of time. For this reason, each URL must be verified before processing.

4.2.4. Data Normalization

There are many methods for data normalization, including min–max normalization (range transformation), z-score normalization, and normalization by decimal scaling. Min–max normalization performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of attribute A. Min–max normalization maps a value v of A to v′ in the range [new_min_A, new_max_A] by computing equation (4.1). To customize the normalization output to the desired scale, the range transformation method was selected. Equation (4.1) shows the range transformation formula used for normalization:

v′ = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A    (4.1)
The extracted features are set to the values described in the rule set shown in equation (4.3), where i = 1, …, n.
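The range transformation of equation (4.1) is straightforward to implement. A minimal Python sketch follows; the constant-attribute fallback is an added assumption, since the formula is undefined when max_A = min_A:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Range transformation: linearly map values onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:                     # constant attribute: map everything to new_min
        return [new_min] * len(values)
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]

# A column with a large range is compressed into [0, 1].
print(min_max([0, 5, 10]))  # [0.0, 0.5, 1.0]
```

RapidMiner's range-transformation operator performs the same mapping per attribute; the sketch simply makes the arithmetic of equation (4.1) explicit.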

In order to show the need for normalization of the dataset, Figures 4.7 and 4.8 show the dataset before and after normalization, respectively.


FIG. 4.7 Dataset before normalization.


FIG. 4.8 Dataset after normalization.

In Figure 4.7, the outlined columns (in red in the online version of the book) contain large numbers compared to the other cells, and as such normalization of this data is needed in order to prevent inaccurate results. The normalized data is shown in Figure 4.8.

4.3. Dataset division

After data processing, the dataset is divided into three sets for training and testing purposes and to investigate the accuracy of the results. Two steps of data division are used: the first step is to divide the data into three different groups; the second is to choose a different percentage of phishing and non-phishing data for each group. The first group is 50% phishing and 50% non-phishing, the second group 70% phishing and 30% non-phishing, and the last group 30% phishing and 70% non-phishing, as shown in Table 4.4. Furthermore, a 10-fold cross-validation is used to estimate the predictive performance of the selected attribute set (Hall et al., 2009).

Table 4.4

Total Data for Each Group and Each Process

| Set                   | No. of phishing | No. of non-phishing |
|-----------------------|-----------------|---------------------|
| Set-A: 1750 instances | 525 (30%)       | 1225 (70%)          |
| Set-B: 1750 instances | 875 (50%)       | 875 (50%)           |
| Set-C: 1750 instances | 1225 (70%)      | 525 (30%)           |

Table 4.4 shows the total dataset and the division across phishing and non-phishing dataset. Also, it shows the number of instances covered for each process.

Finally, the prepared input and target output are tested and trained using a stratified sampling method and a 10-fold cross-validation, selected after (10, 20, …, 90) folds were tested and the standard deviation calculated. The dataset is divided into k subsets, and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k−1 subsets are put together to form the training set. The training dataset is used to train the NN model to identify the pattern of the data, and the testing dataset is used to test the network's ability to identify the pattern.
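The stratified 10-fold procedure can be sketched without any machine-learning library: indices are grouped by class and dealt round-robin into k folds, so each fold keeps approximately the group's phishing/non-phishing ratio, and each fold serves once as the holdout test set:

```python
def stratified_kfold(labels, k=10):
    """Return k folds of indices, each preserving the class proportions."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for members in by_class.values():
        for j, idx in enumerate(members):   # deal each class round-robin
            folds[j % k].append(idx)
    return folds

# Set-B proportions: 875 phishing and 875 non-phishing instances.
labels = ["phish"] * 875 + ["legit"] * 875
folds = stratified_kfold(labels, k=10)
# Each fold in turn is the test set; the other nine form the training set.
print(len(folds[0]))                                      # 176
print(sum(1 for i in folds[0] if labels[i] == "phish"))   # 88
```

Production work would typically shuffle before dealing and use a library routine such as scikit-learn's `StratifiedKFold`; the sketch only makes the fold-construction logic explicit.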

4.4. Summary

Chapter 4 has described the process of data collection, feature extraction, and data normalization, along with a description of the extracted features. The output of this chapter is a direct input for Phase 1 of the research methodology. The purpose is to preprocess the data for use in achieving the objectives of this project.