# Data Science For Dummies (2016)

*Part 2*

### Using Data Science to Extract Meaning from Your Data

*Chapter 6*

### Using Clustering to Subdivide Data

**IN THIS CHAPTER**

**Understanding the basics of clustering**

**Clustering your data with the k-means algorithm and kernel density estimation**

**Getting to know hierarchical and neighborhood clustering algorithms**

**Checking out decision tree and random forest algorithms**

Data scientists use clustering to help them divide their unlabeled data into subsets. The basics behind clustering are relatively easy to understand, but things get tricky fast when you get into using some of the more advanced algorithms. In this chapter, I introduce the basics behind clustering. I follow that by introducing several nuanced algorithms that offer clustering solutions to meet your requirements, based on the specific characteristics of your feature dataset.

*Introducing Clustering Basics*

To grasp advanced methods for use in clustering your data, you should first take a few moments to make sure you have a firm understanding of the basics that underlie all forms of clustering. Clustering is a form of *machine learning* — the machine in this case is your computer, and *learning* refers to an algorithm that’s repeated over and over until a certain set of predetermined conditions is met. Learning algorithms are generally run until the point that the final analysis results will not change, no matter how many additional times the algorithm is passed over the data.

Clustering falls under one of the two main types of machine learning: *unsupervised* machine learning, in which the data in the dataset is unlabeled. Because the data is unlabeled, the algorithms must use inferential methods to discover patterns, relationships, and correlations within the raw dataset. To put clustering through its paces, I want to use a readily available sample dataset from the World Bank’s open datasets on country income and education. This data shows the percentage of income earned by the bottom 10 percent of the population in each country and the percentage of children who complete primary school in each country.

For this chapter’s discussion, I’m isolating the median reported statistic from the years 2000 to 2003. (Some countries report on these statistics only every few years, and during 2000 to 2003, this data was fully reported by 81 of 227 countries.)

In datasets about the percentage of children who complete primary school, some countries report rates of more than 100 percent. That’s because some countries count this statistic at different ages, but the data was *normalized* so that the percentage distribution is proportionally scaled across the range of countries represented in the dataset. In other words, although the total scale exceeds 100 percent, the values have been normalized so that they’re proportional to one another and you’re getting an apples-to-apples comparison. Thus, the fact that some countries report completion rates greater than 100 percent has no adverse effect on the analysis you make in this chapter.

*Getting to know clustering algorithms*

You use clustering algorithms to subdivide unlabeled datasets into clusters of observations that are most similar for a predefined feature. If you have a dataset that describes multiple features about a set of observations and you want to group your observations by their feature similarities, use clustering algorithms. There are different clustering methods, depending on how you want your dataset to be divided. The two main types of clustering algorithms are

· **Partitional:** Algorithms that create only a single set of clusters

· **Hierarchical:** Algorithms that create separate sets of nested clusters, each in its own hierarchical level

You can read about both approaches later in this chapter, but for now, start by looking at *Figure 6-1*, a simple scatter plot of the Country Income and Education datasets.

**FIGURE 6-1:** A simple scatter plot.

In unsupervised clustering, you start with this data and then proceed to divide it into subsets. These subsets, called *clusters,* are composed of observations that are most similar to one another. In *Figure 6-1*, it appears that there are at least two clusters, probably three — one at the bottom with low income and education, and then the high-education countries look like they might be split between low and high incomes.

*Figure 6-2* shows the result of *eyeballing,* or making a visual estimate of, clusters in this dataset.

**FIGURE 6-2:** A simple scatter plot, showing eyeballed estimations of clustering.

Although you can generate visual estimates of clusters, you can achieve much more accurate results when dealing with much larger datasets by using algorithms to generate clusters for you. Visual estimation is a rough method that’s useful only on smaller datasets of minimal complexity. Algorithms produce exact, repeatable results, and you can use algorithms to generate clusters from multiple dimensions of data within your dataset.

Clustering algorithms are appropriate in situations where the following characteristics are true:

· You know and understand the dataset you’re analyzing.

· Before running the clustering algorithm, you don’t have an exact idea of the nature of the subsets (clusters). Often, you don’t even know how many subsets are in the dataset before you run the algorithm.

· The subsets (clusters) are determined by only the single dataset you’re analyzing.

· Your goal is to determine a model that describes the subsets in a single dataset and only this dataset.

If you add more data to your dataset after you’ve already built the model, be sure to rebuild the model from scratch to get more complete and accurate model results.

*Looking at clustering similarity metrics*

Clustering methods are based on calculating the similarity or difference between two observations. If your dataset is *numeric* — composed of only numerical features — and can be portrayed on an n-dimensional plot, you can use various geometric metrics to measure the distances between observations in your multidimensional data.

An *n-dimensional plot* is a multidimensional scatter plot that you can use to plot *n* number of dimensions of data.

Some popular geometric metrics used for calculating distances between observations are simply different geometric functions that are useful for modeling distances between points:

· **Euclidean metric:** A measure of the distance between points plotted on a Euclidean plane.

· **Manhattan metric:** A measure of the distance between points where distance is calculated as the sum of the absolute value of the differences between two points’ Cartesian coordinates.

· **Minkowski distance metric:** A generalization of the Euclidean and Manhattan distance metrics. Quite often, these metrics can be used interchangeably.

· **Cosine similarity metric:** The cosine metric measures the similarity of two data points based on their orientation, as determined by taking the cosine of the angle between them.
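To make these four metrics concrete, here’s a minimal sketch in plain Python; the point coordinates are made up purely for illustration:

```python
import math

def euclidean(p, q):
    # Straight-line distance between two points on a Euclidean plane
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of the absolute differences between Cartesian coordinates
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, r):
    # Generalizes both: r=2 reproduces Euclidean, r=1 reproduces Manhattan
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

def cosine_similarity(p, q):
    # Orientation-based similarity: the cosine of the angle between the vectors
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p)) *
                  math.sqrt(sum(b * b for b in q)))

print(euclidean((0, 0), (3, 4)))          # 5.0
print(manhattan((0, 0), (3, 4)))          # 7
print(minkowski((0, 0), (3, 4), 2))       # 5.0 (same as Euclidean)
print(cosine_similarity((1, 0), (0, 1)))  # 0.0 (perpendicular vectors)
```

Notice how the Minkowski result with *r* = 2 matches the Euclidean result exactly, which is why the metrics can so often be used interchangeably.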

Lastly, for non-numeric data, you can use metrics like the *Jaccard distance metric,* an index that compares the number of features that two observations have in common. For example, to illustrate a Jaccard distance, look at these two text strings:

· Saint Louis de Ha-ha, Quebec

· St-Louis de Ha!Ha!, QC

What features do these text strings have in common? And what features are different between them? The Jaccard metric generates a numerical index value that quantifies the similarity between text strings.
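One simple sketch of the idea: treat each string’s set of lowercase characters as its “features” (an assumed feature choice on my part; real implementations often use words or n-grams instead) and compute the ratio of shared features to total features:

```python
def jaccard_distance(features_a, features_b):
    # 1 minus (shared features / total distinct features)
    a, b = set(features_a), set(features_b)
    return 1 - len(a & b) / len(a | b)

s1 = "Saint Louis de Ha-ha, Quebec"
s2 = "St-Louis de Ha!Ha!, QC"

# Assumed feature choice: the set of lowercase characters in each string
distance = jaccard_distance(s1.lower(), s2.lower())
print(round(distance, 2))  # 0 means identical feature sets; 1 means nothing shared
```

The two addresses share most of their characters, so the distance comes out small — quantifying the visual impression that they name the same place.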

*Identifying Clusters in Your Data*

You can use many different algorithms for clustering, but the speed and robustness of the k-means algorithm makes it a popular choice among experienced data scientists. As alternatives, kernel density estimation methods, hierarchical algorithms, and neighborhood algorithms are also available to help you identify clusters in your dataset.

*Clustering with the k-means algorithm*

The *k-means* clustering algorithm is a simple, fast, unsupervised learning algorithm that you can use to predict groupings within a dataset. The model makes its prediction based on the *number of centroids present* — represented by *k,* a model parameter that you must define — and the nearest mean values, measured by the Euclidean distance between observations. Because the features of a dataset are usually on different scales, the difference of scales can distort the results of this distance calculation. To avoid this problem, scale your variables before using k-means to predict data groupings.

The quality of the clusters is heavily dependent on the correctness of the *k* value you specify. If your data is 2- or 3-dimensional, a plausible range of *k* values may be visually determinable. In the eyeballed approximation of clustering from the World Bank Income and Education data scatter plot (refer to *Figure 6-2*), a visual estimation of the *k* value would equate to three clusters, or *k* = 3.

When defining the *k* value, it may be possible to choose the number of centroids by looking at a scatter plot (if your dataset is 2- or 3-dimensional) or by looking for obvious, significant groupings within your dataset’s variables. You can pick the number of centroids based on the number of groupings that you know exist in the dataset, or by the number of groupings that you want to exist in the dataset. Whatever the case, use your subjective knowledge about the dataset when choosing the number of clusters to be modeled.
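As a sketch of this in practice, here is Scikit-learn’s `KMeans` with *k* = 3, including the scaling step discussed above. The data below is an invented stand-in for the income/education features, not the actual World Bank file:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the income/education features (not the real data)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal((1, 40), 0.5, size=(30, 2)),   # low income, low education
    rng.normal((1, 95), 0.5, size=(30, 2)),   # low income, high education
    rng.normal((3, 98), 0.5, size=(30, 2)),   # higher income, high education
])

# Scale first so the wider-ranging education axis doesn't dominate the
# Euclidean distance calculation
X_scaled = StandardScaler().fit_transform(X)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(model.cluster_centers_.shape)  # (3, 2): three centroids in two dimensions
```

Each observation ends up in `model.labels_`, tagged with the index of its nearest centroid.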

If your dataset has more than three dimensions, however, you can use computational methods to generate a good value for *k.* One such method is the *silhouette coefficient* — a method that calculates the average distance of each point from all other points in a cluster and then compares that value with the average distance to every point in every other cluster. Luckily, because the k-means algorithm is efficient, it does not require much computer processing power, and you can easily calculate this coefficient for a wide range of *k* values.
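A rough sketch of that procedure with Scikit-learn’s `silhouette_score`, run on synthetic blobs (the approach works the same in any number of dimensions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in [(0, 0), (4, 4), (0, 4)]])

# Fit k-means for a range of k values and score each clustering
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the k with the highest silhouette coefficient
```

For three well-separated blobs like these, the silhouette coefficient peaks at *k* = 3.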

The k-means algorithm works by placing sample cluster centers on an *n*-dimensional plot and then evaluating whether moving them in any single direction would result in a new center with higher *density* — with more observations closer to it, in other words. The centers are moved from regions of lower density to regions of higher density until all centers are within a region of *local maximum density* — a true center of the cluster, where each cluster gets a maximum number of points closest to its cluster center. Whenever possible, you should try to place the centers yourself, manually. If that’s not possible, simply place the centers randomly and run the algorithm several times to see how often you end up with the same clusters.

One weakness of the k-means algorithm is that it may produce incorrect results by placing cluster centers in areas of *local minimum density*. This happens when centers get lost in *low-density regions* (in other words, regions of the plot that have relatively few points plotted in them) and the algorithm-driven *directional movement* (the movement that’s meant to increase point density) starts to bounce and oscillate between faraway clusters. In these cases, the center gets caught in a low-density space that’s located between two high-point density zones. This results in erroneous clusters based around centers that converge in areas of low, local minimum density. Ironically, this happens most often when the underlying data is very well-clustered, with tight, dense regions that are separated by wide, sparse areas.

To try things out for yourself, start clustering your data with the k-means method by using either R’s cluster package or Python’s Scikit-learn library. For more on R’s cluster package, check out *http://cran.r-project.org/web/packages/cluster/cluster.pdf*; for more on Scikit-learn, check out *http://scikit-learn.org*.

*Estimating clusters with kernel density estimation (KDE)*

If the k-means algorithm doesn’t appeal to you, one alternative way to identify clusters in your data is to use a density smoothing function instead. *Kernel density estimation* (KDE) is that smoothing method; it works by placing a *kernel* — a weighting function that is useful for quantifying density — on each data point in the dataset and then summing the kernels to generate a kernel density estimate for the overall region. Areas of greater point density will sum out with greater kernel density, and areas of lower point density will sum out with less kernel density.
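Here’s a one-dimensional sketch of the idea using SciPy’s `gaussian_kde` on synthetic data (a 2-D estimate like the one in this chapter’s World Bank plot works the same way):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic one-dimensional data: two clumps separated by a sparse gap
rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(-3, 0.5, 200), rng.normal(3, 0.5, 200)])

# gaussian_kde places a Gaussian kernel on each data point and sums them
kde = gaussian_kde(data)
grid = np.linspace(-6, 6, 121)  # grid[30] = -3, grid[60] = 0, grid[90] = 3
density = kde(grid)

# The summed kernel density is high near the clumps, low in the gap between
print(density[30] > density[60], density[90] > density[60])  # True True
```

The two density peaks correspond to the two clumps, and the trough between them is the kind of gradual density change that helps you eyeball cluster boundaries.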

Because kernel smoothing methods don’t rely on cluster center placement and clustering techniques to estimate clusters, they don’t exhibit a risk of generating erroneous clusters by placing centers in areas of local minimum density. Where k-means algorithms generate hard-lined definitions between points in different clusters, KDE generates a plot of gradual density change between observations. For this reason, it’s a helpful aid when eyeballing clusters. *Figure 6-3* shows what the World Bank Income and Education scatter plot looks like after a KDE has been applied.

**FIGURE 6-3:** KDE smoothing of the World Bank’s Income and Education data scatter plot.

In *Figure 6-3*, you can see that the white spaces between clusters have been reduced. When you look at the figure, it’s fairly obvious that there are at least three clusters, and possibly more, if you want to allow for small clusters.

*Clustering with hierarchical algorithms*

A hierarchical clustering algorithm is yet another alternative to k-means clustering. In comparison to k-means clustering, the hierarchical clustering algorithm is a slower, clunkier unsupervised clustering algorithm. It predicts groupings within a dataset by calculating the distance and generating a link between each singular observation and its nearest neighbor. It then uses those distances to predict subgroups within a dataset. If you’re carrying out a statistical study or analyzing biological or environmental data, hierarchical clustering might be your ideal machine learning solution.

To visually inspect the results of your hierarchical clustering, generate a *dendrogram* — a visualization tool that depicts the similarities and branching between groups in a data cluster. You can use several different algorithms to build a dendrogram, and the algorithm you choose dictates where and how branching occurs within the clusters. Additionally, dendrograms can be built either *bottom-up* (by assembling pairs of points and then aggregating them into larger and larger groups) or *top-down* (by starting with the full dataset and splitting it into smaller and smaller groups). Looking at the dendrogram results makes it easier to decide the appropriate number of clusters for your dataset. In the dendrogram example shown in *Figure 6-4*, the underlying dataset appears to have either three or four clusters.

**FIGURE 6-4:** A schematic layout of a sample dendrogram.

In hierarchical clustering, the distance between observations is measured in three different ways: Euclidean, Manhattan, or Cosine. Additionally, linkage is formed by three different methods: Ward, Complete, and Average. When deciding what distance and linkage parameters to use, trial-and-error is an easy approach. Just try each combination and then compare all your model results. Go with the model that produces the most accurate prediction.
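As a sketch of that trial-and-error loop with SciPy (on synthetic data, trying just two of the distance/linkage combinations named above):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in data: three well-separated blobs
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in [(0, 0), (5, 5), (0, 5)]])

# Try different linkage/distance combinations and compare the results
for method, metric in [("ward", "euclidean"), ("average", "cityblock")]:
    Z = linkage(X, method=method, metric=metric)     # build the hierarchy bottom-up
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(method, metric, sorted(set(labels)))
```

Passing the same `Z` matrix to `scipy.cluster.hierarchy.dendrogram` draws the kind of branching diagram shown in Figure 6-4. (Note that Ward linkage is defined only for Euclidean distances.)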

Hierarchical clustering algorithms are more computationally expensive than k-means algorithms because with each iteration of hierarchical clustering, many observations must be compared to many other observations. The benefit, however, is that hierarchical clustering algorithms are not subject to errors caused by center convergence at areas of local minimum density (as exhibited with the k-means clustering algorithms).

If you’re working with a large dataset, watch out! Hierarchical clustering will probably be *way* too slow.

If you want to get started working with hierarchical clustering algorithms, check out R’s hclust function (part of the built-in stats package) or (again) Python’s Scikit-learn library. (If you’re curious about hclust, check out *https://stat.ethz.ch/R-manual/R-patched/library/stats/html/hclust.html*.)

Neither k-means nor hierarchical clustering algorithms perform well when clusters are *nonglobular* — a configuration where some points in a cluster are closer to points in a different cluster than they are to points in the center of their own cluster. If your dataset shows nonglobular clustering, you can use neighborhood clustering algorithms, like DBScan, to determine whether each point is closer to its neighbors in the same cluster or closer to its neighboring observations in other clusters. (*Figure 6-5*, in the following section, shows an example of neighborhood clustering.)

**FIGURE 6-5:** Sample output from a neighborhood clustering algorithm.

*Dabbling in the DBScan neighborhood*

*Density-based spatial clustering of applications with noise (DBScan)* is an unsupervised learning method that works by clustering *core samples* (dense areas of a dataset) while simultaneously demarcating *non-core samples* (portions of the dataset that are comparatively sparse). It’s a neighborhood clustering algorithm that’s ideal for examining two or more variables together to identify outliers. It’s particularly useful for identifying *collective* outliers — outliers that appear nearby to one another, all having similar values that are anomalous to most values in the variable.

With DBScan, take an iterative, trial-and-error approach to find the ideal number of outliers for inspection. When experimenting with the DBScan model, outliers should comprise 5 percent or less of the dataset’s observations. You must adjust the model parameters until you’ve isolated this small select group of observations.
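Here’s a sketch of that trial-and-error loop with Scikit-learn’s `DBSCAN` on synthetic data; `eps` (the neighborhood radius) and `min_samples` (the expected density) are the two parameters you adjust between runs:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data: two dense blobs plus a few points placed far from everything
rng = np.random.default_rng(4)
dense = np.vstack([rng.normal(c, 0.2, size=(50, 2)) for c in [(0, 0), (4, 4)]])
far_points = np.array([[10.0, 10.0], [-5.0, 8.0], [8.0, -5.0]])
X = np.vstack([dense, far_points])

# eps and min_samples are the density parameters you tune by trial and error
model = DBSCAN(eps=0.5, min_samples=5).fit(X)

n_noise = int((model.labels_ == -1).sum())  # DBSCAN labels noise points as -1
print(n_noise, "outliers out of", len(X), "points")
print(round(n_noise / len(X), 3))  # keep this at roughly 0.05 or less
```

If the noise fraction comes out much higher than 5 percent, loosen `eps` or lower `min_samples` and rerun until you’ve isolated a small, select group of observations.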

Neighborhood clustering algorithms are generally effective, but they are subject to these two weaknesses:

· **Neighborhood clustering can be computationally expensive.** With every iteration of this method, every data point might have to be compared to every other data point in the dataset.

· **With neighborhood clustering, you might have to provide the model with empirical parameter values for expected cluster size and cluster density.** If you guess either of these parameters incorrectly, the algorithm misidentifies clusters, and you must start the whole long process over again to fix the problem. If you choose to use the DBScan method, you’re required to specify these parameters. (As an alternative, you could try the average nearest neighbor and k-nearest neighbor algorithms, which are discussed in *Chapter 7*.)

To avoid making poor guesses for the cluster size and cluster density parameters, you can first use a quick k-means algorithm to determine plausible values.

*Categorizing Data with Decision Tree and Random Forest Algorithms*

In cases where clustering algorithms fail, decision tree and random forest algorithms might just offer you a perfect alternative machine learning solution. At certain times, you can get stuck trying to cluster and classify data from a non-numerical dataset. It’s times like these that you can use a decision tree model to help cluster and classify your data correctly.

A *decision tree* algorithm works by developing a set of yes-or-no rules that you can follow for new data to see exactly how it will be characterized by the model. But you must be careful when using decision tree models, because they run the high risk of *error propagation,* which occurs whenever one of the model rules is incorrect. Errors are generated in the results of decisions that are made based on that incorrect rule, and then propagated through every other subsequent decision made along that branch of the tree.

To illustrate this type of algorithm, consider a dataset that’s often used in machine learning demonstrations — the list of passenger names from the *Titanic.* Using a simple decision tree model, you can predict that if a passenger was female or was a male child with a large family, he or she probably survived the catastrophe. *Figure 6-6* illustrates this determination.

**FIGURE 6-6:** A decision tree model predicts survival rates from the *Titanic* catastrophe.
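A toy sketch of that kind of rule-based model with Scikit-learn; the passenger rows below are invented stand-ins for illustration, not the real *Titanic* manifest:

```python
from sklearn.tree import DecisionTreeClassifier

# Invented mini-dataset (not the real passenger manifest):
# features = [sex (0 = male, 1 = female), age, family size]
X = [
    [1, 29, 1], [1, 4, 3], [0, 6, 5], [0, 35, 1],
    [0, 54, 1], [1, 58, 2], [0, 9, 6], [0, 40, 1],
]
y = [1, 1, 1, 0, 0, 1, 1, 0]  # 1 = survived, 0 = did not

# Limiting the depth keeps the rule set short, which limits how far an
# incorrect rule can propagate its errors down the tree
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print(tree.predict([[1, 30, 1]]))  # an adult woman traveling alone -> [1]
```

Each internal node of the fitted tree is one yes-or-no rule, so you can follow a new passenger down the branches by hand and see exactly why the model classifies him or her as it does.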

Lastly, random forest algorithms are a slower but more powerful alternative. Instead of building a single tree from the data, the algorithm builds many randomized trees and then aggregates their votes to classify the testing data. This ensemble approach greatly reduces the risk of error propagation that is inherent in decision tree models.
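To round things out, here’s the same invented mini-dataset fed to Scikit-learn’s `RandomForestClassifier`, which aggregates the votes of many randomized trees:

```python
from sklearn.ensemble import RandomForestClassifier

# Same invented mini-dataset as the decision tree sketch:
# features = [sex (0 = male, 1 = female), age, family size]
X = [
    [1, 29, 1], [1, 4, 3], [0, 6, 5], [0, 35, 1],
    [0, 54, 1], [1, 58, 2], [0, 9, 6], [0, 40, 1],
]
y = [1, 1, 1, 0, 0, 1, 1, 0]  # 1 = survived, 0 = did not

# 100 randomized trees are built; each one votes on every new prediction
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[0, 45, 1]]))  # an adult man traveling alone
```

Because no single tree decides the outcome by itself, one badly chosen rule in one tree gets outvoted by the rest of the forest.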