Chapter 17. Decision Trees
A tree is an incomprehensible mystery.
Jim Woodring
DataSciencester’s VP of Talent has interviewed a number of job candidates from the site, with varying degrees of success. He’s collected a data set consisting of several (qualitative) attributes of each candidate, as well as whether that candidate interviewed well or poorly. Could you, he asks, use this data to build a model identifying which candidates will interview well, so that he doesn’t have to waste time conducting interviews?
This seems like a good fit for a decision tree, another predictive modeling tool in the data scientist’s kit.
What Is a Decision Tree?
A decision tree uses a tree structure to represent a number of possible decision paths and an outcome for each path.
If you have ever played the game Twenty Questions, then it turns out you are familiar with decision trees. For example:
§ “I am thinking of an animal.”
§ “Does it have more than five legs?”
§ “No.”
§ “Is it delicious?”
§ “No.”
§ “Does it appear on the back of the Australian five-cent coin?”
§ “Yes.”
§ “Is it an echidna?”
§ “Yes, it is!”
This corresponds to the path:
“Not more than 5 legs” → “Not delicious” → “On the 5-cent coin” → “Echidna!”
in an idiosyncratic (and not very comprehensive) “guess the animal” decision tree (Figure 17-1).
Decision trees have a lot to recommend them. They’re very easy to understand and interpret, and the process by which they reach a prediction is completely transparent. Unlike the other models we’ve looked at so far, decision trees can easily handle a mix of numeric (e.g., number of legs) and categorical (e.g., delicious/not delicious) attributes and can even classify data for which attributes are missing.
At the same time, finding an “optimal” decision tree for a set of training data is computationally a very hard problem. (We will get around this by trying to build a good-enough tree rather than an optimal one, although for large data sets this can still be a lot of work.) More important, it is very easy (and very bad) to build decision trees that areoverfittedto the training data, and that don’t generalize well to unseen data. We’ll look at ways to address this.
Most people divide decision trees into classification trees (which produce categorical outputs) and regression trees (which produce numeric outputs). In this chapter, we’ll focus on classification trees, and we’ll work through the ID3 algorithm for learning a decision tree from a set of labeled data, which should help us understand how decision trees actually work. To make things simple, we’ll restrict ourselves to problems with binary outputs like “should I hire this candidate?” or “should I show this website visitor advertisement A or advertisement B?” or “will eating this food I found in the office fridge make me sick?”
Entropy
In order to build a decision tree, we will need to decide what questions to ask and in what order. At each stage of the tree there are some possibilities we’ve eliminated and some that we haven’t. After learning that an animal doesn’t have more than five legs, we’ve eliminated the possibility that it’s a grasshopper. We haven’t eliminated the possibility that it’s a duck. Every possible question partitions the remaining possibilities according to their answers.
Ideally, we’d like to choose questions whose answers give a lot of information about what our tree should predict. If there’s a single yes/no question for which “yes” answers always correspond to True outputs and “no” answers to False outputs (or vice versa), this would be an awesome question to pick. Conversely, a yes/no question for which neither answer gives you much new information about what the prediction should be is probably not a good choice.
We capture this notion of “how much information” with entropy. You have probably heard this used to mean disorder. We use it to represent the uncertainty associated with data.
Imagine that we have a setSof data, each member of which is labeled as belonging to one of a finite number of classes. If all the data points belong to a single class, then there is no real uncertainty, which means we’d like there to be low entropy. If the data points are evenly spread across the classes, there is a lot of uncertainty and we’d like there to be high entropy.
In math terms, if $p_i$ is the proportion of data labeled as class $c_i$, we define the entropy as:

$$H(S) = -p_1 \log_2 p_1 - \cdots - p_n \log_2 p_n$$

with the (standard) convention that $0 \log 0 = 0$.
Without worrying too much about the grisly details, each term $-p_i \log_2 p_i$ is non-negative and is close to zero precisely when $p_i$ is either close to zero or close to one (Figure 17-2).
This means the entropy will be small when every $p_i$ is close to 0 or 1 (i.e., when most of the data is in a single class), and it will be larger when many of the $p_i$’s are not close to 0 (i.e., when the data is spread across multiple classes). This is exactly the behavior we desire.
It is easy enough to roll all of this into a function:
import math

def entropy(class_probabilities):
    """given a list of class probabilities, compute the entropy"""
    return sum(-p * math.log(p, 2)
               for p in class_probabilities
               if p)                         # ignore zero probabilities
Our data will consist of pairs (input, label), which means that we’ll need to compute the class probabilities ourselves. Observe that we don’t actually care which label is associated with each probability, only what the probabilities are:
from collections import Counter
# (in Python 2, also put "from __future__ import division" at the top of the
#  file so that count / total_count below is true division)

def class_probabilities(labels):
    total_count = len(labels)
    return [count / total_count
            for count in Counter(labels).values()]

def data_entropy(labeled_data):
    labels = [label for _, label in labeled_data]
    probabilities = class_probabilities(labels)
    return entropy(probabilities)
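As a quick sanity check (the specific example values here are illustrative, not from the original text), a set with three True labels and one False label is fairly predictable, so its entropy should be well below one bit:

data_entropy([(None, True), (None, False), (None, True), (None, True)])
# roughly 0.81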
The Entropy of a Partition
What we’ve done so far is compute the entropy (think “uncertainty”) of a single set of labeled data. Now, each stage of a decision tree involves asking a question whose answer partitions data into one or (hopefully) more subsets. For instance, our “does it have more than five legs?” question partitions animals into those that have more than five legs (e.g., spiders) and those that don’t (e.g., echidnas).
Correspondingly, we’d like some notion of the entropy that results from partitioning a set of data in a certain way. We want a partition to have low entropy if it splits the data into subsets that themselves have low entropy (i.e., are highly certain), and high entropy if it contains subsets that (are large and) have high entropy (i.e., are highly uncertain).
For example, my “Australian five-cent coin” question was pretty dumb (albeit pretty lucky!), as it partitioned the remaining animals at that point into $S_1$ = {echidna} and $S_2$ = {everything else}, where $S_2$ is both large and high-entropy. ($S_1$ has no entropy but it represents a small fraction of the remaining “classes.”)
Mathematically, if we partition our data $S$ into subsets $S_1, \dots, S_m$ containing proportions $q_1, \dots, q_m$ of the data, then we compute the entropy of the partition as a weighted sum:

$$H = q_1 H(S_1) + \cdots + q_m H(S_m)$$
which we can implement as:
def partition_entropy(subsets):
    """find the entropy from this partition of data into subsets (a list of lists of labeled data)"""
    total_count = sum(len(subset) for subset in subsets)
    return sum(data_entropy(subset) * len(subset) / total_count for subset in subsets)
One problem with this approach is that partitioning by an attribute with many different values will result in a very low entropy due to overfitting. For example, imagine you work for a bank and are trying to build a decision tree to predict which of your customers are likely to default on their mortgages, using some historical data as your training set. Imagine further that the data set contains each customer’s Social Security number. Partitioning on SSN will produce one-person subsets, each of which necessarily has zero entropy. But a model that relies on SSN is certain not to generalize beyond the training set. For this reason, you should probably try to avoid (or bucket, if appropriate) attributes with large numbers of possible values when creating decision trees.
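To see this effect in miniature (an illustrative check, not from the original text), partitioning labeled data into one-element subsets drives the partition entropy all the way to zero, no matter how unpredictable the labels are:

labeled_data = [({"ssn": i}, i % 2 == 0) for i in range(10)]   # made-up data
singleton_subsets = [[pair] for pair in labeled_data]          # one subset per person
partition_entropy(singleton_subsets)                           # 0.0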
Creating a Decision Tree
The VP provides you with the interviewee data, consisting of (per your specification) pairs (input, label), where each input is a dict of candidate attributes, and each label is either True (the candidate interviewed well) or False (the candidate interviewed poorly). In particular, you are provided with each candidate’s level, her preferred language, whether she is active on Twitter, and whether she has a PhD:
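The listing of the data set itself did not survive in this copy. The reconstruction below is consistent with the partition entropies computed later in the chapter, but treat it as a stand-in rather than the original listing:

inputs = [
    ({'level':'Senior', 'lang':'Java',   'tweets':'no',  'phd':'no'},  False),
    ({'level':'Senior', 'lang':'Java',   'tweets':'no',  'phd':'yes'}, False),
    ({'level':'Mid',    'lang':'Python', 'tweets':'no',  'phd':'no'},  True),
    ({'level':'Junior', 'lang':'Python', 'tweets':'no',  'phd':'no'},  True),
    ({'level':'Junior', 'lang':'R',      'tweets':'yes', 'phd':'no'},  True),
    ({'level':'Junior', 'lang':'R',      'tweets':'yes', 'phd':'yes'}, False),
    ({'level':'Mid',    'lang':'R',      'tweets':'yes', 'phd':'yes'}, True),
    ({'level':'Senior', 'lang':'Python', 'tweets':'no',  'phd':'no'},  False),
    ({'level':'Senior', 'lang':'R',      'tweets':'yes', 'phd':'no'},  True),
    ({'level':'Junior', 'lang':'Python', 'tweets':'yes', 'phd':'no'},  True),
    ({'level':'Senior', 'lang':'Python', 'tweets':'yes', 'phd':'yes'}, True),
    ({'level':'Mid',    'lang':'Python', 'tweets':'no',  'phd':'yes'}, True),
    ({'level':'Mid',    'lang':'Java',   'tweets':'yes', 'phd':'no'},  True),
    ({'level':'Junior', 'lang':'Python', 'tweets':'no',  'phd':'yes'}, False)
]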
Our tree will consist of decision nodes (which ask a question and direct us differently depending on the answer) and leaf nodes (which give us a prediction). We will build it using the relatively simple ID3 algorithm, which operates in the following manner. Let’s say we’re given some labeled data, and a list of attributes to consider branching on.
§ If the data all have the same label, then create a leaf node that predicts that label and then stop.
§ If the list of attributes is empty (i.e., there are no more possible questions to ask), then create a leaf node that predicts the most common label and then stop.
§ Otherwise, try partitioning the data by each of the attributes.
§ Choose the partition with the lowest partition entropy.
§ Add a decision node based on the chosen attribute.
§ Recur on each partitioned subset using the remaining attributes.
This is what’s known as a “greedy” algorithm because, at each step, it chooses the most immediately best option. Given a data set, there may be a better tree with a worse-looking first move. If so, this algorithm won’t find it. Nonetheless, it is relatively easy to understand and implement, which makes it a good place to begin exploring decision trees.
Let’s manually go through these steps on the interviewee data set. The data set has both True and False labels, and we have four attributes we can split on. So our first step will be to find the partition with the least entropy. We’ll start by writing a function that does the partitioning:
from collections import defaultdict

def partition_by(inputs, attribute):
    """each input is a pair (attribute_dict, label).
    returns a dict : attribute_value -> inputs"""
    groups = defaultdict(list)
    for input in inputs:
        key = input[0][attribute]   # get the value of the specified attribute
        groups[key].append(input)   # then add this input to the correct list
    return groups
and one that uses it to compute entropy:
def partition_entropy_by(inputs, attribute):
    """computes the entropy corresponding to the given partition"""
    partitions = partition_by(inputs, attribute)
    return partition_entropy(partitions.values())
Then we just need to find the minimum-entropy partition for the whole data set:
for key in ['level', 'lang', 'tweets', 'phd']:
    print key, partition_entropy_by(inputs, key)

# level 0.693536138896
# lang 0.860131712855
# tweets 0.788450457308
# phd 0.892158928262
The lowest entropy comes from splitting on level, so we’ll need to make a subtree for each possible level value. Every Mid candidate is labeled True, which means that the Mid subtree is simply a leaf node predicting True. For Senior candidates, we have a mix of Trues and Falses, so we need to split again:
senior_inputs = [(input, label)
                 for input, label in inputs if input["level"] == "Senior"]

for key in ['lang', 'tweets', 'phd']:
    print key, partition_entropy_by(senior_inputs, key)

# lang 0.4
# tweets 0.0
# phd 0.950977500433
This shows us that our next split should be on tweets, which results in a zero-entropy partition. For these Senior-level candidates, “yes” tweets always result in True while “no” tweets always result in False.
Finally, if we do the same thing for the Junior candidates, we end up splitting on phd, after which we find that no PhD always results in True and PhD always results in False.
Figure 17-3 shows the complete decision tree.
Putting It All Together
Now that we’ve seen how the algorithm works, we would like to implement it more generally. This means we need to decide how we want to represent trees. We’ll use pretty much the most lightweight representation possible. We define a tree to be one of the following:
§ True
§ False
§ a tuple (attribute, subtree_dict)
Here True represents a leaf node that returns True for any input, False represents a leaf node that returns False for any input, and a tuple represents a decision node that, for any input, finds its attribute value, and classifies the input using the corresponding subtree.
With this representation, our hiring tree would look like:
('level',
 {'Junior': ('phd', {'no': True, 'yes': False}),
  'Mid': True,
  'Senior': ('tweets', {'no': False, 'yes': True})})
There’s still the question of what to do if we encounter an unexpected (or missing) attribute value. What should our hiring tree do if it encounters a candidate whose level is “Intern”? We’ll handle this case by adding a None key that just predicts the most common label. (Although this would be a bad idea if None is actually a value that appears in the data.)
Given such a representation, we can classify an input with:
def classify(tree, input):
    """classify the input using the given decision tree"""

    # if this is a leaf node, return its value
    if tree in [True, False]:
        return tree

    # otherwise this tree consists of an attribute to split on
    # and a dictionary whose keys are values of that attribute
    # and whose values are subtrees to consider next
    attribute, subtree_dict = tree

    subtree_key = input.get(attribute)    # None if input is missing attribute

    if subtree_key not in subtree_dict:   # if no subtree for key,
        subtree_key = None                # we'll use the None subtree

    subtree = subtree_dict[subtree_key]   # choose the appropriate subtree
    return classify(subtree, input)       # and use it to classify the input
All that’s left is to build the tree representation from our training data:
def build_tree_id3(inputs, split_candidates=None):

    # if this is our first pass,
    # all keys of the first input are split candidates
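    # NOTE: the rest of this function did not survive in this copy; the body
    # below is a sketch that follows the ID3 steps listed earlier, built on
    # the helpers partition_by and partition_entropy_by defined above
    if split_candidates is None:
        split_candidates = inputs[0][0].keys()

    # count Trues and Falses in the inputs
    num_trues = len([label for item, label in inputs if label])
    num_falses = len(inputs) - num_trues

    if num_trues == 0: return False      # only Falses? return a "False" leaf
    if num_falses == 0: return True      # only Trues?  return a "True" leaf

    if not split_candidates:             # no questions left to ask?
        return num_trues >= num_falses   # return the majority label

    # otherwise split on the lowest-entropy attribute
    best_attribute = min(split_candidates,
                         key=lambda a: partition_entropy_by(inputs, a))

    partitions = partition_by(inputs, best_attribute)
    new_candidates = [a for a in split_candidates if a != best_attribute]

    # recursively build the subtrees
    subtrees = {attribute_value: build_tree_id3(subset, new_candidates)
                for attribute_value, subset in partitions.items()}

    subtrees[None] = num_trues > num_falses   # default case for unseen values

    return (best_attribute, subtrees)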
In the tree we built, every leaf consisted entirely of True inputs or entirely of False inputs. This means that the tree predicts perfectly on the training data set. But we can also apply it to new data that wasn’t in the training set:
tree = build_tree_id3(inputs)

classify(tree, {"level": "Junior",
                "lang": "Java",
                "tweets": "yes",
                "phd": "no"})      # True

classify(tree, {"level": "Junior",
                "lang": "Java",
                "tweets": "yes",
                "phd": "yes"})     # False
And also to data with missing or unexpected values:
classify(tree, {"level": "Intern"})   # True
classify(tree, {"level": "Senior"})   # False
NOTE
Since our goal was mainly to demonstrate how to build a tree, we built the tree using the entire data set. As always, if we were really trying to create a good model for something, we would have (collected more data and) split the data into train/validation/test subsets.
Random Forests
Given how closely decision trees can fit themselves to their training data, it’s not surprising that they have a tendency to overfit. One way of avoiding this is a technique called random forests, in which we build multiple decision trees and let them vote on how to classify inputs:
def forest_classify(trees, input):
    votes = [classify(tree, input) for tree in trees]
    vote_counts = Counter(votes)
    return vote_counts.most_common(1)[0][0]
Our tree-building process was deterministic, so how do we get random trees?
One piece involves bootstrapping data (recall “Digression: The Bootstrap”). Rather than training each tree on all the inputs in the training set, we train each tree on the result of bootstrap_sample(inputs). Since each tree is built using different data, each tree will be different from every other tree. (A side benefit is that it’s totally fair to use the nonsampled data to test each tree, which means you can get away with using all of your data as the training set if you are clever in how you measure performance.) This technique is known as bootstrap aggregating or bagging.
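For example, a small bagged forest might be built like the following sketch (it assumes bootstrap_sample from that digression takes a list and returns a same-sized resample; build_tree_id3 and forest_classify are the functions above):

num_trees = 10
trees = [build_tree_id3(bootstrap_sample(inputs))
         for _ in range(num_trees)]

forest_classify(trees, {"level": "Junior", "lang": "Java",
                        "tweets": "yes", "phd": "no"})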
A second source of randomness involves changing the way we choose the best_attribute to split on. Rather than looking at all the remaining attributes, we first choose a random subset of them and then split on whichever of those is best:
    # if there's already few enough split candidates, look at all of them;
    # otherwise sample (num_split_candidates, random, and functools.partial
    # are assumed to be defined/imported elsewhere)
    if len(split_candidates) <= num_split_candidates:
        sampled_split_candidates = split_candidates
    else:
        sampled_split_candidates = random.sample(split_candidates,
                                                 num_split_candidates)

    # now choose the best attribute only from those candidates
    best_attribute = min(sampled_split_candidates,
                         key=partial(partition_entropy_by, inputs))

    partitions = partition_by(inputs, best_attribute)
This is an example of a broader technique called ensemble learning in which we combine several weak learners (typically high-bias, low-variance models) in order to produce an overall strong model.
Random forests are one of the most popular and versatile models around.
For Further Exploration
§ scikit-learn has many Decision Tree models. It also has an ensemble module that includes a RandomForestClassifier as well as other ensemble methods. (A short scikit-learn sketch follows this list.)
§ We barely scratched the surface of decision trees and their algorithms. Wikipedia is a good starting point for broader exploration.
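For instance, a minimal scikit-learn sketch might look like the following (the DictVectorizer one-hot encoding and the n_estimators value are illustrative choices, not from the original text):

from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier

# one-hot encode the dict-of-attributes inputs from this chapter
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform([input for input, _ in inputs])
y = [label for _, label in inputs]

forest = RandomForestClassifier(n_estimators=10)
forest.fit(X, y)

candidate = {"level": "Junior", "lang": "Java", "tweets": "yes", "phd": "no"}
forest.predict(vectorizer.transform([candidate]))   # predicted label for the new candidate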