Chapter 22. Recommender Systems

O nature, nature, why art thou so dishonest, as ever to send men with these false recommendations into the world!

Henry Fielding

Another common data problem is producing recommendations of some sort. Netflix recommends movies you might want to watch. Amazon recommends products you might want to buy. Twitter recommends users you might want to follow. In this chapter, we’ll look at several ways to use data to make recommendations.

In particular, we’ll look at the data set of users_interests that we’ve used before:

users_interests = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],
    ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
    ["statistics", "probability", "mathematics", "theory"],
    ["machine learning", "scikit-learn", "Mahout", "neural networks"],
    ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
    ["Hadoop", "Java", "MapReduce", "Big Data"],
    ["statistics", "R", "statsmodels"],
    ["C++", "deep learning", "artificial intelligence", "probability"],
    ["pandas", "R", "Python"],
    ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
    ["libsvm", "regression", "support vector machines"]
]

And we’ll think about the problem of recommending new interests to a user based on her currently specified interests.

Manual Curation

Before the Internet, when you needed book recommendations you would go to the library, where a librarian was available to suggest books that were relevant to your interests or similar to books you liked.

Given DataSciencester’s limited number of users and interests, it would be easy for you to spend an afternoon manually recommending interests for each user. But this method doesn’t scale particularly well, and it’s limited by your personal knowledge and imagination. (Not that I’m suggesting that your personal knowledge and imagination are limited.) So let’s think about what we can do with data.

Recommending What’s Popular

One easy approach is to simply recommend what’s popular:

from collections import Counter

popular_interests = Counter(interest
                            for user_interests in users_interests
                            for interest in user_interests).most_common()

which looks like:

[('Python', 4),
 ('R', 4),
 ('Java', 3),
 ('regression', 3),
 ('statistics', 3),
 ('probability', 3),
 # ...
]

Having computed this, we can just suggest to a user the most popular interests that he’s not already interested in:

def most_popular_new_interests(user_interests, max_results=5):
    suggestions = [(interest, frequency)
                   for interest, frequency in popular_interests
                   if interest not in user_interests]
    return suggestions[:max_results]

So, if you are user 1, with interests:

["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"]

then we’d recommend you:

most_popular_new_interests(users_interests[1], 5)
 
# [('Python', 4), ('R', 4), ('Java', 3), ('regression', 3), ('statistics', 3)]

If you are user 3, who’s already interested in many of those things, you’d instead get:

[('Java', 3),
 ('HBase', 3),
 ('Big Data', 3),
 ('neural networks', 2),
 ('Hadoop', 2)]
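
And for a brand-new user with no interests at all (not one of our 15 users, just a quick sketch), the function simply falls back to the overall most popular interests:

most_popular_new_interests([])
# [('Python', 4), ('R', 4), ...]  -- the top five overall
# (interests with equal counts may come out in a different order)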

Of course, “lots of people are interested in Python so maybe you should be too” is not the most compelling sales pitch. If someone is brand new to our site and we don’t know anything about them, that’s possibly the best we can do. Let’s see how we can do better by basing each user’s recommendations on her interests.

User-Based Collaborative Filtering

One way of taking a user’s interests into account is to look for users who are somehow similar to him, and then suggest the things that those users are interested in.

In order to do that, we’ll need a way to measure how similar two users are. Here we’ll use a metric called cosine similarity. Given two vectors, v and w, it’s defined (using the dot function from Chapter 4, “Linear Algebra”) as:

import math

def cosine_similarity(v, w):
    return dot(v, w) / math.sqrt(dot(v, v) * dot(w, w))

It measures the “angle” between v and w. If v and w point in the same direction, then the numerator and denominator are equal, and their cosine similarity equals 1. If v and w point in opposite directions, then their cosine similarity equals -1. And if v is 0 whenever w is not (and vice versa), then dot(v, w) is 0 and so the cosine similarity will be 0.
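
For example (a quick sketch, repeating the dot function from Chapter 4 so the snippet stands alone), parallel vectors have similarity 1, opposite vectors -1, and perpendicular vectors 0:

def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n, as in Chapter 4"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

cosine_similarity([1, 1], [2, 2])    #  1.0  (same direction)
cosine_similarity([1, 1], [-1, -1])  # -1.0  (opposite directions)
cosine_similarity([1, 0], [0, 1])    #  0.0  (perpendicular)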

We’ll apply this to vectors of 0s and 1s, each vector v representing one user’s interests. v[i] will be 1 if the user specified the ith interest, and 0 otherwise. Accordingly, “similar users” will mean “users whose interest vectors most nearly point in the same direction.” Users with identical interests will have similarity 1. Users with no identical interests will have similarity 0. Otherwise the similarity will fall in between, with numbers closer to 1 indicating “very similar” and numbers closer to 0 indicating “not very similar.”

A good place to start is collecting the known interests and (implicitly) assigning indices to them. We can do this by using a set comprehension to find the unique interests, putting them in a list, and then sorting them. The first interest in the resulting list will be interest 0, and so on:

unique_interests = sorted(list({ interest
                                 for user_interests in users_interests
                                 for interest in user_interests }))

This gives us a list that starts:

['Big Data',
 'C++',
 'Cassandra',
 'HBase',
 'Hadoop',
 'Haskell',
 # ...
]

Next we want to produce an “interest” vector of 0s and 1s for each user. We just need to iterate over the unique_interests list, substituting a 1 if the user has that interest and a 0 if not:

def make_user_interest_vector(user_interests):
    """given a list of interests, produce a vector whose ith element is 1
    if unique_interests[i] is in the list, 0 otherwise"""
    return [1 if interest in user_interests else 0
            for interest in unique_interests]

after which, we can create a matrix of user interests simply by mapping this function over the list of lists of interests:

user_interest_matrix = list(map(make_user_interest_vector, users_interests))

Now user_interest_matrix[i][j] equals 1 if user i specified interest j, 0 otherwise.
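
As a quick sanity check (not part of the original code), a couple of entries match the raw data:

hadoop = unique_interests.index("Hadoop")   # user 0 listed "Hadoop"
python = unique_interests.index("Python")   # but not "Python"

assert user_interest_matrix[0][hadoop] == 1
assert user_interest_matrix[0][python] == 0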

Because we have a small data set, it’s no problem to compute the pairwise similarities between all of our users:

user_similarities = [[cosine_similarity(interest_vector_i, interest_vector_j)
                      for interest_vector_j in user_interest_matrix]
                     for interest_vector_i in user_interest_matrix]

after which, user_similarities[i][j] gives us the similarity between users i and j.

For instance, user_similarities[0][9] is 0.57, as those two users share interests in Hadoop, Java, and Big Data. On the other hand, user_similarities[0][8] is only 0.19, as users 0 and 8 share only one interest, Big Data.
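
You can verify those numbers by hand (a quick sketch): user 0 lists seven interests, users 8 and 9 list four each, and they share one and three interests with user 0, respectively:

import math

3 / math.sqrt(7 * 4)   # 0.5669..., user 0 vs. user 9 (three shared interests)
1 / math.sqrt(7 * 4)   # 0.1889..., user 0 vs. user 8 (one shared interest)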

In particular, user_similarities[i] is the vector of user i’s similarities to every other user. We can use this to write a function that finds the most similar users to a given user. We’ll make sure not to include the user herself, nor any users with zero similarity. And we’ll sort the results from most similar to least similar:

def most_similar_users_to(user_id):
    pairs = [(other_user_id, similarity)                      # find other
             for other_user_id, similarity in                 # users with
                enumerate(user_similarities[user_id])         # nonzero
             if user_id != other_user_id and similarity > 0]  # similarity
 
    return sorted(pairs,                                      # sort them
                  key=lambda pair: pair[1],                   # most similar
                  reverse=True)                               # first

For instance, if we call most_similar_users_to(0) we get:

[(9, 0.5669467095138409),
 (1, 0.3380617018914066),
 (8, 0.1889822365046136),
 (13, 0.1690308509457033),
 (5, 0.1543033499620919)]

How do we use this to suggest new interests to a user? For each interest, we can just add up the user-similarities of the other users interested in it:

def user_based_suggestions(user_id, include_current_interests=False):
    # sum up the similarities
    suggestions = defaultdict(float)
    for other_user_id, similarity in most_similar_users_to(user_id):
        for interest in users_interests[other_user_id]:
            suggestions[interest] += similarity
 
    # convert them to a sorted list
    suggestions = sorted(suggestions.items(),
                         key=lambda pair: pair[1],
                         reverse=True)
 
    # and (maybe) exclude already-interests
    if include_current_interests:
        return suggestions
    else:
        return [(suggestion, weight)
                for suggestion, weight in suggestions
                if suggestion not in users_interests[user_id]]

If we call user_based_suggestions(0), the first several suggested interests are:

[('MapReduce', 0.5669467095138409),
 ('MongoDB', 0.50709255283711),
 ('Postgres', 0.50709255283711),
 ('NoSQL', 0.3380617018914066),
 ('neural networks', 0.1889822365046136),
 ('deep learning', 0.1889822365046136),
 ('artificial intelligence', 0.1889822365046136),
 #...
]

These seem like pretty decent suggestions for someone whose stated interests are “Big Data” and database-related. (The weights aren’t intrinsically meaningful; we just use them for ordering.)

This approach doesn’t work as well when the number of items gets very large. Recall the curse of dimensionality from Chapter 12 — in large-dimensional vector spaces most vectors are very far apart (and therefore point in very different directions). That is, when there are a large number of interests the “most similar users” to a given user might not be similar at all.

Imagine a site like Amazon.com, from which I’ve bought thousands of items over the last couple of decades. You could attempt to identify similar users to me based on buying patterns, but most likely in all the world there’s no one whose purchase history looks even remotely like mine. Whoever my “most similar” shopper is, he’s probably not similar to me at all, and his purchases would almost certainly make for lousy recommendations.

Item-Based Collaborative Filtering

An alternative approach is to compute similarities between interests directly. We can then generate suggestions for each user by aggregating interests that are similar to her current interests.

To start with, we’ll want to transpose our user-interest matrix so that rows correspond to interests and columns correspond to users:

interest_user_matrix = [[user_interest_vector[j]
                         for user_interest_vector in user_interest_matrix]
                        for j, _ in enumerate(unique_interests)]

What does this look like? Row j of interest_user_matrix is column j of user_interest_matrix. That is, it has 1 for each user with that interest and 0 for each user without that interest.

For example, unique_interests[0] is Big Data, and so interest_user_matrix[0] is:

[1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]

because users 0, 8, and 9 indicated interest in Big Data.
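
Again as a quick check (not from the original code), the positions of the 1s in that row line up with those three users:

big_data_users = [user_id
                  for user_id, has_interest in enumerate(interest_user_matrix[0])
                  if has_interest == 1]

assert big_data_users == [0, 8, 9]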

We can now use cosine similarity again. If precisely the same users are interested in two topics, their similarity will be 1. If no two users are interested in both topics, their similarity will be 0:

interest_similarities = [[cosine_similarity(user_vector_i, user_vector_j)
                          for user_vector_j in interest_user_matrix]
                         for user_vector_i in interest_user_matrix]

For example, we can find the interests most similar to Big Data (interest 0) using:

def most_similar_interests_to(interest_id):
    similarities = interest_similarities[interest_id]
    pairs = [(unique_interests[other_interest_id], similarity)
             for other_interest_id, similarity in enumerate(similarities)
             if interest_id != other_interest_id and similarity > 0]
    return sorted(pairs,
                  key=lambda pair: pair[1],
                  reverse=True)

Calling most_similar_interests_to(0) suggests the following similar interests:

[('Hadoop', 0.8164965809277261),
 ('Java', 0.6666666666666666),
 ('MapReduce', 0.5773502691896258),
 ('Spark', 0.5773502691896258),
 ('Storm', 0.5773502691896258),
 ('Cassandra', 0.4082482904638631),
 ('artificial intelligence', 0.4082482904638631),
 ('deep learning', 0.4082482904638631),
 ('neural networks', 0.4082482904638631),
 ('HBase', 0.3333333333333333)]

Now we can create recommendations for a user by summing up the similarities of the interests similar to his:

def item_based_suggestions(user_id, include_current_interests=False):
    # add up the similar interests
    suggestions = defaultdict(float)
    user_interest_vector = user_interest_matrix[user_id]
    for interest_id, is_interested in enumerate(user_interest_vector):
        if is_interested == 1:
            similar_interests = most_similar_interests_to(interest_id)
            for interest, similarity in similar_interests:
                suggestions[interest] += similarity
 
    # sort them by weight
    suggestions = sorted(suggestions.items(),
                         key=lambda pair: pair[1],
                         reverse=True)
 
    if include_current_interests:
        return suggestions
    else:
        return [(suggestion, weight)
                for suggestion, weight in suggestions
                if suggestion not in users_interests[user_id]]

For user 0, this generates the following (seemingly reasonable) recommendations:

[('MapReduce', 1.861807319565799),
 ('Postgres', 1.3164965809277263),
 ('MongoDB', 1.3164965809277263),
 ('NoSQL', 1.2844570503761732),
 ('programming languages', 0.5773502691896258),
 ('MySQL', 0.5773502691896258),
 ('Haskell', 0.5773502691896258),
 ('databases', 0.5773502691896258),
 ('neural networks', 0.4082482904638631),
 ('deep learning', 0.4082482904638631),
 ('C++', 0.4082482904638631),
 ('artificial intelligence', 0.4082482904638631),
 ('Python', 0.2886751345948129),
 ('R', 0.2886751345948129)]

For Further Exploration

§ Crab is a framework for building recommender systems in Python.

§ GraphLab also has a recommender toolkit.

§ The Netflix Prize was a somewhat famous competition to build a better system to recommend movies to Netflix users.