Collaborative filtering

In the modern world, one often encounters the problem of recommending goods or services to users of an information system. In the old days, recommendations were made with a summary of the most popular products: this can be observed even now by opening the same Google Play . But over time, such recommendations began to be superseded by targeted (targeted) offers: users are recommended not just popular products, but those products that they surely will like. Not so long ago, Netflix held a contest with a prize pool of $ 1 million, whose task was to improve the algorithm for recommending films ( more ). How do similar algorithms work?

This article discusses the collaborative filtering algorithm for user similarity, determined using the cosine measure, as well as its implementation in python.

Input data

Suppose we have a matrix of ratings that users submitted to products, for simplicity of presentation, products are assigned numbers 1-9:
You can set it using a csv file in which the first column is the user name, the second is the product identifier, and the third is the user’s rating. Thus, we need a csv file with the following contents:

To begin with, we will develop a function that will read the above csv file. To store the recommendations, we will use the standard dict data structure for python: each user is assigned a directory of his ratings of the form “product”: “rating”. The following code will turn out:
import csv
def ReadFile (filename = "<csv_file_location>"):
    f = open (filename)
    r = csv.reader (f)
    mentions = dict()
    for line in r:
        user    = line[0]
        product = line[1]
        rate    = float(line[2])
        if not user in mentions:
            mentions[user] = dict()
        mentions[user][product] = rate
    return mentions

Similarity measure

Intuitively, to recommend a product to user # 1, you need to choose from products that some 2-3-4-etc. Users like, which are most similar in their ratings to user # 1. How to get a numerical expression of this "similarity" of users? Let's say we have M products. The ratings set by an individual user are a vector in the M-dimensional space of products, and we can compare vectors. Among the possible measures are the following:
  1. Cosine measure
  2. Pearson correlation coefficient
  3. Euclidean distance
  4. Tanimoto coefficient
  5. Manhattan distance, etc.

In more detail, various measures and aspects of their application, I am going to consider in a separate article. For now, suffice it to say that in recommender systems the cosine measure and Tanimoto's correlation coefficient are most often used. Let us consider in more detail the cosine measure, which we are going to implement. The cosine measure for two vectors is the cosine of the angle between them. From the school course of mathematics, we remember that the cosine of the angle between two vectors is their scalar product divided by the length of each of the two vectors: We
implement the calculation of this measure, not forgetting that we have many user ratings presented in the form of a “product” dict : "Rating"
def distCosine (vecA, vecB):
    def dotProduct (vecA, vecB):
        d = 0.0
        for dim in vecA:
            if dim in vecB:
                d += vecA[dim]*vecB[dim]
        return d
    return dotProduct (vecA,vecB) / math.sqrt(dotProduct(vecA,vecA)) / math.sqrt(dotProduct(vecB,vecB))

During the implementation, the fact was used that the scalar product of the vector itself gives the square of the length of the vector - this is not the best solution from the point of view of performance, but in our example the speed is not fundamental.

Collaborative Filtering Algorithm

So, we have a matrix of user preferences and we are able to determine how two users are similar to each other. Now it remains to implement the collaborative filtering algorithm, which consists in the following:
  1. Select L users whose tastes are most similar to the tastes of the one in question. To do this, for each of the users, it is necessary to calculate the selected measure (in our case, cosine) in relation to the user in question, and select the L largest ones. For Ivan, from the table above, we get the following values:
  2. For each of the users, multiply his ratings by the calculated measure, so the ratings of more “similar” users will have a stronger effect on the final position of the product, which can be seen in the table in the illustration below
  3. For each product, calculate the sum of the calibrated estimates of L of the closest users, divide the resulting amount by the sum of measures L of the selected users. The sum is shown in the illustration in the “sum” line, the final value in the “result” line.
    Gray indicates the columns of products that have already been evaluated by the user in question and re-offer them to him does not make sense

In the form of a formula, this algorithm can be represented as
where the function sim is the measure of similarity of two users that we have chosen, U is the set of users, r is the assigned grade, k is the normalization coefficient:
Now it remains only to write the corresponding code
import math
def makeRecommendation (userID, userRates, nBestUsers, nBestProducts):
    matches = [(u, distCosine(userRates[userID], userRates[u])) for u in userRates if u <> userID]
    bestMatches = sorted(matches, key=lambda(x,y):(y,x), reverse=True)[:nBestUsers]
    print "Most correlated with '%s' users:" % userID
    for line in bestMatches:
        print "  UserID: %6s  Coeff: %6.4f" % (line[0], line[1])    
    sim = dict()
    sim_all = sum([x[1] for x in bestMatches])
    bestMatches = dict([x for x in bestMatches if x[1] > 0.0])        
    for relatedUser in bestMatches:
        for product in userRates[relatedUser]:
            if not product in userRates[userID]:
                if not product in sim:
                    sim[product] = 0.0
                sim[product] += userRates[relatedUser][product] * bestMatches[relatedUser]
    for product in sim:
        sim[product] /= sim_all
    bestProducts = sorted(sim.iteritems(), key=lambda(x,y):(y,x), reverse=True)[:nBestProducts]
    print "Most correlated products:"
    for prodInfo in bestProducts:    
        print "  ProductID: %6s  CorrelationCoeff: %6.4f" % (prodInfo[0], prodInfo[1])
    return [(x[0], x[1]) for x in bestProducts]

To check its health, you can run the following command:
rec = makeRecommendation ('ivan', ReadFile(), 5, 5)

Which will result in the following result:


We looked at an example and implemented one of the simplest options for collaborative filtering using a cosine similarity measure. It is important to understand that there are other approaches to collaborative filtering, other formulas for calculating product ratings, other measures of similarity ( article , section “See also”). Further development of this idea can be carried out in the following areas:
  1. Optimization of used data structures . When storing data in python as a dict, each time a particular value is accessed, a hash is calculated and the situation becomes worse the longer the name string. In practical tasks, sparse matrices can be used to store data , and instead of text user names and product names, use numerical identifiers (number all users and all products)
  2. Performance optimization . Obviously, calculating the recommendation for each user contact is extremely costly. There are several workarounds for this problem:
    • Clustering users and calculating measures of similarity only between users belonging to the same cluster
    • Calculation of product-product similarity factors. To do this, you need to transpose the use-product matrix (you get the product-user matrix), after which for each of the products calculate the set of the most similar products using the same cosine measure and remembering the k closest ones. This is a rather time-consuming process, so you can produce it once every M hours / days. But now we have a list of products similar to this one, and multiplying user ratings by the value of the product similarity measure, we get a recommendation for O (N * k) , where N - number of user ratings
  3. Selection of measures of similarity . The cosine measure is one of the most commonly used, but the choice of measure should be made only by the results of the analysis of the system
  4. Modification of the filtering algorithm . Perhaps another filtering algorithm will give more accurate recommendations in a particular system. Again, a comparison of various algorithms can only be done when applied to a specific system.


  1. We program collective mind
  2. Collaborative Wiki Filtering
  3. Cosine measure
  4. Sparse Matrices