Making Recommendations

A recommender system is what you experience when a site suggests something to you, whether a movie, a product, or something else.

Movie Recommendations

Let’s say we have a small dataset of movie reviews:

n_u = number of users
n_m = number of movies
r(i,j) = 1 if user j has rated movie i
y^(i,j) = rating given by user j to movie i, defined only if r(i,j) = 1
Note that not every user has rated every movie
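To make the notation concrete, here is a minimal NumPy sketch. The ratings are made-up illustrative values, not the article's dataset, and the names `Y` and `R` are choices of this sketch; `np.nan` marks a rating that does not exist.

```python
import numpy as np

# Hypothetical ratings: rows are movies (i), columns are users (j).
# np.nan marks "user j has not rated movie i".
Y = np.array([
    [5.0, 5.0, 0.0, 0.0],
    [5.0, np.nan, np.nan, 0.0],
    [np.nan, 4.0, 0.0, np.nan],
    [0.0, 0.0, 5.0, 4.0],
    [0.0, 0.0, 5.0, np.nan],
])

n_m, n_u = Y.shape                 # number of movies, number of users
R = (~np.isnan(Y)).astype(int)     # r(i,j) = 1 where a rating exists, else 0

print("movies:", n_m, "users:", n_u)
print("number of ratings:", R.sum())
```

Keeping the mask `R` separate from `Y` makes it easy to restrict any later sum to the rated entries only.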

Let’s predict how users will rate the movies they have not rated yet

Before we proceed, let’s consider two possibilities:

What if we have additional features for these movies, such as x₁ = romance and x₂ = action? Would that help predict how the users will rate the movies?
What if we don’t have any additional features?

We’ll cover the first case first:

Added Features

Each feature value ranges from 0 to 1, with 0 meaning the movie does not have that quality at all and 1 meaning it definitely does
For example, x₁ = 0.9 means a movie is highly romantic, and x₂ = 0.1 means it is not much of an action movie
Let n = number of features (here n = 2)

Let’s predict the rating of Cute puppies of love for Alice

We will use the same model as linear regression: w · x⁽ⁱ⁾ + b
Let’s pretend w⁽¹⁾ = [5, 0] and b⁽¹⁾ = 0, and from the table x⁽³⁾ = [0.99, 0]
Since each movie has two features in x, we need two values in w
So the model gives us 5 × 0.99 + 0 × 0 + 0 = 4.95
This seems reasonable, since she rated the two other romantic movies highly as well
The superscript (1) indicates that Alice is the first user in the dataset; each user j has their own parameters w⁽ʲ⁾ and b⁽ʲ⁾, with j running from 1 to 4 since we have 4 users
So the general formula for predicting the rating of movie i by user j is w⁽ʲ⁾ · x⁽ⁱ⁾ + b⁽ʲ⁾, as shown in the image
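The arithmetic for Alice can be checked with a few lines of NumPy. This is only a sketch: the values of w, b and x are the ones assumed in the text, and the function name `predict_rating` is hypothetical.

```python
import numpy as np

w_1 = np.array([5.0, 0.0])    # assumed parameters for user 1 (Alice)
b_1 = 0.0
x_3 = np.array([0.99, 0.0])   # features of movie 3: [romance, action]

def predict_rating(w, b, x):
    """Linear prediction w . x + b for one user and one movie."""
    return float(np.dot(w, x) + b)

prediction = predict_rating(w_1, b_1, x_3)
print(round(prediction, 2))   # approximately 4.95
```

The same function works for any user j and movie i by passing that user's w and b and that movie's x.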

Cost Function

The cost is built from the error for each rating: the predicted value minus the actual rating, squared as in linear regression
But some users have not rated all the movies, so how can we sum over all the data rows? We don’t
This is how it differs from linear regression: we only sum over the RATED movies, those with r(i,j) = 1
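Written out for a single user j, a common form of this cost is the following. The factor 1/2 and the regularization term with \(\lambda\) are assumptions borrowed from the usual regularized linear regression setup, not something stated in the text above.

```latex
\[
J\big(w^{(j)}, b^{(j)}\big)
  = \frac{1}{2} \sum_{i \,:\, r(i,j)=1}
      \Big( w^{(j)} \cdot x^{(i)} + b^{(j)} - y^{(i,j)} \Big)^{2}
  + \frac{\lambda}{2} \sum_{k=1}^{n} \big( w^{(j)}_{k} \big)^{2}
\]
```

The first sum runs only over the movies i that user j has rated; summing this expression over j = 1, ..., n_u gives the cost for all users.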

To learn the parameters we need to minimize the cost function
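As a rough illustration of that minimization for one user, the sketch below runs plain gradient descent on the squared error over the rated movies only. The feature values, the ratings, the learning rate and the name `user_cost` are all assumptions made for this example.

```python
import numpy as np

# Assumed features for five movies: columns are [romance, action].
X = np.array([[0.9, 0.0], [1.0, 0.01], [0.99, 0.0], [0.1, 1.0], [0.0, 0.9]])
# Assumed ratings from one user; np.nan means "not rated".
y = np.array([5.0, 5.0, np.nan, 0.0, 0.0])
rated = ~np.isnan(y)

def user_cost(w, b, X, y, rated, lam=0.0):
    """Squared error over rated movies only, plus an L2 penalty on w."""
    err = X[rated] @ w + b - y[rated]
    return 0.5 * np.sum(err ** 2) + 0.5 * lam * np.sum(w ** 2)

w, b, alpha = np.zeros(2), 0.0, 0.1
start = user_cost(w, b, X, y, rated)
for _ in range(2000):
    err = X[rated] @ w + b - y[rated]
    w = w - alpha * (X[rated].T @ err)   # gradient w.r.t. w (lam = 0 here)
    b = b - alpha * err.sum()            # gradient w.r.t. b
end = user_cost(w, b, X, y, rated)

prediction = float(X[2] @ w + b)         # rating predicted for the unrated movie
print(start, end, prediction)
```

The unrated movie never enters the cost, yet the learned w and b still produce a prediction for it, which is the whole point of the exercise.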

Blank Features

What if we don’t know the values of the added features? Let’s say a movie just came out and nobody has assigned its feature values yet.

Can we estimate the feature values from the ratings users have given?
Let’s pretend we know the values of w and b for each user, as shown in the image
All the values of b are 0, so let’s drop b from the formula
So if you have ratings from the users for a certain movie, you can estimate x₁ by finding the value that makes w · x match those ratings
In the same way you can estimate x₂
Now that you have both x values, you can predict the missing ratings for all users and all movies, as shown above
This only works because several users have rated the same movie; with too few ratings for a movie, there is not enough information to estimate its x values
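Here is a small sketch of that idea using least squares. The user parameters and the ratings are assumed values chosen for illustration (they are not read from the article's image), and least squares is one reasonable way to "find the x that fits", not necessarily the author's.

```python
import numpy as np

# Assumed, already-known parameters: one row of W per user, all b = 0.
W = np.array([[5.0, 0.0], [5.0, 0.0], [0.0, 5.0], [0.0, 5.0]])
b = np.zeros(4)
# Assumed ratings that the four users gave to one movie.
y = np.array([5.0, 5.0, 0.0, 0.0])

# Find x so that W @ x + b is as close as possible to y (least squares).
x, *_ = np.linalg.lstsq(W, y - b, rcond=None)
print(x)   # estimated [x1, x2] for this movie
```

With these numbers the two romance-loving users rate the movie 5 and the two action-loving users rate it 0, so the estimate comes out as a fully romantic, non-action movie.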

Cost Function
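To learn the features x⁽ⁱ⁾ of a single movie from known w⁽ʲ⁾ and b⁽ʲ⁾, a standard choice is the squared error over the users who rated that movie. The factor 1/2 and the regularization term with \(\lambda\) are assumptions of this sketch.

```latex
\[
J\big(x^{(i)}\big)
  = \frac{1}{2} \sum_{j \,:\, r(i,j)=1}
      \Big( w^{(j)} \cdot x^{(i)} + b^{(j)} - y^{(i,j)} \Big)^{2}
  + \frac{\lambda}{2} \sum_{k=1}^{n} \big( x^{(i)}_{k} \big)^{2}
\]
```

Summing this over all movies i = 1, ..., n_m gives the cost for learning every movie's features at once.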

Collaborative Filtering

Let’s put both scenarios detailed above together.
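Combining the two cases means treating w, b and x all as unknowns and minimizing one cost over every rated (movie, user) pair. A sketch of the usual combined form, under the same assumptions as before (factor 1/2, one shared \(\lambda\)):

```latex
\[
J(w, b, x)
  = \frac{1}{2} \sum_{(i,j) \,:\, r(i,j)=1}
      \Big( w^{(j)} \cdot x^{(i)} + b^{(j)} - y^{(i,j)} \Big)^{2}
  + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \big( w^{(j)}_{k} \big)^{2}
  + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \big( x^{(i)}_{k} \big)^{2}
\]
```

The error term is the same one used in both earlier scenarios; only the set of quantities being learned has grown.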

Gradient Descent
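A minimal NumPy sketch of gradient descent for collaborative filtering follows. It assumes a regularized squared-error cost over rated entries, uses made-up ratings, and picks the learning rate, the number of features and the names (`cofi_cost`, `Y0`) purely for illustration.

```python
import numpy as np

# Hypothetical ratings (rows: movies, columns: users); np.nan = not rated.
Y = np.array([
    [5.0, 5.0, 0.0, 0.0],
    [5.0, np.nan, np.nan, 0.0],
    [np.nan, 4.0, 0.0, np.nan],
    [0.0, 0.0, 5.0, 4.0],
    [0.0, 0.0, 5.0, np.nan],
])
R = ~np.isnan(Y)                  # True where r(i,j) = 1
Y0 = np.where(R, Y, 0.0)          # missing ratings set to 0; masked out below

def cofi_cost(X, W, b, Y0, R, lam):
    """Squared error on rated entries plus L2 penalties on W and X."""
    err = (X @ W.T + b - Y0) * R
    return 0.5 * np.sum(err ** 2) + 0.5 * lam * (np.sum(W ** 2) + np.sum(X ** 2))

rng = np.random.default_rng(0)
n_m, n_u = Y.shape
n = 2                                       # number of features to learn
X = rng.normal(scale=0.1, size=(n_m, n))    # movie features x^(i)
W = rng.normal(scale=0.1, size=(n_u, n))    # user parameters w^(j)
b = np.zeros(n_u)                           # user biases b^(j)
alpha, lam = 0.01, 0.1

costs = [cofi_cost(X, W, b, Y0, R, lam)]
for _ in range(3000):
    err = (X @ W.T + b - Y0) * R            # zero where there is no rating
    grad_X = err @ W + lam * X
    grad_W = err.T @ X + lam * W
    grad_b = err.sum(axis=0)
    # Simultaneous update: all gradients are computed before any variable changes.
    X, W, b = X - alpha * grad_X, W - alpha * grad_W, b - alpha * grad_b
    costs.append(cofi_cost(X, W, b, Y0, R, lam))

predictions = X @ W.T + b                   # predicted rating for every (movie, user)
print(costs[0], costs[-1])
```

Unlike plain linear regression, the movie features X are updated alongside W and b, so nobody has to supply them by hand.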

Binary Labels

Sometimes we only ask users whether they like something or not, which gives binary labels. So we can go back to what we’ve already learned about classification to decide whether we should recommend an item.

Some users have not given any feedback on the item
Here is the breakdown

With binary labels, we predict the probability that the user likes the item, which makes this a binary classification problem
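A sketch of that probability, assuming the usual logistic (sigmoid) function is applied to the same linear term w · x + b. The parameter values and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_like_probability(w, b, x):
    """Estimated probability that the user likes the item: g(w . x + b)."""
    return float(sigmoid(np.dot(w, x) + b))

# Hypothetical user parameters and item features, for illustration only.
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([0.9, 0.1])

p_like = predict_like_probability(w, b, x)
p_neutral = predict_like_probability(np.zeros(2), 0.0, x)
print(p_like, p_neutral)
```

A user whose parameters are all zero gets a probability of exactly 0.5, meaning the model has no opinion either way.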

Cost Function

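For binary labels, the squared-error cost is replaced by the log loss (binary cross-entropy) used in logistic regression, again summed only over the entries where r(i,j) = 1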
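In symbols, and assuming the logistic function g is applied to the same linear term as before, the binary version can be sketched as follows (regularization terms are omitted here for brevity):

```latex
\[
f^{(i,j)} = g\big( w^{(j)} \cdot x^{(i)} + b^{(j)} \big),
\qquad
g(z) = \frac{1}{1 + e^{-z}}
\]
\[
L\big(f^{(i,j)}, y^{(i,j)}\big)
  = - y^{(i,j)} \log f^{(i,j)}
    - \big(1 - y^{(i,j)}\big) \log\big(1 - f^{(i,j)}\big)
\]
\[
J(w, b, x) = \sum_{(i,j) \,:\, r(i,j)=1} L\big(f^{(i,j)}, y^{(i,j)}\big)
\]
```

Each term penalizes a confident wrong prediction heavily and a confident correct one hardly at all.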

Mean Normalization

Normalizing the ratings by their mean helps the algorithm make better predictions and can speed it up

So let’s take all the ratings and put them in a 2D matrix
We average each row (one row per movie); if instead we have a movie that nobody has rated yet, we can average the columns
Of course, we skip the ? marks (missing ratings)
This gives us a vector of average ratings
Then we subtract these averages from the original ratings, which gives us a new matrix
To predict a rating we use the same formula as above, but so that we do not end up with a negative rating (the rating scale runs from 0 to 5), we must add back the average for that row/movie, \(\mu_i\)
So user 5 is predicted to rate the first movie (i = 1) at 2.5
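The steps above can be sketched in NumPy as follows. The ratings matrix is made up for illustration (its last column is a user with no ratings), and the assumption that such a user ends up with w = 0 and b = 0 is stated in the code rather than derived.

```python
import numpy as np

# Hypothetical ratings (rows: movies, columns: users); np.nan = not rated.
# The last column is a new user who has not rated anything yet.
Y = np.array([
    [5.0, 5.0, 0.0, 0.0, np.nan],
    [5.0, np.nan, np.nan, 0.0, np.nan],
    [np.nan, 4.0, 0.0, np.nan, np.nan],
    [0.0, 0.0, 5.0, 4.0, np.nan],
    [0.0, 0.0, 5.0, 0.0, np.nan],
])
R = ~np.isnan(Y)
Y0 = np.where(R, Y, 0.0)

# Per-movie mean over rated entries only (the "?" entries are skipped).
counts = R.sum(axis=1)
mu = Y0.sum(axis=1) / np.maximum(counts, 1)   # guard against dividing by zero

# Subtract each movie's mean from its rated entries; unrated entries stay 0.
Ynorm = np.where(R, Y - mu[:, None], 0.0)

def predict(w, b, x, mu_i):
    """Prediction on the normalized scale, with the movie mean added back."""
    return float(np.dot(w, x) + b + mu_i)

# Assumption: a user with no ratings has w = 0 and b = 0,
# so the prediction for the first movie is just its mean rating.
new_user_prediction = predict(np.zeros(2), 0.0, np.zeros(2), mu[0])
print(mu, new_user_prediction)
```

Without the mean added back, the same user would be predicted to rate every movie 0, which is rarely a sensible default.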

TF Implementation

First we specify the optimizer: Adam in this case
Choose the number of iterations
In each iteration, we open a GradientTape, which records the operations used to compute the cost
Inside the tape we call the cost function J, which takes as inputs X, W, b, Ynorm (the mean-normalized ratings), the indicator of which entries have a rating, the number of users, the number of movies, and the regularization parameter \(\lambda\)
The result, cost_value, is the quantity we differentiate
We then ask the tape for the gradient of cost_value with respect to X, W and b
Finally, the optimizer we specified uses these gradients to update the variables so as to reduce the cost, and the loop repeats
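The loop described above might look roughly like this in TensorFlow. It is a sketch under assumptions: the data is made up, `cofi_cost` is a hypothetical stand-in for the cost function J (it infers the number of users and movies from the tensor shapes instead of taking them as arguments), and the learning rate and iteration count are arbitrary.

```python
import tensorflow as tf

# Hypothetical mean-normalized ratings and rating mask (rows: movies, columns: users).
Ynorm = tf.constant([[2.5, 2.5, -2.5, -2.5],
                     [2.5, 0.0, 0.0, -2.5],
                     [0.0, 2.0, -2.0, 0.0]], dtype=tf.float32)
R = tf.constant([[1, 1, 1, 1],
                 [1, 0, 0, 1],
                 [0, 1, 1, 0]], dtype=tf.float32)
num_movies, num_users = Ynorm.shape
num_features, lambda_ = 2, 1.0

def cofi_cost(X, W, b, Ynorm, R, lambda_):
    """Regularized squared error over rated entries (a stand-in for J)."""
    err = (tf.matmul(X, W, transpose_b=True) + b - Ynorm) * R
    return 0.5 * tf.reduce_sum(err ** 2) + 0.5 * lambda_ * (
        tf.reduce_sum(X ** 2) + tf.reduce_sum(W ** 2))

tf.random.set_seed(0)
X = tf.Variable(tf.random.normal((num_movies, num_features)), name="X")
W = tf.Variable(tf.random.normal((num_users, num_features)), name="W")
b = tf.Variable(tf.zeros((1, num_users)), name="b")

optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)
iterations = 200
history = []
for _ in range(iterations):
    # Record the operations used to compute the cost.
    with tf.GradientTape() as tape:
        cost_value = cofi_cost(X, W, b, Ynorm, R, lambda_)
    # Derivatives of the cost with respect to each trainable variable.
    grads = tape.gradient(cost_value, [X, W, b])
    # One Adam step that updates X, W and b in place.
    optimizer.apply_gradients(zip(grads, [X, W, b]))
    history.append(float(cost_value))

print(history[0], history[-1])
```

The tape only needs to wrap the cost computation; the gradient call and the optimizer step happen after the `with` block has closed.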
