Making Recommendations
A recommender system is what you experience when sites make recommendations, whether for a movie, a product, or something else.
Movie Recommendations
Let’s say we have a small dataset of movie reviews:
- nu = number of users
- nm = number of movies
- r(i,j) = 1 if user j has rated movie i
- y(i,j) = rating given by user j to movie i, defined only if r(i,j) = 1
- Note: not every user has rated every movie
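The notation above can be made concrete with a small hypothetical ratings matrix, where `np.nan` marks a missing rating:

```python
import numpy as np

# Hypothetical toy dataset: 4 users x 5 movies; np.nan marks an unrated entry
Y = np.array([
    [5.0, 5.0, 0.0, 0.0, np.nan],
    [5.0, np.nan, np.nan, 0.0, np.nan],
    [np.nan, 4.0, 0.0, np.nan, np.nan],
    [0.0, 0.0, 5.0, 4.0, np.nan],
]).T  # transpose so rows are movies (i) and columns are users (j)

R = ~np.isnan(Y)    # R[i, j] is True exactly when r(i,j) = 1
n_m, n_u = Y.shape  # number of movies, number of users
print(n_m, n_u)     # 5 movies, 4 users
```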
Let’s predict how users will rate the movies they have not yet rated.
Before we proceed let’s raise another possibility:
- What if we have more features describing these movies, such as x1 = romance and x2 = action? Would that help predict how users will rate them?
- What if we don’t have any additional features?
We’ll cover the first option first:
Added Features
- Each feature value ranges from 0 to 1, with 0 meaning the movie has none of that quality and 1 meaning it definitely does
- For example, if a movie has x1=0.9 it is highly romantic, and x2=0.1 means it is barely an action movie
- Let n=number of features
Let’s predict the rating of Cute puppies of love for Alice
- we will use the linear regression model w(j) · x(i) + b(j)
- Let’s pretend w(1) = [5, 0] and b(1) = 0, and from the table x(3) = [0.99, 0]
- Note: since each movie has two features x, we need two values in w
- So the model predicts 5 × 0.99 + 0 × 0 = 4.95
- Which seems reasonable, since she rated the two other romantic movies highly as well
- The superscript (1) indicates that Alice is the first user in the dataset; with 4 users, each user gets their own parameters, superscripted (1) through (4)
- So the generalized prediction for any user j carries the (j) superscript, as shown in the image
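A minimal sketch of this per-user prediction, using the values from the example above:

```python
import numpy as np

# Hypothetical parameters for Alice (user 1) and features for movie 3
w_1 = np.array([5.0, 0.0])   # Alice's weights (one per feature)
b_1 = 0.0                    # Alice's bias term
x_3 = np.array([0.99, 0.0])  # movie features: [romance, action]

# Per-user linear regression prediction: w(j) . x(i) + b(j)
prediction = np.dot(w_1, x_3) + b_1
print(prediction)  # ≈ 4.95
```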
Cost Function
- As you see below, the cost is the predicted value minus the actual rating, squared
- But some users have not rated all the movies, so how can we sum over all the data rows? We won’t
- This is where it differs from standard linear regression: we only sum over the RATED movies, i.e. pairs where r(i,j) = 1
- To learn the parameters we need to minimize the cost function
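Written out (reconstructed from the standard formulation this course uses, including the regularization term), the cost for a single user \(j\)'s parameters is:

\[
J\left(w^{(j)}, b^{(j)}\right) = \frac{1}{2} \sum_{i:\, r(i,j)=1} \left( w^{(j)} \cdot x^{(i)} + b^{(j)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} \left( w_k^{(j)} \right)^2
\]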
Blank Features
What if we haven’t been given the values of the added features? Let’s say a movie just came out and nobody has labeled its features yet.
- Can we learn the feature values from the users’ ratings instead?
- Let’s pretend we know the values of w and b for each user, as shown in the image
- As you see, all the values of b are 0, so let’s drop b from the formula
- So if you have ratings from all users for a certain movie, you can solve for x1
- Then you can solve for x2
- And now that you have both x values, you can try to predict the missing ratings for all users and all movies, as shown above
- If a movie has no ratings (or too few), you cannot learn its x values
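A sketch of solving for one movie's features given each user's (assumed known) parameters, using least squares; the numbers here are hypothetical:

```python
import numpy as np

# Hypothetical known per-user weights (b = 0 for all users, as in the example)
W = np.array([
    [5.0, 0.0],   # user 1
    [5.0, 0.0],   # user 2
    [0.0, 5.0],   # user 3
    [0.0, 5.0],   # user 4
])
# Ratings all four users gave to one movie
y = np.array([4.95, 4.95, 0.0, 0.0])

# Each rating satisfies W[j] . x ≈ y[j]; solve for x by least squares
x, *_ = np.linalg.lstsq(W, y, rcond=None)
print(x)  # ≈ [0.99, 0.0]
```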
Cost Function
Collaborative Filtering
Let’s put both scenarios detailed above together.
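Combining the two gives the collaborative filtering cost (reconstructed from the standard formulation), minimized over every user's parameters and every movie's features at once:

\[
J(w, b, x) = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( w^{(j)} \cdot x^{(i)} + b^{(j)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( w_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2
\]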
Gradient Descent
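Gradient descent now updates the features x alongside w and b (a sketch of the standard update rules, with learning rate \(\alpha\)):

\[
w_k^{(j)} := w_k^{(j)} - \alpha \frac{\partial J}{\partial w_k^{(j)}}, \qquad b^{(j)} := b^{(j)} - \alpha \frac{\partial J}{\partial b^{(j)}}, \qquad x_k^{(i)} := x_k^{(i)} - \alpha \frac{\partial J}{\partial x_k^{(i)}}
\]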
Binary Labels
Sometimes we just ask the user whether they like something or not. This gives binary labels, so we can fall back on what we’ve already learned to decide whether to recommend an item.
- Some users have not commented on the item
- Here is the breakdown
- Now that we are using binary values, we just need to predict the probability that y(i,j) = 1, which makes this a binary classification model
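For binary labels, the linear score is passed through the logistic (sigmoid) function; a minimal sketch reusing the earlier hypothetical numbers:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical user parameters and movie features
w_j = np.array([5.0, 0.0])
b_j = 0.0
x_i = np.array([0.99, 0.0])

# Probability that user j likes movie i: g(w . x + b)
p = sigmoid(np.dot(w_j, x_i) + b_j)
print(round(p, 3))  # a high probability, since w . x + b = 4.95
```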
Cost Function
Now the cost function for binary labels becomes the binary cross-entropy loss from logistic regression:
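Reconstructed from the standard logistic loss, with \(f_{(w,b,x)}(x) = g\!\left(w^{(j)} \cdot x^{(i)} + b^{(j)}\right)\) and the sum taken over all pairs with \(r(i,j) = 1\):

\[
L\left(f_{(w,b,x)}(x),\, y^{(i,j)}\right) = -\,y^{(i,j)} \log\!\left(f_{(w,b,x)}(x)\right) - \left(1 - y^{(i,j)}\right) \log\!\left(1 - f_{(w,b,x)}(x)\right)
\]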
Mean Normalization
Normalizing the ratings makes the algorithm produce better predictions (especially for users with few or no ratings) and speeds it up.
- So let’s take all the data/reviews and put them in a 2D matrix
- We average each row (movie); if instead we had a movie that hasn’t been shown to anyone, we could average the columns (users) instead
- Of course we skip the ? marks
- Now we get a vector of average ratings
- Then we subtract this mean vector from the original ratings, giving a new, mean-normalized matrix
- To predict a rating we use the same formula as above, but to avoid a negative rating (the scale is 0–5) we must add back the mean value for each row/movie, which is \(\mu_i\)
- So user 5 will get a predicted rating of 2.5 for the first movie, i = 1
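The normalization steps above, sketched with a hypothetical matrix in which user 5 (last column) has rated nothing and the first movie's mean comes out to 2.5:

```python
import numpy as np

# Hypothetical ratings matrix (movies x users); np.nan = not rated
Y = np.array([
    [5.0, 5.0, 0.0, 0.0, np.nan],
    [5.0, np.nan, 4.0, 0.0, np.nan],
    [np.nan, 4.0, 0.0, np.nan, np.nan],
    [0.0, 0.0, 5.0, 4.0, np.nan],
    [0.0, 0.0, 5.0, np.nan, np.nan],
])

mu = np.nanmean(Y, axis=1, keepdims=True)  # per-movie mean, skipping the ?s
Ynorm = Y - mu                             # mean-normalized ratings

# A brand-new user with w = 0, b = 0 is predicted 0 + mu[i]: the movie's average
print(mu[0, 0])  # 2.5 -> predicted rating for user 5 on movie 1
```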
TF Implementation
- First we specify the optimizer: Adam in this case
- Choose number of iterations
- In each iteration we open a GradientTape context, which records the operations used to compute the cost
- Inside the loop, the cost function J takes as inputs: X, W, b, Ynorm (the mean-normalized ratings), which entries have a rating, the number of users, the number of movies, and the regularization parameter \(\lambda\)
- From this cost_value, the tape can work out the derivatives
- Then we ask the tape for the gradient of the cost with respect to the variables
- We apply that gradient through the optimizer we specified, updating the variables to minimize the loss, and repeat
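The same loop can be sketched without TensorFlow, using plain NumPy with hand-derived gradients (plain gradient descent standing in for Adam; all names and numbers here are illustrative, not the course's code):

```python
import numpy as np

def cofi_cost_and_grads(X, W, b, Ynorm, R, lam):
    """Collaborative filtering cost and its gradients, vectorized."""
    E = (X @ W.T + b - Ynorm) * R  # prediction error, zeroed where unrated
    J = 0.5 * np.sum(E ** 2) + lam / 2 * (np.sum(X ** 2) + np.sum(W ** 2))
    dX = E @ W + lam * X           # gradient w.r.t. movie features
    dW = E.T @ X + lam * W         # gradient w.r.t. user weights
    db = E.sum(axis=0)             # gradient w.r.t. user biases
    return J, dX, dW, db

rng = np.random.default_rng(0)
n_m, n_u, n = 5, 4, 2                    # movies, users, features
Y = rng.uniform(0, 5, size=(n_m, n_u))   # toy ratings
R = np.ones((n_m, n_u))                  # pretend everything is rated
Ynorm = Y - Y.mean(axis=1, keepdims=True)

X = rng.normal(size=(n_m, n))            # initialize movie features
W = rng.normal(size=(n_u, n))            # per-user weights
b = np.zeros(n_u)                        # per-user biases
alpha, lam = 0.01, 0.1                   # learning rate, regularization

costs = []
for it in range(1000):                   # the "number of iterations"
    J, dX, dW, db = cofi_cost_and_grads(X, W, b, Ynorm, R, lam)
    X -= alpha * dX                      # the update the optimizer performs
    W -= alpha * dW
    b -= alpha * db
    costs.append(J)

print(costs[0] > costs[-1])  # True: the loss decreased
```

Here `tape.gradient` and `optimizer.apply_gradients` are replaced by the explicit derivative formulas and the subtraction steps, but the structure of the loop is the same.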