It’s pretty difficult to escape recommender systems in 2024. These days, we regularly get recommended things such as content from streaming services like Netflix and products from online shopping platforms like Amazon. Even if you don’t know the term recommender systems, there’s a good chance you understand what it does.
I first heard about recommender systems in the context of the Netflix prize which was an open competition in 2009 for best collaborative filtering (CF) algorithm. The challenge was to improve upon Netflix’s existing recommender system by at least 0.10 RMSE and the winner was awarded $1MM. Around that time, nearest neighbor techniques were popular CF methods however, the winners of the Netflix prize proved that matrix factorization (MF) models were superior. MF models have some similarity to singular value decomposition (SVD) but can handle sparsity often seen with recommender system datasets whereas SVD can’t.
To understand more about MF and recommender systems, in this post I’ll be creating an algorithm from scratch and evaluating it’s performance on the MovieLens 20M dataset (see here for more info). In particular, I’ll be implementing probabilistic matrix factorization (PMF) which was a seminal improvement over previous MF techniques because of it’s ability to handle sparsity and scale linearly with data (see paper here).
Intuition
Before diving into a mathematical derivation, let me provide some intuition about how MF works.
In a ratings scenario we have some number of n users and m movies. This can be represented with an n x m matrix where values of the matrix are the ranking a person gave to a movie. For simplicity, let’s say the value is a 1 if they watched and liked the movie otherwise it’s a 0. Most people will only watch a handful of movies so there’ll be many missing values in the matrix.
In the ratings matrix, each user can be described by the sparse vector of 1’s (mostly 0’s) that indicate the movies they liked (or the users that liked them in the case of movie vectors). But we can use a more compact representation similar to how one-hot encoded words are condensed via word embeddings. This allows for better comparison if we want to use something like cosine similarity to compare users, for example.
The dense representations of users and movies will form two smaller matrices, an n x d user matrix and an m x d movie matrix where d << n and d << m. You can see from a quick dimensionality analysis that the product of these two smaller matrices could be an approximation for the original ratings matrix. In fact, we could minimize the difference between the approximation and the true rating in an optimization process. Once we’ve done that, we would even have an estimate of the rating a user might have given a movie even if they didn’t watch it. And the movie with the highest likelihood of a positive rating, well we can recommend that to a user. So at a high level, that’s exactly what I’m going to do here with MF.
MAP and Objective Function
Time for more formal mathematical definitions. Like I mentioned above, we have \(n\) users and \(m\) movies as well as a ratings matrix \(R \in \mathbb{R}^{n \times m}\) such that \(R_{i,j}\) is the rating by user \(i\) for movie \(j\). The user latent matrix is \(U \in \mathbb{R}^{d \times n}\) and movie latent matrix \(V \in \mathbb{R}^{d \times m}\) where \(d\) is the latent space dimension and a tunable hyperparameter.
The author’s of the paper make an assumption on the distribution of \(R_{i,j}\), which is the product of latent vectors \(U_i\) and \(V_j\).
And the latent vectors \(U_i\) and \(V_j\) are also assumed to be normally distributed with zero mean and spherical covariance (dimensionality \(d\)). You don’t always hear this term “spherical covariance” but it essentially means all dimensions are independent with same mean and variance (i.e. IID). The variance is also tunable via the \(\lambda\) parameter (defined further below).
\[U_i \sim N(0, \sigma_u ^ 2 \mathbf{I})\]
\[V_j \sim N(0, \sigma_v ^ 2 \mathbf{I})\]
Now we can define the likelihood function for the entire ratings matrix
In a previous post I went through a process to derive a MAP estimate. This is similar. Here we can also approximate the posterior by the product of the likelihood and priors
and because summations are easier to work with than products, we take the log of the posterior (see paper for details) and maximize that to derive the MAP solution. After some re-arranging for a minimization objective, we would arrive at the following objective function as described in the paper.
where \(I_{i,j}\) is an indicator function if user \(i\) has rated movie \(j\), and \(\lambda_u = \frac{\sigma^2}{\sigma_u^2}\) and similarly \(\lambda_v = \frac{\sigma^2}{\sigma_v^2}\). The indicator function is an important addition in the objective function as it allows us to ignore the vast number of unrated movies for each user.
An interesting aside here is that the priors are essentially adding L2 regularization terms in the objective function. If we were to maximize the likelihood only (i.e. MLE) we wouldn’t have these terms.
Ok, so the paper gets us to this objective function but leaves out any description of a learning algorithm. So now we have to derive update rules in order to learn values for \(U\) and \(V\).
Derivation of Learning Algorithm
If you look at the objective function, you’ll see we have two unknown latent matrices \(U\) and \(V\). Because of this, the objective function is not convex. However, if one of the matrices is fixed, then the other turns into a convex optimization problem. So we can take turns with updates to \(U\) and \(V\) by performing an alternating coordinate descent. Note that in this context it is also referred to least squares optimization (ALS). These links by Facebook and Google were helpful for confirming this important optimization detail.
To find the update rules, we need to derive partial derivatives for \(U_i\) and \(V_j\) with respect to the objective function. So, time to dive into a little matrix calculus with a derivation for \(\frac{\partial E}{\partial U_i}\).
There is another interesting aside here about the priors. If we were to do MLE instead of MAP we’d get a similar update rule except for the \(\lambda_v \mathbf{I}\) term. If you’ve ever tried to implement your own least squares solution in closed form, you may remember that regression coefficients require inverting the matrix \((X^\intercal X)^{-1}\). Sometimes your data doesn’t behave well and you end up with a non-invertible matrix (because your determinant is zero). The \(\lambda_v \mathbf{I}\) term is essentially adding some values to the diagonal of the matrix which makes it less likely to have a zero determinant (i.e. more stable).
Ok, so armed with these update rules now we can dive into some code.
ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
ratings_df = pd.read_csv(
(10000054, 4)
userId
movieId
rating
timestamp
0
1
122
5.0
838985046
1
1
185
5.0
838983525
2
1
231
5.0
838983392
3
1
292
5.0
838983421
4
1
316
5.0
838983392
# ratings are in increments of 0.5ratings_df.rating.value_counts()
# sanity check - all movies have rating and all users have rated as expectedusers_not_rated = np.all(np.isnan(R), axis=1).sum()movies_not_rated = np.all(np.isnan(R), axis=0).sum()assert0== users_not_rated == movies_not_rated
# ratio missing valuesn, m = R.shapenans = np.isnan(R).sum()num_ratings = m * n - nansprint(f"total number of users: {n}")print(f"total number of movies: {m}")print(f"total number of movie ratings: {num_ratings}")print(f"total number of possible ratings: {n * m}")print(f"missing values in ratings matrix: {100*round(nans / (m * n), 3)}%")
total number of users: 69878
total number of movies: 10677
total number of movie ratings: 10000054
total number of possible ratings: 746087406
missing values in ratings matrix: 98.7%
# indices of ratingsrating_idxs = np.argwhere(np.isfinite(R)) # 2D array [[i ,j], ...]# randomly select 15% of rating indices to keep as testtrain_slice = np.random.choice(num_ratings, size=int(num_ratings *0.85), replace=False)test_slice = np.setdiff1d(np.arange(num_ratings), train_slice)train_idxs = rating_idxs[train_slice, :]test_idxs = rating_idxs[test_slice, :]train_idxs.shape, test_idxs.shape
((8500045, 2), (1500009, 2))
# fill test indices with nan in training dataR_train = R.copy()for i, j in test_idxs: R_train[i, j] = np.nanassert np.isnan(R_train).sum() == nans + test_idxs.shape[0]
# normalizing the training data to zero mean and sigma of 1 for simplicityii_train = train_idxs[:, 0].flatten()jj_train = train_idxs[:, 1].flatten()R_train_mu = R_train[ii_train, jj_train].mean()R_train_sigma = R_train[ii_train, jj_train].std()R_train = (R_train - R_train_mu) / R_train_sigmaprint(f"R train mu: {round(R_train_mu, 2)}\t R train sigma: {round(R_train_sigma, 2)}")
R train mu: 3.51 R train sigma: 1.06
# paper uses:# d=30 with original netflix dataset 100M+ ratings, ~500k users, ~20k movies# adaptive prior w/ spherical covariances lambda U = 0.01, lambda V = 0.001# constrained PMF lambda = 0.002d =30lambda_u =20# higher better, just some random experimentation herelambda_v =20sigma_r =1# np.std(R[~np.isnan(R)])sigma_u = np.sqrt(sigma_r**2/ lambda_u) # assume sigma=1 for ratings matrixsigma_v = np.sqrt(sigma_r**2/ lambda_v)# randomly generate values for user and movie latent matrices# U can be empty since updated firstU_mu = np.zeros(d)U_cov = np.identity(d) * sigma_u**2# sphericalU = np.random.multivariate_normal(U_mu, U_cov, size=n).TV_mu = np.zeros(d)V_cov = np.identity(d) * sigma_v**2V = np.random.multivariate_normal(V_mu, V_cov, size=m).T# D x M/NU.shape, V.shape
((30, 69878), (30, 10677))
# function to update U_i latent vectordef update_U_i(i, ratings_matrix, user_matrix, movie_matrix, lambda_u):# valid j (i.e. movies this user rated) j_idxs = np.argwhere(np.isfinite(ratings_matrix[i, :])).flatten()# skip if no movies rated in train setiflen(j_idxs) ==0:return# running summation sum_rv = np.zeros(d) # vector: weighted V_j by R_ij sum_vv = np.zeros((d, d)) # matrix: outer product V_jfor j in j_idxs: V_j = movie_matrix[:, j] sum_rv += V_j * ratings_matrix[i, j] sum_vv += np.outer(V_j, V_j) user_matrix[:, i] = sum_rv @ np.linalg.inv(sum_vv + np.identity(d) * lambda_u)def update_V_j(j, ratings_matrix, user_matrix, movie_matrix, lambda_v):# valid i (i.e. users that rated this movie) i_idxs = np.argwhere(np.isfinite(ratings_matrix[:, j])).flatten()# skip if no users rated in train setiflen(i_idxs) ==0:return sum_ru = np.zeros(d) # vector: weighted U_i by R_ij sum_uu = np.zeros((d, d)) # matrix: outer product U_ifor i in i_idxs: U_i = user_matrix[:, i] sum_ru += U_i * ratings_matrix[i, j] sum_uu += np.outer(U_i, U_i) movie_matrix[:, j] = sum_ru @ np.linalg.inv(sum_uu + np.identity(d) * lambda_v)
# use logistic function since ratings prediction is zero mean (same as paper)def sigmoid(x):return1/ (1+ np.exp(-x))# map sigmoid to range [1, 5] (same as paper)# there are 0.5 ratings so this isn't perfectdef sigmoid_to_rating(x, k=5):return (k -1) * x +1
The baseline RMSE that Netflix had achieved before the competition was 0.95. This beats that by a small margin but the dataset is different so it’s not an apples to apples comparison. Also, the range of ratings goes down to 0.5 in the MovieLens dataset whereas prediction mapping was consistent with the paper and Netflix dataset which had a range of [1, 5]. This was mostly an exercise in understanding PMF and implementing an algorithm based off an objective function in a research paper. But, if we wanted to squeeze out more performance there are several things we could do:
map predictions to [0.5, 5]
use adaptive covariance (biggest gains)
tune the lambda regularization coefficients (I started with 1 and only went up/down by an order of magnitude)
separate covariances for users and movies
diagonal covariance matrix (instead of spherical) to account for different variation between latent features