Dynamic recommendation and personalization systems are starting to become ubiquitous in industry. With the advent (and commoditization) of cutting edge ML and DL techniques, there is almost no reason not to leverage the data you have about your customers to provide them with more relevant prices, promotions, and content.

The Netflix ratings problem is no secret, nor is the success they’ve enjoyed with matrix factorization / vector similarity models. They’ve even gone so far as to say that their recommendation and personalization engine adds one billion dollars to their business annually:

"If one member in this tiny island expresses an interest for anime, then we're able to map that person to the global anime community", Carlos Gomez-Uribe, VP of product innovation at Netflix, told Tech Insider in February. 

In an academic paper penned by Gomez-Uribe and Netflix's Chief Product Officer Neil Hunt, they assert that "the combined effect of personalization and recommendations save us more than 1 billion per year."

The math

As a foundational model, we quickly spell out the classical SVD matrix factorization model for convenience.

Let $r_{ui}$ be the theoretical rating of the $i$th movie by the $u$th user. If this rating has been observed, then $r_{ui} \in R_{train}$. Then we model the unobserved ratings $\hat{r_{ui}}$ as:

\[\begin{equation*} r_{ui}=\mu+q_i^Tp_u \end{equation*}\]

The vectors $q_i$ and $p_u$ are the item and user vectors, respectively, that correspond to the reduced rank representation of the ratings matrix. The goal of the SVD model is to learn the set of vectors $q_i \in Q$ and $p_u in U$ such that we minimize the following loss function:

\[\begin{equation*} \min_{Q,P} \sum_{r_{ui} \in R_{train}}^{}{(r_{ui} - \hat{r}_{ui})^2 + \lambda(\|q_i\|^2 + \|p_u\|^2)} \end{equation*}\]

The left part of the equation attempts to minimize the squared loss subject to a regularization constraint on the parameters - this will make our algorithm less likely to overfit. An equivalent way to view the loss function is as a minimization of the reconstruction residuals when differencing the true ratings matrix $R$ against its SVD-reconstructed version:

\[\begin{equation*} \min_{Q,P} e_{ui}^{2} = r_{ui}-\sum_{k=1}^{l}{q_{ik}p_{ku}} \end{equation*}\]

with the regularization penalty tacked on at the end.

This minimization is typically performed with a very simple gradient descent step:

\[\begin{equation*} p_u \leftarrow p_u + \gamma(e_{ui}\cdot q_i - \lambda p_u)\\ q_i \leftarrow q_i + \gamma(e_{ui}\cdot p_u - \lambda q_i) \end{equation*}\]

where $p_u$ and $q_i$ are alternatingly updated, with learning rate $\gamma$.

Weaknesses with the basic approach

Unfortunately, in practice there are some weaknesses of the classical SVD approach:

  • No explicit inclusion of contextual features. The fundamental assumption of SVD is that if I look like someone (in terms of our watching behavior), I will continue to look like that person in the future. While this is a good assumption for frequent users that have watched many movies, there often isn’t enough information to discriminate between less frequent users.
    • For example, let’s say we have observed User A and B to have both watched Indiana Jones, and nothing else.
    • Despite there being clearly not enough data, the model will have the exact same scores for every other movie for these two users. Ideally we can augment these two users with more contextual information to better discriminate their behavior.
  • No explicit treatment of cold start. The SVD model can only work for users and movies that have existing data - it cannot infer average case behavior for new users or movies. For industries with a long tail of users or products, this can be a show-stopper.

Factorization Machines

Factorization machines are a powerful way to include contextual information in your model, gracefully handle cold start, while still reaping the benefits of matrix factorization and vector similarity.

Intuitively, factorization machines are a clever regression formulation of a second (or arbitrary) order vector space model. I won’t go into the mathematical details as it is quite dense. But we model the unobserved ratings in a similar way as before, albeit with a new notation:

\[\begin{equation*} r_{ui}=\mu + \sum_{i=1}^{n}{w_i x_i} + \sum_i^n \sum_{j=i+1}^n \langle v_i,v_j\rangle x_{i}x_{j} \end{equation*}\]

This essentially simplifies to the classical SVD formulation if we assume that only features $i$ and $u$ (corresponding to a user and item dummy) are nonzero for

\[\hat{r}_{ui}=w_0+w_i+w_u+\langle v_u,v_i \rangle\]

The really nice thing about factorization machines is: because they’re formulated as regression problems, they can be solved like regression problems! A variety of packages (libFM, fastFM, pyFM) exist that allow easy and quick prototyping of a factorization machine.

A quick experiment

We experiment with the 100K movie lens dataset here, comparing two algorithms:

  • Regularized SVD, solved with alternating least squares
  • Second-order regularized factorization machine

On top of historical ratings, the factorization machine leverages two additional sources of information:

  • User information (age, gender, occupation, zip)
  • Movie information (title, release date, categories)

A bootstrapped 2-fold cross validated estimate of the RMSE reveals that the factorization machine is 4-5% better than SVD:

Not as compelling as we’d like, but still a sizable improvement. And how did we train it? I won’t spell out all the gory details, but after loading the data, we converted the dataset into a CSC (compressed sparse column) matrix format. This ensures that we don’t waste memory and compute on zero elements (recall the gradient updates would be 0 on these elements):

from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer()
X_train = v.fit_transform(train_data)
X_test = v.transform(test_data)

Finally, in order to see if the movies that are being recommended actually make sense, we did a quick loop to spot check for some users:

  • Their top 10 historical movies
  • Their top 5 recommended unseen movies (as defined by the test set)

Below is some output for an example user. It reveals that the recommendations do make some sense, and generally speaking, score 4-5 for the unseen movies.

And for another user:

I swear I didn’t cherry pick here, these are two completely random users I pulled from the test set :-D It is comforting to see that the blue and orange bars generally line up, and that the recommended movies are generally in the same vein as the historical favorites.


Why Netflix thinks its personalized recommendation engine is worth $1 billion per year

Factorization Machines for Recommendation Systems