
This is essentially the same idea as principal component analysis: factorize the large N-by-M ratings matrix into “skinny” matrices of size N-by-k and k-by-M. When k is smaller than N and M, the product of the skinny matrices is only an approximation of the original matrix, and it can be computed via singular value decomposition.

Each of the $k$ dimensions is referred to as a factor. Both the rows (users) and columns (items) of the data can then be represented in this latent feature space. Finding the $k$ factors is equivalent to performing SVD, which has complexity $O(NM^2 + M^3)$ for an N-by-M matrix.
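As a concrete illustration, a rank-$k$ approximation can be computed with NumPy's SVD on a small, fully observed matrix. This is only a sketch; the sizes and the random stand-in data are assumptions.

```python
# Minimal sketch: rank-k approximation of an N-by-M matrix via truncated SVD.
import numpy as np

N, M, k = 100, 50, 10          # illustrative sizes; k is the number of factors
R = np.random.rand(N, M)       # stand-in for a fully observed N-by-M ratings matrix

U, s, Vt = np.linalg.svd(R, full_matrices=False)   # R = U @ diag(s) @ Vt
P = U[:, :k] * s[:k]           # N-by-k "skinny" user-factor matrix
Q = Vt[:k, :]                  # k-by-M item-factor matrix
R_approx = P @ Q               # best rank-k approximation in the least-squares sense
print(np.linalg.norm(R - R_approx))   # reconstruction error
```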

$ r_{ui} \equiv q_i^T p_u $
$ \min_{q,p} \left\{ \sum_{(u,i) \in R} \left( r_{ui} - q_i^T p_u \right)^2 \right\} $
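A minimal sketch of this objective, evaluated only over the observed $(u,i)$ pairs. The dictionary representation of the ratings and the array names `P`, `Q` are assumptions made here for illustration.

```python
import numpy as np

def squared_error(R_observed, P, Q):
    """Sum of squared errors over the observed ratings only.
    R_observed: dict mapping (u, i) -> rating r_ui.
    P: N-by-k array of user factors p_u; Q: M-by-k array of item factors q_i."""
    return sum((r_ui - Q[i] @ P[u]) ** 2 for (u, i), r_ui in R_observed.items())

# tiny illustrative example
P = np.random.rand(3, 2)                   # 3 users, k = 2
Q = np.random.rand(4, 2)                   # 4 items
R_observed = {(0, 1): 4.0, (2, 3): 2.5}    # only two ratings are observed
print(squared_error(R_observed, P, Q))
```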

There are systematic effects that have to be taken out:

  * $\mu$: the overall mean rating
  * $b_u$: the mean rating offset for user $u$
  * $b_i$: the mean rating offset for movie $i$

These systematic effects can be modeled as baseline predictors.

The new prediction model becomes $r_{ui} \equiv \mu + b_u + b_i + q_i^T p_u$, where the final term captures the user-movie interactions.
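One simple (unregularized) way to estimate the baseline predictors is to use per-user and per-item mean deviations from the overall mean. A sketch, assuming the ratings are stored as a dict keyed by $(u, i)$:

```python
from collections import defaultdict

def baselines(R_observed):
    """R_observed: dict mapping (u, i) -> rating r_ui."""
    mu = sum(R_observed.values()) / len(R_observed)          # overall mean rating
    user_r, item_r = defaultdict(list), defaultdict(list)
    for (u, i), r in R_observed.items():
        user_r[u].append(r)
        item_r[i].append(r)
    b_u = {u: sum(rs) / len(rs) - mu for u, rs in user_r.items()}   # user offsets
    b_i = {i: sum(rs) / len(rs) - mu for i, rs in item_r.items()}   # item offsets
    return mu, b_u, b_i

def predict_baseline(mu, b_u, b_i, u, i):
    # baseline prediction only, ignoring the user-movie interaction term
    return mu + b_u.get(u, 0.0) + b_i.get(i, 0.0)
```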

Regularization turns out to be very important. It assists learning, and it also shrinks estimates based on incomplete data toward the center of the latent space.
$ \min_{q,p,b} \left\{ \sum_{(u,i) \in R} \left( r_{ui} - [\mu + b_u + b_i + q_i^T p_u] \right)^2  + \lambda \left( |q_i|^2 + |p_u|^2 + |b_u|^2 + |b_i|^2 \right) \right\} $
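A sketch of this regularized objective, assuming the penalty is applied per observed rating and that the factors and biases are already given (the names and data layout follow the earlier sketches):

```python
def regularized_loss(R_observed, P, Q, mu, b_u, b_i, lam):
    """Regularized squared-error objective over the observed ratings."""
    loss = 0.0
    for (u, i), r_ui in R_observed.items():
        pred = mu + b_u[u] + b_i[i] + Q[i] @ P[u]          # full prediction model
        loss += (r_ui - pred) ** 2                          # data-fit term
        loss += lam * (Q[i] @ Q[i] + P[u] @ P[u]            # penalty on factors
                       + b_u[u] ** 2 + b_i[i] ** 2)         # penalty on biases
    return loss
```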

One method for learning the approximation components is gradient descent. The parameters to learn are the biases $\mu, b_u, b_i$.

The matrix factorization can also be learned with gradient descent. Rather than using the linear algebra SVD methods to do it, the factors are treated as an additional set of $k(N+M)$ parameters to learn.
Update equations with regularization, where $e_{ui} = r_{ui} - \hat{r}_{ui}$ is the prediction error and $\gamma$ is the learning rate, are:

  * $p_u^{new} = p_u^{old} + \gamma \left( e_{ui} q_i - \lambda p_u^{old} \right)$
  * $q_i^{new} = q_i^{old} + \gamma \left( e_{ui} p_u^{old} - \lambda q_i^{old} \right)$
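Putting the pieces together, a minimal stochastic gradient descent sketch that learns the biases and factors jointly. The learning rate $\gamma$, regularization $\lambda$, initialization scale, and epoch count are illustrative assumptions.

```python
import numpy as np

def train_sgd(R_observed, N, M, k=10, gamma=0.01, lam=0.05, n_epochs=20):
    """Stochastic gradient descent for the regularized matrix factorization model.
    R_observed: dict mapping (u, i) -> rating r_ui, with u in [0, N) and i in [0, M)."""
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((N, k))    # user factors p_u
    Q = 0.1 * rng.standard_normal((M, k))    # item factors q_i
    mu = sum(R_observed.values()) / len(R_observed)   # overall mean rating
    b_u = np.zeros(N)                        # user biases
    b_i = np.zeros(M)                        # item biases
    for _ in range(n_epochs):
        for (u, i), r_ui in R_observed.items():
            pred = mu + b_u[u] + b_i[i] + Q[i] @ P[u]
            e = r_ui - pred                              # prediction error e_ui
            b_u[u] += gamma * (e - lam * b_u[u])         # bias updates (same pattern
            b_i[i] += gamma * (e - lam * b_i[i])         #  as the factor updates)
            P[u], Q[i] = (P[u] + gamma * (e * Q[i] - lam * P[u]),   # factor updates,
                          Q[i] + gamma * (e * P[u] - lam * Q[i]))   #  using old values
    return mu, b_u, b_i, P, Q
```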