Michael Data

People tend to rate things that they like, which introduces a selection bias into ratings data. Given data from past ratings or purchase records, we want to recommend similar items.

e.g. Amazon, Pandora, StumbleUpon, Netflix, Last.fm, Google, YouTube, Xbox Live.


Data with users $u$ and items $i$.
Can represent the data as an $N \times M$ sparse matrix, where $N$ is the number of users and $M$ is the number of items.
The data can be binary and represent whether the user has purchased or otherwise somehow interacted with the item. Alternatively, the data can also represent explicit user ratings or other granular information.
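A minimal sketch of this representation, with illustrative toy triples; a real system would use a sparse format (e.g. `scipy.sparse`) rather than a dense array:

```python
import numpy as np

# Toy (user_index, item_index, rating) triples; values are illustrative.
triples = [(0, 1, 5.0), (0, 3, 3.0), (1, 1, 4.0), (2, 2, 1.0)]

N, M = 3, 4  # N users, M items
R = np.zeros((N, M))  # dense here for readability; use a sparse format at scale
for u, i, r in triples:
    R[u, i] = r
```

For binary interaction data, the same layout holds with entries in {0, 1} instead of rating values.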

May also have auxiliary data on users e.g. social network or demographic information, and data on items such as similar content, attributes, cross references.

Challenges Common to Recommender Systems

Data sparsity

Users with very little historical data and few user attributes.
Items with little or no content information.

The Cold Start problem

Extreme case of sparse data: How do you make recommendations about new users or new items?
Different algorithms deal with this problem differently.

Historical Data

It isn't clear that historical data, especially purchase data, is a meaningful way to evaluate recommender algorithms. e.g. recommending pizza to the person buying beer, when the two are almost always purchased together anyway.

Gathering Ratings Data

Explicit Ratings

More difficult to get from users. They often don't want to spend time assigning ratings, and humans are not very good at assigning consistent numerical ratings.

Bias: some users may have consistently lower ratings than others.
Variance: users may assign different ratings to the same item at different times.
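One common correction for per-user bias (a standard technique, not prescribed by these notes) is to mean-center each user's ratings so that users who rate consistently low become comparable to users who rate consistently high. A minimal sketch, assuming 0 marks an unrated item:

```python
import numpy as np

R = np.array([[5.0, 4.0, 0.0],    # a generous rater
              [2.0, 1.0, 0.0]])   # a harsh rater; 0 marks "unrated"
mask = R > 0

# Per-user mean computed over rated items only
user_mean = R.sum(axis=1) / mask.sum(axis=1)

# Centered ratings: both users now express the same relative preference
R_centered = np.where(mask, R - user_mean[:, None], 0.0)
```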

Implicit Ratings

e.g. counts of user views of an item profile page. May not be an accurate representation of the user's interest in the item.

Evaluation Metrics
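The squared-error metric used in the Netflix Prize (below) corresponds to root-mean-squared error (RMSE) on held-out ratings; a minimal sketch:

```python
import numpy as np

def rmse(predicted, actual):
    """Root-mean-squared error over held-out (user, item) ratings."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))
```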
General Approaches

Content-based Recommendations

Use attributes and features of the items to recommend similar items. This approach is to recommend items to user $U$ that are similar to the items $U$ is known to already like. Similarity is computed from item attributes. The similarity function and features are critical to success.
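As a sketch of this idea, item-item similarity from attribute vectors is often computed with cosine similarity; the item names and genre-indicator features below are hypothetical:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical item attribute vectors (e.g. genre indicators)
item_features = {
    "item_a": np.array([1.0, 1.0, 0.0]),
    "item_b": np.array([1.0, 0.0, 0.0]),
    "item_c": np.array([0.0, 0.0, 1.0]),
}

# Recommend the item most similar to one the user is known to like
liked = "item_a"
scores = {i: cosine(item_features[liked], f)
          for i, f in item_features.items() if i != liked}
best = max(scores, key=scores.get)
```

The choice of similarity function and features is exactly where this approach succeeds or fails, as noted above.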

The content-based approach is more robust to the cold start problem because it is easier to gather basic information about items.

Collaborative filtering

Use the ratings matrix to recommend items. Two broad types:

  • Nearest-neighbor methods
  • Matrix factorization methods

Collaborative filtering works for almost any kind of item without feature selection. It is also simple to implement. Good to use as a first attempt.

Not good for a “cold start”. It also suffers from popularity biases, recommending things that are popular overall and fumbling over users with unique tastes.

Hybrid Methods

Use content data and collaborative filtering together.

May simply implement both methods and combine their scores with a linear combination. Learn the weights by cross-validation.
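A minimal sketch of such a linear blend; the weight `alpha` here is illustrative, and in practice would be learned by cross-validation as described:

```python
def hybrid_score(content_score, collab_score, alpha=0.3):
    """Blend a content-based score and a collaborative-filtering score.

    alpha is a placeholder weight; cross-validate to choose it.
    """
    return alpha * content_score + (1 - alpha) * collab_score
```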

Common Techniques

Nearest-Neighbor Algorithms

Use nearest neighbors to find users similar to the target user. Over the items that have been rated by both users, compute a similarity weight $w_{a,u}$, e.g. the linear correlation coefficient between the two users' rating vectors. Then find the $K$ most similar users.
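The steps above can be sketched as follows, using the Pearson correlation over co-rated items as the similarity weight (toy ratings; 0 marks an unrated item):

```python
import numpy as np

def pearson_sim(ra, ru):
    """Correlation over items rated by both users (0 marks unrated)."""
    both = (ra > 0) & (ru > 0)
    if both.sum() < 2:            # not enough co-rated items
        return 0.0
    a, u = ra[both], ru[both]
    if a.std() == 0 or u.std() == 0:
        return 0.0                # correlation undefined for constant vectors
    return float(np.corrcoef(a, u)[0, 1])

R = np.array([[5.0, 3.0, 4.0, 0.0],
              [4.0, 2.0, 5.0, 1.0],
              [1.0, 5.0, 0.0, 4.0]])
target = 0
sims = [(u, pearson_sim(R[target], R[u])) for u in range(len(R)) if u != target]

K = 1
neighbors = sorted(sims, key=lambda t: -t[1])[:K]  # K most similar users
```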

Another approach is to use item-item similarity. Unlike user-user similarity, this can be computed offline, when users are not waiting for a fast response.

Research indicates that item-item methods produce better results than user-user methods. There is no established theory to explain this, but several hypotheses: there may be more stability across items than across users; there may be more idiosyncratic users than items; and the item dimension is smaller, since there are fewer items than users.
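A sketch of the offline item-item approach: precompute an item-item similarity matrix from the ratings matrix (plain cosine similarity between item columns here, on toy data), then look up similar items at query time:

```python
import numpy as np

R = np.array([[5.0, 3.0, 0.0],
              [4.0, 2.0, 1.0],
              [1.0, 0.0, 5.0]])   # toy users x items ratings

# Precompute an M x M item-item cosine similarity matrix offline.
norms = np.linalg.norm(R, axis=0)
S = (R.T @ R) / np.outer(norms, norms)
np.fill_diagonal(S, 0.0)          # ignore self-similarity

# At query time, recommend items similar to ones the user rated highly.
most_similar_to_item0 = int(np.argmax(S[0]))
```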

Matrix Factorization
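As a minimal sketch of the idea: approximate the ratings matrix as a product of low-rank user and item factor matrices, $R \approx PQ^\top$ with $k$ latent factors, fit by stochastic gradient descent on the observed entries only (toy data; hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0]])   # toy ratings; 0 marks "unrated"
observed = [(u, i) for u in range(3) for i in range(3) if R[u, i] > 0]

k, lr, reg = 2, 0.05, 0.02            # latent factors, step size, L2 penalty
P = 0.1 * rng.standard_normal((3, k))  # user factors
Q = 0.1 * rng.standard_normal((3, k))  # item factors

for epoch in range(500):
    for u, i in observed:
        err = R[u, i] - P[u] @ Q[i]
        pu = P[u].copy()              # use pre-update P[u] in the Q step
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * pu - reg * Q[i])

pred = P @ Q.T   # the same inner products also fill in the unobserved cells
```

The predictions for unrated cells come from the learned factors, which is what makes the factorization useful for recommendation.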

User Experience
Netflix Prize

In 2006, Netflix launched the Netflix Prize, a 1-million-dollar competition for movie recommendation. The winner had to beat Netflix's in-house method by 10% on a squared-error metric. About 40,000 teams from around the world entered.

The data contained 100 million ratings of 17,700 movies by 480,000 users. The users were chosen randomly from those with at least 20 ratings each. Some noise was added to the data to help with anonymity. The most recent ratings from some of the users were withheld for testing.

The contest led to greater interest and familiarity with large data sets.

  • One user had rated thousands of movies but had a mean rating score of 1.2

A 50,000-dollar progress prize was given to the leading team in each year of the competition, which ran from 2006 to 2009. The winning team had to provide a 10% improvement over Netflix's own system in order to win the final 1-million-dollar prize. The first submission to beat that 10% threshold triggered a final 30-day period for others to compete before the competition ended. The winning team was required to grant Netflix a non-exclusive license to the software and to publicly publish the winning algorithm.

Netflix ran this competition at some risk. There would have been bad publicity if the prize was never won, and embarrassment if it was won within the first few days. There was also the risk that the winning algorithms would not be practically useful.

General machine learning lessons:
Model averaging was useful, including linear model combinations, neural network combinations, and gradient boosted decision trees. Latent factor models were very useful on this problem.

Openness lessons: the public leaderboard and forum encouraged collaboration and continued effort. Privacy can't be taken too seriously.