“Where the money is made.” -Padhraic Smyth

Most learning algorithms can be described as a combination of:

* Model

* Objective Function

* Optimization Method

Classification

* Spam email

* Classify sentiment of a product review

Regression

* Predict a real-valued number

Ranking

* Find most likely candidates from a group

Training data: (features, labels) pairs typically represented in a table.

Want to learn a **model** to predict a label given features.

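The model is typically represented as a function $f(x; \theta)$, where $x$ are the input features and $\theta$ is a vector of parameters.

The quality of a model is evaluated with an **error function**.

* e.g. sum of squared error: $\sum_i \left(y_i - f(x_i; \theta)\right)^2$, where $y_i$ is the label of training example $i$.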

Goal is to minimize the total error on training data. This is an optimization problem. There is occasionally a direct solution via e.g. linear algebra. Typically a gradient approach of some kind is needed.
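As a small illustration of the direct-solution case, the sketch below fits a one-feature linear model with the closed-form least-squares formulas and computes the sum of squared error. The toy data and the names `fit_line` and `sum_squared_error` are invented for this example.

```python
def sum_squared_error(xs, ys, slope, intercept):
    """Total squared error of the line y = slope * x + intercept on the data."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def fit_line(xs, ys):
    """Closed-form least-squares fit of a line to a single input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy training data (made up): roughly y = 2x + 1 with a little noise.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

slope, intercept = fit_line(xs, ys)
fitted_error = sum_squared_error(xs, ys, slope, intercept)
guess_error = sum_squared_error(xs, ys, 2.0, 1.0)  # a hand-picked alternative
print(slope, intercept, fitted_error, guess_error)
```

The fitted parameters should give a training error no larger than any hand-picked line, since they minimize the squared error by construction.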

Minimizing training error doesn't necessarily give the best possible prediction of future data. When error on held-out test data increases while training error keeps decreasing, the model is **overfitting**. One way to control overfitting is to switch to a simpler model.
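A toy sketch of the gap between training and test error, under invented data: a model that memorizes the training set (1-nearest-neighbour, chosen here only for illustration) reaches zero training error, while a simpler linear model does better on held-out points. All names and numbers below are assumptions made for the example.

```python
def mean_squared_error(model, xs, ys):
    """Average squared difference between labels and model predictions."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_linear_model(xs, ys):
    """Simple model: least-squares line through the training data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    """Flexible model: return the label of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


# Made-up data: the underlying trend is y = 2x, with noise on every label.
train_xs, train_ys = [0.0, 1.0, 2.0, 3.0, 4.0], [0.5, 1.5, 4.5, 5.5, 8.5]
test_xs, test_ys = [0.4, 1.4, 2.4, 3.4], [0.6, 3.0, 4.6, 7.0]

linear = fit_linear_model(train_xs, train_ys)
memorizer = fit_memorizer(train_xs, train_ys)

for name, model in [("linear", linear), ("memorizer", memorizer)]:
    train_mse = mean_squared_error(model, train_xs, train_ys)
    test_mse = mean_squared_error(model, test_xs, test_ys)
    print(name, "train:", train_mse, "test:", test_mse)
```

On this data the memorizer fits the noise in the training labels exactly, so its training error says little about how it will do on new points.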

In practice, predictive models are limited by the Bias-Variance Tradeoff.

Linear weighted sums of the input variables (linear regression).

Non-linear functions of linear weighted sums (logistic regression, neural networks, GLMs).

Thresholded functions (decision trees).
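The three model families above can be sketched as small prediction functions. This is a minimal illustration, not a training procedure: the parameter values are picked by hand, and the decision stump (a depth-one decision tree) stands in for the thresholded family.

```python
import math


def linear_model(features, weights, bias):
    """Linear weighted sum of the input variables."""
    return bias + sum(w * x for w, x in zip(weights, features))


def logistic_model(features, weights, bias):
    """Non-linear (sigmoid) function of a linear weighted sum."""
    z = linear_model(features, weights, bias)
    return 1.0 / (1.0 + math.exp(-z))


def decision_stump(features, feature_index, threshold, below, above):
    """Thresholded function: compare one feature against a cut point."""
    return below if features[feature_index] < threshold else above


example = [1.0, 2.0]
print(linear_model(example, [0.5, -1.0], 0.25))
print(logistic_model(example, [0.5, -1.0], 0.25))
print(decision_stump(example, 1, 1.5, "low", "high"))
```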

To improve a model, you first need to measure its performance.

Compare to a baseline. Relative to the baseline's error rate, you can measure the reduction in error provided by switching to the model in question. Want to establish that the reduction in error is not due to random chance.

e.g. in classification, the simplest baseline is to always predict the most likely class, ignoring the input features.
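A minimal sketch of the majority-class baseline and the relative reduction in error, assuming made-up spam/ham labels and a hypothetical list of model predictions:

```python
from collections import Counter


def error_rate(true_labels, predicted_labels):
    """Fraction of examples where the prediction differs from the label."""
    mistakes = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return mistakes / len(true_labels)


def majority_class(train_labels):
    """Most frequent label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


# Invented labels for illustration.
train_labels = ["ham"] * 7 + ["spam"] * 3
test_labels = ["ham", "ham", "spam", "ham", "spam",
               "ham", "ham", "spam", "ham", "ham"]
# Hypothetical output of some trained classifier on the test set.
model_predictions = ["ham", "ham", "spam", "ham", "ham",
                     "ham", "ham", "spam", "ham", "ham"]

# The baseline ignores the features and always predicts the same class.
baseline_predictions = [majority_class(train_labels)] * len(test_labels)

baseline_error = error_rate(test_labels, baseline_predictions)
model_error = error_rate(test_labels, model_predictions)
relative_reduction = (baseline_error - model_error) / baseline_error
print(baseline_error, model_error, relative_reduction)
```

This sketch only measures the reduction; it does not test whether the reduction could be due to chance, which would need a larger test set and a significance test.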

Alternatively, examining a confusion matrix can reveal mistakes or patterns in the classifier's predictions.
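A confusion matrix is a table of counts for each (true label, predicted label) pair. A small sketch with invented labels, where `confusion_matrix` is a helper defined here rather than a library call:

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


# Invented labels and predictions for illustration.
true_labels = ["ham", "ham", "spam", "ham", "spam", "ham", "spam"]
predicted = ["ham", "spam", "spam", "ham", "ham", "ham", "spam"]

counts = confusion_matrix(true_labels, predicted)
classes = sorted(set(true_labels) | set(predicted))

# One row per true class, one column per predicted class.
for true_class in classes:
    print(true_class, [counts[(true_class, p)] for p in classes])
```

Off-diagonal cells show which classes get confused with which, something a single error rate hides.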

Regression

* Squared Error (L2)

* Absolute Error (L1)

* Robust loss, log-loss, log-likelihood

Classification

* Classification error

* Margin

* Log-loss, log-likelihood
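The objective functions listed above can be written as per-example losses. The sketch below makes a few assumptions that the notes do not state: the Huber loss is used as one example of a robust loss, the hinge loss as one example of a margin-based loss, and labels are encoded as -1/+1 for the hinge loss and 0/1 for the log-loss.

```python
import math


def squared_error(y, prediction):
    """L2 loss for regression."""
    return (y - prediction) ** 2


def absolute_error(y, prediction):
    """L1 loss for regression."""
    return abs(y - prediction)


def huber_loss(y, prediction, delta=1.0):
    """Robust loss: quadratic near zero, linear for large residuals."""
    residual = abs(y - prediction)
    if residual <= delta:
        return 0.5 * residual ** 2
    return delta * (residual - 0.5 * delta)


def zero_one_loss(y, predicted_label):
    """Classification error for a single example."""
    return 0 if y == predicted_label else 1


def hinge_loss(y, score):
    """Margin-based loss; y is -1 or +1 and score is a real number."""
    return max(0.0, 1.0 - y * score)


def log_loss(y, probability):
    """Negative log-likelihood; y is 0 or 1, probability is P(y = 1)."""
    return -math.log(probability) if y == 1 else -math.log(1.0 - probability)


print(squared_error(3.0, 1.0), absolute_error(3.0, 1.0), huber_loss(3.0, 1.0))
print(zero_one_loss("spam", "ham"), hinge_loss(1, 0.25), log_loss(1, 0.5))
```

Comparing the first three on the same residual shows why the choice matters: squared error penalizes large residuals much more heavily than absolute or Huber loss, so it is more sensitive to outliers.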

Frequently use gradient descent.
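A minimal sketch of batch gradient descent, assuming a one-feature linear model and mean squared error as the objective. The data, the learning rate, and the step count are arbitrary choices for this toy problem, not recommended settings.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Minimize mean squared error of y = slope * x + intercept."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current model on every training example.
        residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_intercept = 2.0 / n * sum(residuals)
        # Step against the gradient.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Toy training data (made up): roughly y = 2x + 1 with a little noise.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

gd_slope, gd_intercept = gradient_descent(xs, ys)
print(gd_slope, gd_intercept)
```

For this objective the iterates approach the least-squares solution; with too large a learning rate the updates would diverge instead, so the step size usually needs tuning.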