Michael Data

“Where the money is made.” -Padhraic Smyth
Most learning algorithms can be described as a combination of:
* Model
* Objective Function
* Optimization Method

\begin{tabular}{c c c c}
Method & Model & Objective Function & Optimization Method \\ \hline
Linear Regression & Weighted Sum & Squared Error & Linear systems of equations \\
Logistic Regression & h(Weighted Sum) & Log-Likelihood & Iterative, system of equations \\
Neural Network & Weighted sum of logistic regressions & Squared Error & Gradient-based \\
Support Vector Machine & Sparse weighted sum & Margin & Convex optimization \\
Decision Tree & Binary tree & Classification Error & Greedy search over trees \\
\end{tabular}


* Spam email
* Classify sentiment of a product review

* Predict a real-valued number

* Find most likely candidates from a group

or Ensemble Methods


Training data: (features, labels) pairs typically represented in a table.

Want to learn a model to predict a label given features.
The model is typically represented as a function $f(x;\alpha)$ where $x$ is the vector of input features and $\alpha$ is a vector of parameters.

The quality of a model is evaluated with an error function.
* e.g. sum of squared errors: $E_{train}(\alpha) = \sum_i \left[y_i - f(x_i;\alpha)\right]^2$.
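
The error function above can be computed directly. The following is a minimal sketch, assuming a hypothetical model `f` that is a weighted sum with an intercept stored in `alpha[0]`; the function names and the toy data are illustrative and not part of these notes.

```python
def f(x, alpha):
    """Hypothetical model: intercept alpha[0] plus a weighted sum of the features."""
    return alpha[0] + sum(a * xi for a, xi in zip(alpha[1:], x))


def training_error(alpha, xs, ys):
    """Sum of squared errors E_train(alpha) over all (x_i, y_i) pairs."""
    return sum((y - f(x, alpha)) ** 2 for x, y in zip(xs, ys))


# Toy training table (made up): one feature per row, labels follow y = 1 + 2x.
xs = [(0.0,), (1.0,), (2.0,)]
ys = [1.0, 3.0, 5.0]

print(training_error([1.0, 2.0], xs, ys))  # parameters that fit exactly: 0.0
print(training_error([0.0, 2.0], xs, ys))  # every residual is 1: 3.0
```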

Empirical Learning

Goal is to minimize the total error on training data. This is an optimization problem. There is occasionally a direct solution via e.g. linear algebra. Typically a gradient approach of some kind is needed.
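
As an illustration of a direct solution, squared error for a straight line with one feature can be minimized in closed form. This is a minimal sketch under that assumption; `fit_line` and the data are hypothetical names and values chosen for the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x using the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Setting the gradient of the squared error to zero yields these two sums.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

With more features the same idea becomes a linear system of equations; models without such a closed form need an iterative method.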

Generalization to New Data


Minimizing training error does not necessarily give the best possible prediction of future data. When test error rises while training error keeps falling, the model is overfitting. Overfitting can be controlled by, for example, switching to a simpler model.
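
The gap between training error and test error can be seen on a small example. The sketch below is an assumption-laden illustration, not a method from these notes: it compares a 1-nearest-neighbor predictor, which memorizes the training set, with a straight-line fit, on hand-made data generated from $y = 2x$ plus alternating noise of $\pm 1$.

```python
def fit_line(xs, ys):
    """Closed-form least-squares line; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def nearest_neighbor_predict(train_xs, train_ys, x):
    """Flexible model: return the label of the closest training point."""
    nearest = min(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x))
    return train_ys[nearest]


def sse(ys, preds):
    """Sum of squared errors between labels and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds))


# Training labels: y = 2x with noise +1, -1, +1, ... (made-up data).
train_xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_ys = [1.0, 1.0, 5.0, 5.0, 9.0, 9.0]
# Test labels: noise-free y = 2x at new inputs.
test_xs = [0.25, 1.25, 2.25, 3.25, 4.25]
test_ys = [0.5, 2.5, 4.5, 6.5, 8.5]

intercept, slope = fit_line(train_xs, train_ys)
line_train = sse(train_ys, [intercept + slope * x for x in train_xs])
line_test = sse(test_ys, [intercept + slope * x for x in test_xs])
nn_train = sse(train_ys, [nearest_neighbor_predict(train_xs, train_ys, x) for x in train_xs])
nn_test = sse(test_ys, [nearest_neighbor_predict(train_xs, train_ys, x) for x in test_xs])

print("line:             train", line_train, "test", line_test)
print("nearest neighbor: train", nn_train, "test", nn_test)
```

On this data the memorizing model has zero training error but the larger test error, while the simpler line does the opposite.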

Bias and Variance

In practice, predictive models are limited by the Bias-Variance Tradeoff.
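
For reference, the tradeoff is usually stated for squared error. The decomposition below is the standard one; the symbols are assumptions that do not appear in these notes: $g$ is the true function, labels are $y = g(x) + \epsilon$ with zero-mean noise of variance $\sigma^2$, $\hat{\alpha}$ denotes the parameters fit to a random training set, and the expectation is over training sets and noise.

$$E\left[\left(y - f(x;\hat{\alpha})\right)^2\right] = \left(E\left[f(x;\hat{\alpha})\right] - g(x)\right)^2 + E\left[\left(f(x;\hat{\alpha}) - E\left[f(x;\hat{\alpha})\right]\right)^2\right] + \sigma^2$$

The three terms are the squared bias, the variance, and the irreducible noise. Simpler models tend to raise the first term and lower the second; more flexible models tend to do the reverse.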


Linear weighted sums of the input variables (linear regression).
Non-linear functions of linear weighted sums (logistic regression, neural networks, GLMs).
Thresholded functions (decision trees).
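
The three model families above can be sketched as plain functions. This is only an illustration: the function names, the choice of the logistic function as the non-linearity, and the single-split "stump" standing in for a decision tree are all assumptions.

```python
import math


def weighted_sum(x, w, b):
    """Linear model: intercept b plus a weighted sum of the inputs."""
    return b + sum(wi * xi for wi, xi in zip(w, x))


def logistic(x, w, b):
    """Non-linear function of a weighted sum, as in logistic regression."""
    return 1.0 / (1.0 + math.exp(-weighted_sum(x, w, b)))


def stump(x, feature, threshold, left, right):
    """Thresholded function: one split of a decision tree on a single feature."""
    return left if x[feature] <= threshold else right


x = [1.0, 2.0]
print(weighted_sum(x, [3.0, 4.0], 5.0))    # 5 + 3*1 + 4*2 = 16.0
print(logistic([0.0], [1.0], 0.0))         # logistic of 0 is 0.5
print(stump(x, 1, 1.5, "no", "yes"))       # x[1] = 2.0 > 1.5, so "yes"
```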

To improve a model, you first need a way to measure its performance.
Compare to a baseline. Relative to the baseline's error rate, you can measure the reduction in error provided by switching to the model in question. You also want to establish that the reduction in error is not due to random chance.
E.g. in classification, the simplest baseline is to always predict the most likely class, ignoring $x$.
Alternatively, examining a confusion matrix can explain mistakes or patterns in the classifier.
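
A majority-class baseline and a confusion matrix are both short to compute. The sketch below assumes made-up labels and a hypothetical classifier output `y_pred`; it does not test whether the improvement could be due to chance, which would need a separate significance test.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label; the baseline always predicts it."""
    return Counter(labels).most_common(1)[0][0]


def error_rate(y_true, y_pred):
    """Fraction of examples where the prediction differs from the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_matrix(y_true, y_pred):
    """Counts keyed by (true label, predicted label)."""
    return Counter(zip(y_true, y_pred))


# Made-up labels; in practice the majority class comes from the training labels.
y_true = ["ham", "ham", "ham", "spam", "spam", "ham"]
y_pred = ["ham", "ham", "ham", "spam", "ham", "ham"]  # hypothetical classifier output

baseline_label = majority_baseline(y_true)
baseline_error = error_rate(y_true, [baseline_label] * len(y_true))
model_error = error_rate(y_true, y_pred)
relative_reduction = (baseline_error - model_error) / baseline_error

print(baseline_label, baseline_error, model_error, relative_reduction)
for (true_label, predicted_label), count in sorted(confusion_matrix(y_true, y_pred).items()):
    print(true_label, "->", predicted_label, ":", count)
```

Here the confusion matrix shows that the only mistake is a spam message predicted as ham, which an overall error rate alone would not reveal.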

Objective Functions

* Squared Error (L2)
* Absolute Error (L1)
* Robust loss, log-loss, log-likelihood

* classification error
* margin
* log-loss, log-likelihood
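
The objective functions listed above can be written as per-example losses. This is a minimal sketch with assumed conventions: binary labels are 0/1 for log-loss and -1/+1 for the margin, and the hinge loss is used here as one common margin-based loss, which these notes do not name explicitly.

```python
import math


def squared_error(y, pred):
    """L2 loss for a real-valued prediction."""
    return (y - pred) ** 2


def absolute_error(y, pred):
    """L1 loss for a real-valued prediction."""
    return abs(y - pred)


def classification_error(y, label):
    """0 if the predicted label is correct, 1 otherwise."""
    return 0 if y == label else 1


def log_loss(y, p):
    """Negative log-likelihood of y in {0, 1} given predicted P(y = 1) = p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def hinge_loss(y, score):
    """Margin-based loss for y in {-1, +1}; zero once the margin y * score reaches 1."""
    return max(0.0, 1.0 - y * score)


print(squared_error(3.0, 1.0))       # 4.0
print(absolute_error(3.0, 1.0))      # 2.0
print(classification_error("spam", "ham"))  # 1
print(log_loss(1, 0.5))              # log(2)
print(hinge_loss(-1, 0.5))           # 1.5
```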

Optimization Methods

Frequently use gradient descent.
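
A minimal sketch of gradient descent, assuming a straight-line model with one feature and the sum of squared errors as the objective. The step size, iteration count, and data are illustrative choices; a step that is too large for the data would make the updates diverge.

```python
def gradient_descent(xs, ys, step=0.02, iterations=2000):
    """Minimize sum_i (y_i - (a0 + a1 * x_i))^2 by repeated gradient steps."""
    a0, a1 = 0.0, 0.0
    for _ in range(iterations):
        residuals = [y - (a0 + a1 * x) for x, y in zip(xs, ys)]
        # Partial derivatives of the squared error with respect to a0 and a1.
        grad_a0 = sum(-2.0 * r for r in residuals)
        grad_a1 = sum(-2.0 * r * x for r, x in zip(residuals, xs))
        # Move against the gradient.
        a0 -= step * grad_a0
        a1 -= step * grad_a1
    return a0, a1


# Made-up data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a0, a1 = gradient_descent(xs, ys)
print(a0, a1)  # approaches intercept 1 and slope 2
```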