“Where the money is made.” -Padhraic Smyth

Most learning algorithms can be described as a combination of:
* Model
* Objective Function
* Optimization Method

<latex>
\begin{tabular}{c c c c}
Method & Model & Objective Function & Optimization Method \\
\hline
Linear Regression & Weighted Sum & Squared Error & Linear systems of equations \\
Logistic Regression & h(Weighted Sum) & Log-Likelihood & Iterative, system of equations \\
Neural Network & Weighted sum of logistic regressions & Squared error & Gradient-based \\
Support vector machine & Sparse weighted sum & Margin & Convex optimization \\
Decision tree & Binary tree & Classification error & Greedy search over trees \\
\end{tabular}
</latex>

Classification
* Spam email
* Classify sentiment of a product review

Regression
* Predict a real-valued number

Ranking
* Find most likely candidates from a group

Training data: (features, label) pairs, typically represented as the rows of a table.

Want to learn a **model** to predict a label given features.
The model is typically represented as a function $f(x;\alpha)$ where $x$ are input features and $\alpha$ is a vector of parameters.

The quality of a model is evaluated with an **error function**.
* e.g. sum of squared error: $E_{train}(\alpha) = \sum_i \left[y_i - f(x_i;\alpha)\right]^2$.

Goal is to minimize the total error on training data. This is an optimization problem. There is occasionally a direct solution via e.g. linear algebra. Typically a gradient approach of some kind is needed.
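As a minimal sketch of the direct-solution case, assume a one-feature linear model $f(x;\alpha) = \alpha_0 + \alpha_1 x$. Setting the gradient of the sum of squared error to zero gives a closed-form answer. The function names and the toy data below are illustrative assumptions, not part of the original notes.

```python
def sum_squared_error(xs, ys, predict):
    """E_train = sum_i [y_i - f(x_i)]^2 for a given predictor."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))


def fit_line(xs, ys):
    """Closed-form least squares for f(x) = a0 + a1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a1 = sxy / sxx              # slope
    a0 = mean_y - a1 * mean_x   # intercept
    return a0, a1


# Toy data generated by hand from y = 1 + 2x (no noise).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

a0, a1 = fit_line(xs, ys)
train_error = sum_squared_error(xs, ys, lambda x: a0 + a1 * x)
print(a0, a1, train_error)
```

With more than one feature the same idea becomes a linear system in the parameter vector, which is where a linear algebra library would normally take over.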

Minimizing training error does not necessarily give the best possible prediction of future data. When test error starts to increase while training error keeps decreasing, the model is **overfitting**. Overfitting can be controlled by switching to a simpler model.
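A small sketch can make the gap between training and test error concrete. It compares a very flexible model (1-nearest-neighbour, which memorizes the training labels) with a simple straight-line fit. Both the choice of nearest-neighbour and the hand-picked toy numbers are assumptions made for illustration only.

```python
def sse(pairs, predict):
    """Sum of squared error of a predictor over (x, y) pairs."""
    return sum((y - predict(x)) ** 2 for x, y in pairs)


# Hand-picked toy data: training labels are roughly 2x plus noise,
# test labels are exactly 2x.
train = [(0.0, 0.3), (1.0, 1.8), (2.0, 4.4), (3.0, 5.7), (4.0, 8.2)]
test = [(0.4, 0.8), (1.4, 2.8), (2.4, 4.8), (3.4, 6.8)]


def nearest_neighbor(x):
    """Flexible model: copy the label of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


def fit_line(pairs):
    """Simple model: closed-form least-squares line a0 + a1 * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a1 = sxy / sxx
    return mean_y - a1 * mean_x, a1


a0, a1 = fit_line(train)


def line(x):
    return a0 + a1 * x


nn_train, nn_test = sse(train, nearest_neighbor), sse(test, nearest_neighbor)
line_train, line_test = sse(train, line), sse(test, line)
print("nearest neighbour: train", nn_train, "test", nn_test)
print("straight line:     train", line_train, "test", line_test)
```

On this data the memorizing model has zero training error but the larger test error, while the simpler line has some training error and the smaller test error.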

In practice, predictive models are limited by the Bias-Variance Tradeoff: simple models tend to underfit (high bias), while flexible models tend to overfit (high variance).

Common model forms:
* Linear weighted sums of the input variables (linear regression)
* Non-linear functions of linear weighted sums (logistic regression, neural networks, GLMs)
* Thresholded functions (decision trees)
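These three model forms can be sketched as plain functions. The names, the convention that `alpha[0]` is a bias term, and the single-split "stump" standing in for a decision tree are all illustrative assumptions.

```python
import math


def weighted_sum(x, alpha):
    """Linear model: alpha[0] + sum_j alpha[j+1] * x[j]."""
    return alpha[0] + sum(a * xj for a, xj in zip(alpha[1:], x))


def logistic(x, alpha):
    """Non-linear function h of a weighted sum (here h is the sigmoid)."""
    return 1.0 / (1.0 + math.exp(-weighted_sum(x, alpha)))


def stump(x, feature=0, threshold=0.5, left=0, right=1):
    """Thresholded function: a one-split decision tree."""
    return left if x[feature] <= threshold else right


print(weighted_sum([1.0, 2.0], [0.5, 1.0, -1.0]))  # 0.5 + 1.0 - 2.0
print(logistic([0.0], [0.0, 1.0]))                  # sigmoid(0)
print(stump([0.2]), stump([0.9]))
```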

To improve a model, first measure its performance and compare it to a baseline. Relative to the baseline's error rate, you can measure the reduction in error provided by switching to the model in question. You also want to establish that the reduction in error is not due to random chance. E.g. in classification, the simplest baseline is to always predict the most likely class, ignoring $x$. Alternatively, examining a confusion matrix can reveal patterns in the classifier's mistakes.
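A rough sketch of the baseline comparison, using made-up labels and predictions. For brevity the majority class is taken from the evaluation labels themselves; in practice it would normally come from the training data. The sketch also does not test whether the improvement could be due to chance.

```python
from collections import Counter

# Hypothetical true labels and classifier predictions.
y_true = ["spam", "ham", "ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]


def error_rate(truth, predicted):
    """Fraction of examples whose predicted label differs from the truth."""
    return sum(t != p for t, p in zip(truth, predicted)) / len(truth)


# Baseline: always predict the most common class, ignoring the features.
majority_class = Counter(y_true).most_common(1)[0][0]
baseline_pred = [majority_class] * len(y_true)

baseline_error = error_rate(y_true, baseline_pred)
model_error = error_rate(y_true, y_pred)
relative_reduction = (baseline_error - model_error) / baseline_error

# Confusion matrix as counts of (true label, predicted label) pairs.
confusion = Counter(zip(y_true, y_pred))

print("baseline error:", baseline_error)
print("model error:", model_error)
print("relative error reduction:", relative_reduction)
for (true_label, pred_label), count in sorted(confusion.items()):
    print(f"true={true_label} predicted={pred_label}: {count}")
```

The off-diagonal entries of the confusion matrix (true label differs from predicted label) show which kinds of mistakes the classifier makes.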

Regression
* Squared Error (L2)
* Absolute Error (L1)
* Robust loss, log-loss, log-likelihood

Classification
* classification error
* margin
* log-loss, log-likelihood
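Several of these objective functions are short enough to write out per example. The sketch below assumes binary labels in {0, 1} for the log-loss; the function names are illustrative.

```python
import math


def squared_error(y, pred):
    """L2 loss: penalizes large residuals heavily."""
    return (y - pred) ** 2


def absolute_error(y, pred):
    """L1 loss: grows linearly, so it is less sensitive to outliers."""
    return abs(y - pred)


def classification_error(y, pred):
    """0/1 loss: 1 for a wrong label, 0 for a correct one."""
    return 0 if y == pred else 1


def log_loss(y, p):
    """Negative log-likelihood of label y in {0, 1} when P(y = 1) = p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# A residual of 10 costs 100 under L2 but only 10 under L1.
print(squared_error(0.0, 10.0), absolute_error(0.0, 10.0))
# A confident correct prediction has lower log-loss than an unsure one.
print(log_loss(1, 0.9), log_loss(1, 0.5))
```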

Optimization frequently uses gradient descent.
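A minimal gradient descent sketch for the sum of squared error of a one-feature linear model $f(x;\alpha) = \alpha_0 + \alpha_1 x$. The learning rate, step count, and toy data are assumptions chosen so that this particular example converges; they are not general recommendations.

```python
def gradient_descent(xs, ys, learning_rate=0.02, steps=2000):
    """Minimize sum_i [y_i - (a0 + a1 * x_i)]^2 by repeated gradient steps."""
    a0, a1 = 0.0, 0.0
    for _ in range(steps):
        residuals = [y - (a0 + a1 * x) for x, y in zip(xs, ys)]
        # Partial derivatives of the squared error with respect to a0 and a1.
        grad_a0 = sum(-2.0 * r for r in residuals)
        grad_a1 = sum(-2.0 * r * x for r, x in zip(residuals, xs))
        # Step against the gradient.
        a0 -= learning_rate * grad_a0
        a1 -= learning_rate * grad_a1
    return a0, a1


# Toy data generated by hand from y = 1 + 2x (no noise).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

a0, a1 = gradient_descent(xs, ys)
print(a0, a1)
```

If the learning rate is too large the updates diverge, and if it is too small convergence is slow, so in practice it usually needs tuning.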