A **neural network** is a network of interacting **processing units**. Traditional neural networks tend to have a small number of hidden layers, guided by the Universal Approximation Theorem, and perform feature learning in those layers. Modern neural networks combine stochastic techniques and biology-inspired techniques with many hidden layers, and are referred to as **deep learning**.

Neural networks can be viewed as extensions of logistic regression.

For a single hidden layer, the model has:

- H **hidden units**
- d dimensions in the input data
- H d-dimensional weight vectors, one per hidden unit
- one H-dimensional vector of output weights

A simple example uses a single hidden layer. In general, the hidden-unit outputs and the final output are non-linear functions of x. You can add more hidden units, and you can add more hidden layers.

A trivial neural network with no hidden units reduces to a logistic regression function, y = σ(w · x).

With a hidden layer, the output becomes y = σ(Σ_h v_h z_h), where z_h is the output of hidden unit h and the v_h are output weights to be learned.
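The single-hidden-layer model described above can be sketched in NumPy. The names `W`, `v`, `z`, and the sigmoid activation for the hidden units are illustrative assumptions, not taken from the notes:

```python
import numpy as np

def sigmoid(a):
    """Logistic function, sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W, v, c):
    """One-hidden-layer forward pass (illustrative names).

    x : (d,)   input vector
    W : (H, d) one d-dimensional weight vector per hidden unit
    v : (H,)   output weights, one per hidden unit
    c : float  output bias
    """
    z = sigmoid(W @ x)       # hidden-unit outputs z_h
    y = sigmoid(v @ z + c)   # output: logistic regression on z instead of x
    return y

rng = np.random.default_rng(0)
d, H = 4, 3
x = rng.normal(size=d)
W = rng.normal(size=(H, d))
v = rng.normal(size=H)
y = forward(x, W, v, 0.0)    # a probability in (0, 1)
```

With H = 0 hidden units the hidden layer disappears and the model collapses back to plain logistic regression on x.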

A complication of neural networks is that their training objectives are not guaranteed to be convex. Feedforward networks are also easier to overfit by adding additional hidden units. The advantage over logistic regression is that neural networks are inherently more expressive.

Each processing unit inside the network applies an **activation function** to its inputs.

**Rectified Linear Units** (ReLUs), via Hahnloser et al. 2000: f(x) = max(0, x).

**Exponential Linear Units** (ELUs), via Clevert et al. 2015: f(x) = x for x > 0, and α(e^x − 1) otherwise.
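Both activations are one-liners in NumPy; this sketch uses α = 1.0 for the ELU, which is the common default but an assumption here:

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x), applied element-wise
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # ELU: x where x > 0, alpha * (exp(x) - 1) otherwise
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

xs = np.array([-2.0, -0.5, 0.0, 1.5])
relu(xs)   # -> array([0. , 0. , 0. , 1.5])
elu(xs)    # negative inputs saturate smoothly toward -alpha
```

Unlike ReLU, the ELU keeps a non-zero gradient for negative inputs, which is the motivation Clevert et al. give for it.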

Let's say we have d-dimensional data.

For a single layer of H hidden units, we need to learn:

- a bias for each visible unit
- a bias for each hidden unit
- a weight for each edge between them (d × H in total)
- probabilities p(h_j = 1 | v), the probability that each hidden unit is activated
- probabilities p(v_i = 1 | h), the probability that each visible unit is activated
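The quantities above can be sketched for a small binary RBM in NumPy. The shapes and the names `b`, `c`, `W` are illustrative assumptions, not notation from the notes:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d, H = 6, 4                     # visible and hidden layer sizes (assumed)
rng = np.random.default_rng(1)
b = np.zeros(d)                 # one bias per visible unit
c = np.zeros(H)                 # one bias per hidden unit
W = rng.normal(scale=0.1, size=(d, H))  # one weight per visible-hidden edge

v = rng.integers(0, 2, size=d).astype(float)  # a binary visible vector

p_h = sigmoid(c + v @ W)        # p(h_j = 1 | v) for each hidden unit
h = (rng.random(H) < p_h).astype(float)       # sample binary hidden states
p_v = sigmoid(b + W @ h)        # p(v_i = 1 | h) for each visible unit
```

Because the graph is bipartite, each hidden probability depends only on the visible layer and vice versa, so both conditionals factorize and can be computed in one matrix product.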

Restricted Boltzmann Machines (RBMs) form a bipartite graph with hidden and visible units. Nodes in the same layer are *restricted* from having dependencies or connections to each other. Many neural networks are analogous to stacks of RBMs.

Comparable to information compression techniques, autoencoders restrict the number of hidden nodes to fewer than the number of input values. This forces the hidden layer to learn a compressed representation that captures some useful aspect of the input signal.
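A minimal sketch of such an undercomplete autoencoder's forward pass, assuming a sigmoid encoder and a linear decoder (the names and shapes are illustrative, not from the notes):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d, k = 8, 3                     # k < d: the bottleneck forces compression
rng = np.random.default_rng(2)
W_enc = rng.normal(scale=0.1, size=(k, d))  # encoder weights
W_dec = rng.normal(scale=0.1, size=(d, k))  # decoder weights

x = rng.normal(size=d)
code = sigmoid(W_enc @ x)       # k-dimensional compressed representation
x_hat = W_dec @ code            # reconstruction of the d-dimensional input
loss = np.mean((x - x_hat) ** 2)  # reconstruction error to be minimized
```

Training adjusts `W_enc` and `W_dec` to minimize the reconstruction error, so the k-dimensional code must retain whatever structure in x is needed to rebuild it.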