Michael's Wiki

Dimensionality reduction is an unsupervised learning technique commonly used to make learning from data more reliable, remove extraneous or noisy variables, or computation more tractable. Map from original data space in $X^i$ in $i$ dimensions to a new data space. Related to the task of compression: How much of the underlying structure or information of the original representation is retained?

a.k.a. Variable selection. Given d variables, want to select k.

Stepwise Approaches

a.k.a. greedy approaches. The “best” one to use probably depends on how many irrelevant variables are expected.

As greedy approaches, these are non-optimal. They also may require training th emodel $O(d^2)$ times, which could become expensive for large d.

Forward selection

Train a model for each variable and compute error each time. Select the variable that gives the lowest error E (out of the d candidates). Now evaluate the error given by this variable with one additional variable (out of the [d-1] remaining candidates). Continue adding variables until k have been selected.

Backward selection

Starting with a model that uses all variables, iteratively remove the variable which reduces the most error.

Embedding methods

These tend to do well when the data looks Gaussian, but easily throw away important data when there is something multi-modal happening.

Linear Projections

$x$ is a d-dimensional vector of data measurements, and $a$ is a d-dimensional column weight vector. Assume $a^T A = 1$. Can

$a^T x$ is a projection of $x$ onto one dimension. Can generalize to use a matrix of multiple $a$ vectors to produce more than one dimension. Let $A$ be this matrix.

Principal component analysis is used to do this repeatedly and well. From linear algebra, the eigendecomposition produces eigenvectors and eigenvalues, which when ordered by size produce principal components.

Overall complexity is $O(Nd^2 + d^3)$

• * If N is much larger than d, the first term will dominate.
• * For sparse data, can be optimized to run much faster.
• * Can also be optimized when only interested in the first $k$ eigenvectors/values

Principal components are frequently nice and interpretable.

• * e.g. “eigenimages”

e.g. Lasso

Multidimensional Scaling

Want to find k-dimensional coordinates for each of the N objects such that the Euclidean distances in embedded space matches the known dissimilarity matrix as closely as possible. Data is a distance matrix or dissimilarity matrix.

• * N objects
• * 0s on diagonal
• * symmetric

Typically use a “stress” objective function. $$S = \sum_{i,j} (d(,j) - \delta(i,j))^2$$ Where $d(i,j)$ is the Euclidean distance in the new embedded k-dimensional space, and $\delta(i,j)$ is the Euclidean distance in the original space.

PCA Approach

If the distance matrix represents actual Euclidean distance, then the result tends to be similar to PCA.

Local iterative hill-descending. Non-trivial and can be caught in local minima. Initilization can be random or use a heuristic. $O(N^2k)$ work per iteration.