Michael Data

Visualization

“When you can plot stuff, plot it.” - Padhraic Smyth

Histograms

Histograms provide a simple way to visualize the shape of univariate data. They can also help identify outliers quickly.

Choose the number of bins with care. Too many bins produce spurious spikes and empty bins. Too few bins hide the shape of the distribution.
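The effect of the bin count can be checked numerically before plotting anything. The following is a minimal sketch, assuming equal-width bins and a synthetic normally distributed sample; the helper name `histogram_counts` and the bin counts tried are illustrative choices, not something prescribed by these notes.

```python
import random

def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins.

    Assumes the values are not all identical (otherwise the bin width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp the index so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts

# Synthetic sample (an assumption for illustration): 200 draws from N(0, 1).
random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(200)]

for num_bins in (3, 15, 200):
    counts = histogram_counts(sample, num_bins)
    empty = sum(1 for c in counts if c == 0)
    print(f"{num_bins:>3} bins: {empty} empty, tallest bin holds {max(counts)} points")
```

Comparing the number of empty bins and the height of the tallest bin across settings gives a rough sense of when the binning is too fine or too coarse.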

Histograms are also less useful when the data is heavily skewed, because most observations crowd into a few bins. It is frequently useful to transform the data first, e.g. by taking the log, and then plot the transformed data.
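To illustrate why a log transform helps, the sketch below measures skewness before and after the transform. It assumes strictly positive, log-normally distributed synthetic data; the `skewness` helper is a hypothetical name and uses the simple population formula (the mean cubed z-score).

```python
import math
import random
import statistics

def skewness(values):
    """Population skewness: the mean cubed z-score (0 for symmetric data)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Synthetic right-skewed data (an assumption for illustration).
random.seed(1)
raw = [random.lognormvariate(0.0, 1.0) for _ in range(5000)]

# The log is only defined for positive values, which holds for this sample.
logged = [math.log(v) for v in raw]

print(f"skewness of raw values:  {skewness(raw):.2f}")
print(f"skewness of log(values): {skewness(logged):.2f}")
```

A skewness much closer to zero after the transform suggests that a histogram of the transformed values will spread the observations more evenly across bins.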

Histograms can also be misleading for small data sets, where the apparent shape depends heavily on where the bin boundaries fall. They can be smoothed with various techniques, such as kernel density estimation. They also effectively only work on one variable at a time.
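One common smoothing technique is kernel density estimation, which replaces each data point with a small bump and averages the bumps. The notes do not name a specific method, so the following is only a sketch of that one option, assuming a Gaussian kernel and a hand-picked bandwidth of 0.5; the data values are made up for illustration.

```python
import math

def gaussian_kde(values, bandwidth):
    """Return a function that estimates density as an average of Gaussian bumps."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density

# A small, made-up data set with two loose clusters.
data = [1.0, 1.2, 1.9, 2.1, 2.3, 4.8, 5.0]
density = gaussian_kde(data, bandwidth=0.5)

# Crude text rendering of the smoothed curve on a grid from 0.0 to 6.0.
for i in range(13):
    x = i * 0.5
    print(f"{x:4.1f} " + "#" * round(density(x) * 40))
```

The bandwidth plays the same role as the bin width in a histogram: a small value gives a spiky curve, a large value flattens real structure.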

Box Plots

Box plots are good for comparing a common variable across multiple groups. The “box” portion spans the middle 50% of the data points (the interquartile range). Whiskers extend beyond the box, and outliers are drawn as individual points past the whiskers.
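The numbers behind a box plot can be computed directly. The sketch below assumes the widely used convention that whiskers reach the most extreme points within 1.5 times the interquartile range of the box and that anything beyond is an outlier; the notes do not specify a rule, and plotting libraries differ in how they compute quartiles. The function name and the two groups are hypothetical.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR convention."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }

# Two made-up groups measured on the same variable.
groups = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 30],
    "B": [4, 5, 5, 6, 6, 7, 7, 8, 8, 9],
}
for name, values in groups.items():
    print(name, box_plot_summary(values))
```

Printing the summaries side by side mirrors what the plot shows: the position and width of each box, and which points fall outside the whiskers.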

Correlations

The diagonal of a correlation matrix plot should always show +1, since every variable is perfectly correlated with itself.
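The diagonal makes a convenient sanity check on a computed correlation matrix. The following sketch assumes Pearson correlation and uses made-up column names and values; `pearson` and `correlation_matrix` are illustrative helpers rather than functions from any particular library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Made-up measurements for illustration.
columns = {
    "height": [150, 160, 170, 180, 190],
    "weight": [52, 58, 69, 80, 88],
    "age": [30, 25, 41, 38, 22],
}
matrix = correlation_matrix(columns)

for name in columns:
    # Each variable correlates perfectly with itself, up to rounding error.
    assert math.isclose(matrix[name][name], 1.0)
    print(name, [round(matrix[name][other], 2) for other in columns])
```

If a diagonal entry differs noticeably from +1, the matrix was probably built from misaligned columns or the wrong statistic.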

Overplotting

With large data sets, a plot can become so crowded that many points lie on top of each other and it is difficult to see any structure in the image. Simplifications such as contour plots may help here.
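Density-based simplifications such as contour plots start from counts of points per region rather than from the individual points. The sketch below shows that first step only, assuming a square grid with a hand-picked cell size and synthetic two-dimensional Gaussian data; `grid_counts` is a hypothetical helper name.

```python
import math
import random
from collections import Counter

def grid_counts(points, cell_size):
    """Count points per square grid cell; keys are (column, row) cell indices."""
    counts = Counter()
    for x, y in points:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        counts[cell] += 1
    return counts

# Synthetic point cloud (an assumption for illustration).
random.seed(2)
points = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(20000)]

cells = grid_counts(points, cell_size=0.5)
print(f"{len(points)} points reduced to {len(cells)} occupied cells")
for cell, count in cells.most_common(3):
    print(f"cell {cell}: {count} points")
```

The per-cell counts can then be drawn as a heat map or contoured, so that dense regions stay readable no matter how many points fall into them.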

Linked Plotting

Interactive visualizations can link plots together, e.g. clicking a data point in one plot highlights that point in the other plots.

Aesthetic Tweaks

Color can easily distinguish different groups within a plot.

Visualizing Time-Series Data

Seasonality in time-dependent data is very common. It is important to identify appropriate time units with which to make comparisons, e.g. the same month of the previous year, as opposed to just “last month”.
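A small numeric example shows how the choice of comparison period changes the conclusion. The monthly totals below are entirely hypothetical, built with a December peak and roughly 10% yearly growth; the helper names are illustrative.

```python
def pct_change(new, old):
    """Percentage change from old to new."""
    return 100.0 * (new - old) / old

def previous_month(year, month):
    """The calendar month immediately before (year, month)."""
    return (year - 1, 12) if month == 1 else (year, month - 1)

def same_month_last_year(year, month):
    """The same calendar month one year earlier."""
    return (year - 1, month)

# Hypothetical monthly totals keyed by (year, month).
sales = {
    (2022, 11): 100.0, (2022, 12): 180.0,
    (2023, 1): 90.0,
    (2023, 11): 110.0, (2023, 12): 198.0,
    (2024, 1): 99.0,
}

period = (2024, 1)
vs_last_month = pct_change(sales[period], sales[previous_month(*period)])
vs_last_year = pct_change(sales[period], sales[same_month_last_year(*period)])

print(f"vs. last month:           {vs_last_month:+.1f}%")
print(f"vs. same month last year: {vs_last_year:+.1f}%")
```

In this made-up series the month-over-month comparison mostly reflects the seasonal December peak, while the year-over-year comparison isolates the underlying growth.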