I've been yelled at for my graphs a lot. After yet another chiding, my adviser suggested that I write up a blog post on the bare minimum requirements for competent data visualization.

1. Label Everything

Or well, not everything, but everything that isn't obvious. Label every axis as clearly as possible, and include units when plotting something that has them, like time, temperature, or calories.

The x axis doesn't get labeled because the tick marks clearly indicate that these are dates. The figure title alludes to the relationship or pattern the graph is attempting to describe. Never title a graph something like "views vs. uniques vs. dates"; that's what labels are for. Explain the big picture in the title. And when labeling all your data, don't let your labels obscure information.


Instead, move the offending label so that the data is clearly visible and label anything that needs it.
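As a rough sketch of those labeling rules in matplotlib (the dates, view counts, and label text are made up for illustration and are not the data behind the graphs in this post):

```python
import datetime as dt
import matplotlib.pyplot as plt

# Hypothetical data: daily page views for one week (made-up numbers).
days = [dt.date(2013, 7, 1) + dt.timedelta(days=i) for i in range(7)]
views = [120, 135, 160, 410, 180, 150, 140]

fig, ax = plt.subplots()
ax.plot(days, views, marker="o")

# The y axis needs a label with units; the date ticks on the x axis
# already say what they are, so the x axis is left unlabeled.
ax.set_ylabel("Page views (per day)")
# The title states the pattern rather than listing the variables.
ax.set_title("Views spiked mid-week")

# Label the peak, offsetting the text so it does not sit on top of the data.
peak = views.index(max(views))
ax.annotate("peak", xy=(days[peak], views[peak]),
            xytext=(20, -5), textcoords="offset points",
            arrowprops={"arrowstyle": "->"})
fig.autofmt_xdate()
```

If a label still covers a point, nudging the `xytext` offset is usually enough to clear it.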

2. Work with the Numbers

Be aware of the sort of scale the data lies on. Is it in the millions, hundreds, microns? What scale is the data usually on? Is the scale affecting the interpretation in any way?


In the graph above, the data looks fairly consistent 'cause the scale was set somewhat arbitrarily. On the other hand, in the graph below the data looks highly variable because the scale is set to the data. An accurate scale would be based on the typical number of comments an open thread gets.
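Here is a minimal sketch of the same comparison, assuming made-up comment counts and an assumed "typical" range of 0 to 500 comments per thread; a real typical range would have to come from the site's history:

```python
import matplotlib.pyplot as plt

# Hypothetical comment counts for six open threads (made-up numbers).
comments = [212, 230, 198, 241, 225, 207]
# Assumed typical range for an open thread; replace with one based on real history.
typical_range = (0, 500)

fig, (ax_auto, ax_typical) = plt.subplots(1, 2, figsize=(8, 3))

# Left: matplotlib fits the axis to the data, so small wiggles look dramatic.
ax_auto.plot(comments)
ax_auto.set_title("Scale fit to the data")

# Right: the axis spans the typical range, so the same wiggles look modest.
ax_typical.plot(comments)
ax_typical.set_ylim(*typical_range)
ax_typical.set_title("Scale set to the typical range")

for ax in (ax_auto, ax_typical):
    ax.set_xlabel("Thread number")
    ax.set_ylabel("Comments")
fig.tight_layout()
```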

3. Choose Colors Carefully

The following graphs all visualize the exact same data, but you'd be hard-pressed to know that because the color messes with the interpretation. Stick to a few hard and fast basics and you'll be fine. If you want to get all fancy, ColorBrewer is a fantastic resource for colormaps and COLOURlovers is great for palettes.

Sequential data: use colors in the same family (pale red to dark red, or yellows to oranges to reds, or the like) and arrange them from the lightest, mildest color to the most vivid in the order that matches your data. Is a large value in your data important? That's red. Is a small value important? Then that's red.

Divergent data: if your data centers around some neutral central value, like hot and cold temperatures, make that central value a neutral color (like white) and use opposing colors on each side of the neutral (like shades of red for hot and shades of blue for cold).

Qualitative data: for categories such as whether a town is rural, urban, or suburban, use a colormap where the colors are different enough from each other that it is really clear the categories are independent.
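The three cases might look something like this in matplotlib. The grids are random made-up values, and `Reds`, `RdBu_r`, and `Set2` are just one reasonable pick each from matplotlib's built-in ColorBrewer-derived colormaps, not the ones used in the graphs in this post:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Hypothetical 5x5 grids (random, made-up values).
rng = np.random.default_rng(0)
counts = rng.integers(0, 100, size=(5, 5))    # sequential: low to high
anomalies = rng.normal(0, 3, size=(5, 5))     # divergent: centered on 0
town_type = rng.integers(0, 3, size=(5, 5))   # qualitative: 0 rural, 1 suburban, 2 urban

fig, axes = plt.subplots(1, 3, figsize=(10, 3))

# Sequential: one color family, pale for small values, vivid for large ones.
im_seq = axes[0].imshow(counts, cmap="Reds")
axes[0].set_title("Sequential")

# Divergent: symmetric limits keep the neutral color pinned to 0,
# with blue on the cold side and red on the hot side.
limit = np.abs(anomalies).max()
im_div = axes[1].imshow(anomalies, cmap="RdBu_r", vmin=-limit, vmax=limit)
axes[1].set_title("Divergent")

# Qualitative: three clearly distinct colors, one per category.
qualitative_cmap = ListedColormap(plt.get_cmap("Set2").colors[:3])
im_qual = axes[2].imshow(town_type, cmap=qualitative_cmap, vmin=-0.5, vmax=2.5)
axes[2].set_title("Qualitative")

fig.colorbar(im_seq, ax=axes[0])
fig.colorbar(im_div, ax=axes[1])
```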

4. Know the Audience

As is evident from this post, I'm a huge fan of xkcd style plots. I think they're kind of adorable for blog posts, but that's about the only place they're useful. Sometimes they can be worked into a presentation and sometimes they just won't go over well, and creating for the audience is key. Pay attention to font styles, line styles, point sizes, line weights, font sizes, and anything else that's a part of how the graph looks rather than the information in it, because those read very differently in a paper, in a presentation, and on the web.

While I may feel the following graph is better suited for the web because of its casualness and very heavy line weights, the one underneath would be far more appropriate in an academic environment.
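One way to switch looks per audience in matplotlib, as a sketch: `plt.xkcd()` gives the casual style, and `plt.rc_context` applies a temporary set of style settings. The data is made up, and the paper settings below are an assumed starting point rather than any journal's requirements:

```python
import matplotlib.pyplot as plt

# Hypothetical series (made-up numbers).
x = list(range(10))
y = [v ** 0.5 for v in x]

def draw(ax, title):
    """Draw the same data; only the surrounding style differs."""
    ax.plot(x, y)
    ax.set_xlabel("Week")
    ax.set_ylabel("Posts")
    ax.set_title(title)

# Casual, heavy-lined look for a blog post.
with plt.xkcd():
    fig_web, ax_web = plt.subplots()
    draw(ax_web, "For the web")

# Thinner lines, smaller serif text, and a column-sized figure for a paper.
# These values are a matter of taste and venue.
paper_style = {"lines.linewidth": 1.0, "font.size": 9, "font.family": "serif"}
with plt.rc_context(paper_style):
    fig_paper, ax_paper = plt.subplots(figsize=(3.5, 2.5))
    draw(ax_paper, "For a paper")
```

Because both styles are applied through context managers, the global defaults are left alone for any other figures in the same notebook.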

5. Use the Correct Graph

I know that this should seem trivial, but all too often people either get enraptured with a visualization technique (chord graphs) or are really comfortable with a handful of techniques and try to use them where they just don't belong.

This should not be a line graph because the data points are independent of each other, and connecting them makes it hard to see what is going on. A bar chart does a much better job of showing how much the different hackerspace authors are dominating the front page at the moment.
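A small sketch of the bar chart version, with placeholder author names and post counts rather than the real front-page data:

```python
import matplotlib.pyplot as plt

# Hypothetical front-page post counts per author (made-up names and numbers).
posts_by_author = {"author A": 7, "author B": 3, "author C": 1, "author D": 4}

# Sort so the dominant authors are easy to spot.
ranked = sorted(posts_by_author.items(), key=lambda item: item[1], reverse=True)
authors = [name for name, _ in ranked]
counts = [count for _, count in ranked]

fig, ax = plt.subplots()
# One bar per author: nothing is drawn between categories, so the chart
# does not imply a trend from one author to the next.
ax.bar(authors, counts)
ax.set_ylabel("Posts on the front page")
ax.set_title("One author dominates the front page")
```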

And yes, everyone hates on pie charts, but they're reasonably good at showing how parts relate to a whole. There's a great argument for pie charts only having two slices (how one part relates to a whole), but the key is that the proportionality of the slice must be clearly identifiable.

There's no hard and fast rule for choosing the correct type. My adviser's suggestion is to think about the questions the visualization is supposed to answer and then work backward from there. There's a similar concept among data visualization people of starting with the story the data is supposed to tell.

For more resources, check out my IPython Notebook of all the graphs in this post, fosslien's Stop Visually Assaulting Me, Andrew Abela's Chart Chooser, Krygier's Geo 353, and pretty much anything by Tufte.

ETA: Edited some of the graphs per Kuiper Belle's suggestions. The notebook has been updated to show the new graphs and still has the old graphs in it.

ETA2: Alejandro and lostwallet brought up a great addendum to rule #2. When plotting two things of the same type (like pageviews) and in a similar range, keep the scales consistent within the graph and across graphs.
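Within a single matplotlib figure, `sharey=True` is one way to keep the scales consistent. Here is a sketch with made-up pageview numbers for two hypothetical sites:

```python
import matplotlib.pyplot as plt

# Hypothetical daily pageviews for two sites (made-up numbers).
site_a = [120, 135, 160, 150, 140]
site_b = [320, 310, 335, 350, 330]

# sharey=True puts both panels on one y scale, so heights are directly comparable.
fig, (ax_a, ax_b) = plt.subplots(1, 2, sharey=True, figsize=(8, 3))
ax_a.plot(site_a)
ax_a.set_title("Site A")
ax_b.plot(site_b)
ax_b.set_title("Site B")

ax_a.set_ylabel("Pageviews (per day)")
for ax in (ax_a, ax_b):
    ax.set_xlabel("Day")
fig.tight_layout()
```

Across separate figures there is nothing to share, so the same effect takes an explicit `set_ylim` with identical limits on each one.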