Over the past few months, I've begun to notice a variety of parallel coordinates-like plots appearing on various online news outlets. As a visualisation researcher, I find this really exciting, because parallel coordinates plots (often abbreviated to the unfortunate initialism PCP) are usually considered more suitable for analysis tasks than for disseminative ones.
A tweet by the brilliant Andy Kirk (@visualisingdata) about parallel coordinates being used to show statistics on Olympic athletes made me realise that it's not just me noticing it. Andy in fact puts forward the rather nice notion that parallel coordinates plots have suddenly been released from their academic cage and into the wild, ready for the general public. I quite liked that, and have shamelessly stolen (or rather, adapted) the idea for the title of this post.
Who usually uses Parallel Coordinates Plots?
A key step in the design of a visualisation is figuring out the target audience, as this will dictate which representation and design choices one should be making. For example, the tasks of data analysis and data dissemination are very different. The former is more exploratory and hypothesis-driven, while the latter is driven more by the need to 'get a message across' with as little requirement for interaction as possible.
The usual target audience for a parallel coordinates plot (PCP) is a data analyst, who is tasked with discovering correlations and patterns within a dataset that spans many dimensions. PCPs are a great 'default' choice for these kinds of tasks, because they tend to quickly reveal the main patterns and trends within the data.
A useful dataset for explaining how PCPs work is the 'cars' dataset, as used in Mike Bostock's example. Imagine that you want to buy a car, and you're trying to come up with a good compromise between features such as power, efficiency, and price. With a little experimentation, you can immediately see on the cars PCP that there is a positive correlation between the number of cylinders and horsepower, because low values in one tend to accompany low values in the other, and so on. This agrees with our intuition about car engines. You'll also notice a negative correlation between economy and cylinders, because high values in one tend to accompany low values in the other, and vice versa. Again, this agrees with our intuition: you generally don't get 70 MPG from a V8.
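To make the 'low goes with low, high goes with low' reading concrete, here is a minimal Python sketch. The numbers are made up for illustration and are not taken from the real cars dataset, and `pearson` and `crossing_fraction` are hypothetical helper names. It assumes both axes are drawn with high values at the same end, so that two records' lines cross between adjacent axes exactly when their order flips from one axis to the next.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def crossing_fraction(xs, ys):
    """Fraction of record pairs whose lines cross between two adjacent axes."""
    pairs = list(combinations(range(len(xs)), 2))
    crossings = sum(
        1 for i, j in pairs if (xs[i] - xs[j]) * (ys[i] - ys[j]) < 0
    )
    return crossings / len(pairs)

# Made-up illustrative values, NOT the real cars dataset.
cylinders = [4, 4, 6, 6, 8, 8]
horsepower = [70, 90, 105, 120, 150, 200]
mpg = [34, 30, 22, 20, 15, 12]

# Positive correlation: few or no crossings, lines run roughly in parallel.
print(pearson(cylinders, horsepower), crossing_fraction(cylinders, horsepower))
# Negative correlation: most lines cross, giving the familiar 'X' pattern.
print(pearson(cylinders, mpg), crossing_fraction(cylinders, mpg))
```

Roughly parallel lines between two axes suggest a positive correlation, while a bundle of crossing lines suggests a negative one. That is the pattern a trained analyst looks for.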
We cannot reasonably expect visitors to a graphic on a news website to identify positive and negative correlations in a parallel coordinates plot without some training. And unless they are very interested in your graphic, willing to learn, patient, and can spare the time, they may end up just moving on.
That's why I was surprised and impressed when I saw The Guardian's Olympics parallel coordinates-like plot. They identified what was important to show based on these constraints, and produced something that conveys the message well. So let's take a look at it!
The Olympic Plot
You can see the plot in question for yourself at the Guardian's website.
Personally, I think the team who worked on this plot did an excellent job. It's very pleasing to look at due to the choice of colours, the circles at data/axis intersections, and the use of an orthogonal 'Manhattan'-like polyline connecting the dimensions rather than a straight one.
It's also easy to use, because a data record (a particular athlete) is always selected as the focus. A Voronoi partitioning of the intersection points (shown as coloured circles) is made, and the cell containing the cursor is the selected record. This pre-partitioning relieves the visualisation of the task of continually finding the nearest point to the cursor, offloading the task to the browser's polygon hit testing.
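The Voronoi cell containing the cursor is, by definition, the cell of the nearest intersection point. The per-mouse-move lookup that the pre-partitioning avoids can therefore be sketched as a brute-force nearest-point search. This is not the Guardian's code; `nearest_record` and the coordinates below are hypothetical and purely illustrative.

```python
from math import hypot

def nearest_record(cursor, points):
    """Return the record id of the intersection point closest to the cursor.

    Brute force: every point is checked on every call.
    """
    cx, cy = cursor
    record_id, _, _ = min(points, key=lambda p: hypot(p[1] - cx, p[2] - cy))
    return record_id

# Hypothetical intersection points: (record id, x, y) in screen pixels.
# Each athlete contributes one point per axis.
points = [
    ("athlete_a", 100, 40), ("athlete_a", 160, 120),
    ("athlete_b", 180, 40), ("athlete_b", 60, 120),
]

print(nearest_record((110, 50), points))   # athlete_a
print(nearest_record((70, 110), points))   # athlete_b
```

With a precomputed Voronoi partition, each cell becomes a polygon tagged with its record, and the browser's own hit testing reports which polygon is under the cursor, giving the same answer without the loop.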
Hiding the Data Context
The most immediate difference between The Guardian's plot and a more conventional PCP is that the data context lines (that is, the polylines for all records aside from the one being highlighted) have been hidden. We do, however, still have the little circles representing their intersections, which leaves us with:
- The highlighted athlete in full (the focus record) with polylines
- The distribution of values along each discipline for all athletes (the context records) as little circles
This is in contrast to the more traditional usage of a PCP, whereby all records (the context) are shown in full with polylines connecting the dimensions together, and a selection of 'records of interest' is displayed in a highlighted colour. These records are usually selected using a technique called brushing, which defines an 'in-selection' range along an axis. You can see brushing in action for yourself on the cars plot example. The highlighted records are then those passing through all such brushes.
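As a rough sketch of the selection logic behind brushing, the following Python keeps only the records that fall inside every brushed range. The `brushed` helper, the record fields, and the values are all made up for illustration.

```python
def brushed(records, brushes):
    """Return the records whose values lie inside every brush range.

    `brushes` maps an axis name to an inclusive (low, high) range;
    axes without a brush place no constraint on the record.
    """
    return [
        record for record in records
        if all(low <= record[axis] <= high
               for axis, (low, high) in brushes.items())
    ]

# Made-up records, not the real cars dataset.
cars = [
    {"name": "car_a", "cylinders": 4, "mpg": 34, "horsepower": 70},
    {"name": "car_b", "cylinders": 6, "mpg": 21, "horsepower": 110},
    {"name": "car_c", "cylinders": 8, "mpg": 14, "horsepower": 190},
    {"name": "car_d", "cylinders": 4, "mpg": 28, "horsepower": 88},
]

# Two brushes: at most six cylinders, and at least 25 MPG.
brushes = {"cylinders": (4, 6), "mpg": (25, 40)}

highlighted = [car["name"] for car in brushed(cars, brushes)]
print(highlighted)  # ['car_a', 'car_d']
```

Everything not returned by `brushed` would be drawn as grey context, and the returned records in the highlight colour.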
This design choice of hiding the context goes hand-in-hand with the decision to connect data records using the orthogonal lines, as we can observe below where I have modified the chart to display all context records in grey and the focus record in orange.
Due to the orthogonal polylines, the characteristic patterns of a parallel coordinates plot that indicate positive and negative correlations are missing, and all that we can be sure of is the distribution of the data records along each dimension. It is, for example, almost impossible to tell what correlations exist (if any) between an athlete's 100m hurdles score and their high jump score.
This is quite by design, as the primary purposes of this graphic are to convey individual athlete performance across multiple disciplines, and allow the user to inspect the general differences between athletes of varying success within the 2016 games.
The difficulty of attempting to spot correlations with these orthogonal polylines can be easily understood by trying to follow one of the grey context lines from one dimension to another. You can be sure of the direction in which the next dimensional value lies, thanks to the small curve in that direction. However, one loses the ability to complete the full journey because all such lines inevitably meet horizontally in the middle as one single line.
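The geometry behind this can be sketched in a few lines of Python. The layout here is an assumption: two horizontal axes stacked one above the other, with each record's value mapped to an x position on each axis. The rounded corners of the real plot are ignored, and `straight_path` and `orthogonal_path` are hypothetical names.

```python
def straight_path(x0, y0, x1, y1):
    """Conventional PCP segment: one straight line between the two axes."""
    return [(x0, y0), (x1, y1)]

def orthogonal_path(x0, y0, x1, y1):
    """Orthogonal segment: down to the midpoint, across, then down again."""
    y_mid = (y0 + y1) / 2
    return [(x0, y0), (x0, y_mid), (x1, y_mid), (x1, y1)]

# Assumed layout: one axis at y = 0 and the next at y = 100.
AXIS_TOP, AXIS_BOTTOM = 0, 100

# Made-up records: (x on the first axis, x on the second axis).
# The last record is an outlier.
records = [(20, 30), (40, 55), (60, 10)]

straight = [straight_path(a, AXIS_TOP, b, AXIS_BOTTOM) for a, b in records]
orthogonal = [orthogonal_path(a, AXIS_TOP, b, AXIS_BOTTOM) for a, b in records]

# Straight lines: each record has its own horizontal offset, so the
# outlier's very different slope stands out.
offsets = [end[0] - start[0] for start, end in straight]
print(offsets)  # [10, 15, -50]

# Orthogonal lines: every crossbar sits at the same height, so the
# crossbars overlap and individual records can no longer be followed.
crossbar_heights = {path[1][1] for path in orthogonal}
print(crossbar_heights)  # {50.0}
```

The slope information that distinguishes records in the straight version collapses into a single shared horizontal line in the orthogonal version.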
You can see for yourself in the interactive example below. On the left is a scatter plot, and on the right is the PCP of the same data set. There's a fairly good positive correlation between the two dimensions here, except for one outlier which shows nicely in the PCP due to the relatively large difference in slope.
Try changing the line type to orthogonal using the radio buttons. Can you still identify this outlier on the PCP? You can hover over the outlier in the scatter plot to reveal it in the PCP.
Spend some time adding your own data points by clicking directly on the scatter plot. Existing points are removed by clicking them. By playing around, you can quickly get a feel for what kinds of patterns are produced in a PCP for particular distributions of data.
A Drive for Greater Visual Literacy
Any visualisation designer must make sensible compromises in data representation when considering an audience that, firstly, cannot be assumed to be familiar with the representation and, secondly, does not have the time to learn it.
The design choices made for the Olympics plot are very sensible. The straight lines used for correlation hunting in conventional plots are not required, and are replaced with the simpler orthogonal representations. There is a definite aesthetic benefit in doing so, making the plot appear somewhat calmer and less 'sciencey'. Yes, that's the technical term.
It's really awesome that we are seeing these simplified versions of analysis-driven plot representations appearing in our media. The more readily they can be understood, the closer we get to achieving a higher degree of visual literacy amongst people outside of the academic field of visualisation.
It would certainly be interesting to see whether, in time, such talented designers can steadily increase the complexity of plots until brushing and correlation hunting are as familiar as comparing pie slices. And hopefully a heck of a lot easier.