Alluvial diagrams

Thu, 27 Mar 2014 00:00:00 +0000

A parallel coordinates plot is one of the tools for visualizing multivariate data. Every observation in a dataset is represented by a polyline that crosses a set of parallel axes, one per variable in the dataset. You can create such plots in R using the function parcoord in package MASS. For example, we can create such a plot for the built-in dataset mtcars:

library(MASS)
library(colorRamps)
data(mtcars)
k <- blue2red(100)
x <- cut( mtcars$mpg, 100)
op <- par(mar=c(3, rep(.1, 3)))
parcoord(mtcars, col=k[as.numeric(x)])

par(op)

The lines are colored using a blue-to-red color ramp according to miles per gallon (mpg, the first variable).
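The color mapping can be looked at in isolation. Below is a minimal sketch that uses base R's colorRampPalette as a stand-in for colorRamps::blue2red (an assumption made only to avoid the extra package; the intermediate hues will differ from those in the plot).

```r
data(mtcars)
# 100-step blue-to-red palette from base R (stand-in for colorRamps::blue2red)
pal <- colorRampPalette(c("blue", "red"))(100)
# cut() splits the range of mpg into 100 equal-width bins;
# as.integer() turns the resulting factor into bin numbers 1..100
bin <- as.integer(cut(mtcars$mpg, 100))
# one color per car: low mpg is blue, high mpg is red
cols <- pal[bin]
head(data.frame(mpg = mtcars$mpg, bin = bin, col = cols))
```

The vector cols can be passed as the col argument of parcoord in the same way as k[as.numeric(x)].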

What to do if some of the variables are categorical? One approach is to use polylines of different widths. Another is to add some random noise (jitter) to the values. The Titanic data is a cross-classification of Titanic passengers according to class, gender, age, and survival status (survived or not), so all of its variables are categorical. Let’s try the jittering approach. After converting the cross-classification (an R table) to a data frame we “blow it up” by repeating observations according to their frequency in the table.

data(Titanic)
# convert to data frame of numeric variables
titdf <- as.data.frame(lapply(as.data.frame(Titanic), as.numeric))
# repeat obs. according to their frequency
titdf2 <- titdf[ rep(1:nrow(titdf), titdf$Freq) , ]
# new columns with jittered values
titdf2[,6:9] <- lapply(titdf2[,1:4], jitter)
# colors according to survival status, with some transparency
k <- adjustcolor(RColorBrewer::brewer.pal(3, "Set1")[titdf2$Survived], alpha=.2)
op <- par(mar=c(3, 1, 1, 1))
parcoord(titdf2[,6:9], col=k)

Figure 1: Red lines are for passengers who did not survive.

par(op)
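The “blow up” step may be easier to follow on a toy table. This is a minimal sketch; the data frame toy and its columns are made up for illustration.

```r
# Toy frequency table: 3 combinations with counts 2, 0 and 3
toy <- data.frame(g = c("a", "b", "c"), Freq = c(2, 0, 3))
# rep() repeats each row index Freq times, so a row with count k
# appears k times; rows with a zero count drop out entirely
long <- toy[rep(seq_len(nrow(toy)), toy$Freq), ]
# one row per observation
nrow(long) == sum(toy$Freq)
```

The same idiom is what turns the 32 rows of the Titanic table into one row per person.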

It is not so easy to read, is it? Did the majority of 1st class passengers (bottom category on the leftmost axis) survive or not? Definitely most of the women from that class did, but in aggregate?

At this point it would be nice, instead of drawing a bunch of lines, to draw segments for different groups of passengers. Later I learned that such a plot exists and even has a name: alluvial diagram. Alluvial diagrams seem to be related to Sankey diagrams, which were blogged about on R-bloggers recently, e.g. here. What is more, I was not alone in thinking about how to create such a thing with R, see for example here. Later still I found that what I needed is a “parallel set” plot, as it was called, and implemented, on CrossValidated here. That looks terrific to me; nevertheless, I would still prefer:

The axes to be vertical. If the variables correspond to measurements on different points in time, then we should have nice flows from left to right.
If only the segments could be smooth curves, e.g. splines or Bezier curves…

And so I wrote a prototype function alluvial (tadaaa!), now in a package alluvial on GitHub. I relied strongly on code by Aaron from his answer on CrossValidated (hat tip).

See the following examples of using alluvial on Titanic data:

First, using just two variables, Class and Survived, with the stripes drawn as simple polygons.

# load packages and prepare data
library(alluvial)
tit <- as.data.frame(Titanic)
# only two variables: class and survival status
tit2d <- aggregate( Freq ~ Class + Survived, data=tit, sum)
alluvial(tit2d[,1:2], freq=tit2d$Freq, xw=0.0, alpha=0.8,
         gap.width=0.1, col="steelblue", border="white",
         layer=tit2d$Survived != "Yes")

The function accepts data as (a collection of) vectors or data frames. The xw argument specifies the position of the knots of xspline relative to the axes. If positive, the knots are further away from the axes, which makes the stripes run horizontally for longer before turning towards the other axis; with xw=0.0, as above, the stripes are simple polygons. Argument gap.width specifies the distances between categories on the axes.
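To see what moving the knots does geometrically, here is a minimal sketch in base graphics. It is not the package's code: stripe_ctrl is a hypothetical helper, and the knot layout and shape values are assumptions chosen for illustration.

```r
# Hypothetical helper: control points of one stripe running from axis x0 to
# axis x1; xw shifts the inner knots away from the axes.
# left and right are c(lower, upper) extents of the stripe on each axis.
stripe_ctrl <- function(x0, x1, left, right, xw) {
  list(
    x = c(x0, x0 + xw, x1 - xw, x1, x1, x1 - xw, x0 + xw, x0),
    y = c(left[2], left[2], right[2], right[2],
          right[1], right[1], left[1], left[1])
  )
}

plot.new()
plot.window(xlim = c(0, 1), ylim = c(0, 1))
for (xw in c(0, 0.3)) {
  p <- stripe_ctrl(0, 1, left = c(0.1, 0.3), right = c(0.6, 0.9), xw = xw)
  # shape 0 gives a sharp corner at the knots on the axes,
  # shape 1 lets the curve bend smoothly near the inner knots
  xspline(p$x, p$y, shape = c(0, 1, 1, 0, 0, 1, 1, 0), open = FALSE,
          col = adjustcolor("steelblue", alpha.f = 0.4), border = "white")
}
```

With xw = 0 the inner knots coincide with the ones on the axes and the stripe comes out straight-edged; with xw = 0.3 it stays level near each axis and bends in between.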

Another example shows the whole Titanic data. Red stripes are for those who did not survive.

alluvial(tit[,1:4], freq=tit$Freq, border=NA,
         hide=tit$Freq < quantile(tit$Freq, .50),
         col=ifelse(tit$Survived == "No", "red", "gray"))

Now it's possible to see that, e.g.:

A bit more than 50% of 1st class passengers survived
Women who did not survive come almost exclusively from 3rd class, etc.

In this variant the stripes have no borders, color transparency is at 0.5, and for the purpose of the example the plot shows only the “thickest” 50% of the stripes (argument hide).
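Because the thinnest stripes are hidden, readings taken from the plot are approximate. The two observations listed above can be checked against the raw counts; a minimal sketch using only the built-in Titanic table and base R:

```r
data(Titanic)
# Survival counts by class, summed over sex and age
by_class <- apply(Titanic, c(1, 4), sum)
# Row-wise proportions: share of survivors within each class
surv_share <- prop.table(by_class, 1)[, "Yes"]
round(surv_share, 2)

# Women who did not survive: counts by class (summed over age),
# then each class as a share of all such women
f_no <- apply(Titanic[, "Female", , "No"], 1, sum)
round(f_no / sum(f_no), 2)
```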

As compared to the parallel set solution mentioned earlier, the main differences are:

Axes are vertical instead of horizontal
I used xspline to draw the “stripes”
With argument hide you can skip plotting selected groups of cases

If you have suggestions or ideas for extensions/modifications, let me know on GitHub!

Stay tuned for more examples from panel data.

visualization on Brokering Closure
