<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>visualization on Brokering Closure</title><link>https://blog.michalbojanowski.com/categories/visualization/</link><description>Recent content in visualization on Brokering Closure</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><copyright>Michał Bojanowski</copyright><lastBuildDate>Thu, 27 Mar 2014 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.michalbojanowski.com/categories/visualization/index.xml" rel="self" type="application/rss+xml"/><item><title>Alluvial diagrams</title><link>https://blog.michalbojanowski.com/2014/03/27/what-is-alluvial/</link><pubDate>Thu, 27 Mar 2014 00:00:00 +0000</pubDate><guid>https://blog.michalbojanowski.com/2014/03/27/what-is-alluvial/</guid><description>
&lt;p>A &lt;a href="http://en.wikipedia.org/wiki/Parallel_coordinates">parallel coordinates plot&lt;/a> is one of the tools for visualizing multivariate data. Every observation in a dataset is represented by a polyline that crosses a set of parallel axes corresponding to the variables in the dataset. You can create such plots in R with the function &lt;code>parcoord&lt;/code> from the package MASS. For example, we can create such a plot for the built-in dataset mtcars:&lt;/p>
&lt;pre class="r">&lt;code>library(MASS)
library(colorRamps)
data(mtcars)
k &amp;lt;- blue2red(100)
x &amp;lt;- cut( mtcars$mpg, 100)
op &amp;lt;- par(mar=c(3, rep(.1, 3)))
parcoord(mtcars, col=k[as.numeric(x)])&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://blog.michalbojanowski.com/post/what-is-alluvial/what-is-alluvial_files/figure-html/unnamed-chunk-1-1.png" alt="" width="672" />&lt;/p>
&lt;pre class="r">&lt;code>par(op)&lt;/code>&lt;/pre>
&lt;p>The lines are colored using a blue-to-red color ramp according to miles per gallon (&lt;code>mpg&lt;/code>, the first variable).&lt;/p>
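&lt;p>The color mapping in the chunk above works by binning &lt;code>mpg&lt;/code> into 100 intervals and using the bin number as an index into a 100-color palette. Here is a minimal sketch of that step on its own. It swaps in base R’s &lt;code>colorRampPalette&lt;/code> for &lt;code>blue2red&lt;/code> (an assumption, made only so that the sketch needs no extra packages), and the names &lt;code>pal&lt;/code>, &lt;code>bin&lt;/code> and &lt;code>line_col&lt;/code> are made up for the illustration:&lt;/p>
&lt;pre class="r">&lt;code>data(mtcars)
# 100-step palette from blue to red (base R stand-in for colorRamps::blue2red)
pal &amp;lt;- colorRampPalette(c(&amp;quot;blue&amp;quot;, &amp;quot;red&amp;quot;))(100)
# cut() splits the range of mpg into 100 equal-width bins and returns a factor;
# as.integer() turns the factor levels into bin numbers 1..100
bin &amp;lt;- as.integer(cut(mtcars$mpg, 100))
# one color per car: the lowest mpg falls in bin 1, the highest in bin 100
line_col &amp;lt;- pal[bin]
head(data.frame(mpg = mtcars$mpg, bin = bin, col = line_col))&lt;/code>&lt;/pre>
&lt;p>The vector &lt;code>line_col&lt;/code> plays the same role as &lt;code>k[as.numeric(x)]&lt;/code> in the call to &lt;code>parcoord&lt;/code>.&lt;/p>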
&lt;p>What to do if some of the variables are categorical? One approach is to use polylines of different widths. Another approach is to add some random noise (jitter) to the values. The Titanic data is a cross-classification of Titanic passengers according to class, gender, age, and survival status (survived or not). Consequently, all variables are categorical. Let’s try the jittering approach. After converting the cross-classification (an R &lt;code>table&lt;/code>) to a data frame, we “blow it up” by repeating observations according to their frequency in the table.&lt;/p>
&lt;pre class="r">&lt;code>data(Titanic)
# convert to data frame of numeric variables
titdf &amp;lt;- as.data.frame(lapply(as.data.frame(Titanic), as.numeric))
# repeat obs. according to their frequency
titdf2 &amp;lt;- titdf[ rep(1:nrow(titdf), titdf$Freq) , ]
# new columns with jittered values
titdf2[,6:9] &amp;lt;- lapply(titdf2[,1:4], jitter)
# colors according to survival status, with some transparency
k &amp;lt;- adjustcolor(RColorBrewer::brewer.pal(3, &amp;quot;Set1&amp;quot;)[titdf2$Survived], alpha=.2)
op &amp;lt;- par(mar=c(3, 1, 1, 1))
parcoord(titdf2[,6:9], col=k)&lt;/code>&lt;/pre>
&lt;div class="figure">&lt;span style="display:block;" id="fig:unnamed-chunk-2">&lt;/span>
&lt;img src="https://blog.michalbojanowski.com/post/what-is-alluvial/what-is-alluvial_files/figure-html/unnamed-chunk-2-1.png" alt="Red lines are for passengers who did not survive." width="672" />
&lt;p class="caption">
Figure 1: Red lines are for passengers who did not survive.
&lt;/p>
&lt;/div>
&lt;pre class="r">&lt;code>par(op)&lt;/code>&lt;/pre>
&lt;p>It is not so easy to read, is it? Did the majority of 1st class passengers (the bottom category on the leftmost axis) survive or not? Most of the women from that class definitely did, but what about the class in aggregate?&lt;/p>
&lt;p>At this point it would be nice, instead of drawing a bunch of lines, to draw segments for different groups of passengers. Later I learned that such a plot exists and even has a name: &lt;a href="http://en.wikipedia.org/wiki/Alluvial_diagram">alluvial diagram&lt;/a>. Alluvial diagrams seem to be related to Sankey diagrams, which were blogged about on R-bloggers recently, e.g. &lt;a href="http://www.r-bloggers.com/sankey-diagrams-with-googlevis/">here&lt;/a>. What is more, I was not alone in thinking about how to create such a thing with R, see for example &lt;a href="http://stackoverflow.com/questions/8222356/how-to-generate-a-graph-diagram-like-google-analyticss-visitor-flow">here&lt;/a>. Later I found that what I needed was a “parallel set” plot, as it was called, and implemented, on CrossValidated &lt;a href="http://stats.stackexchange.com/questions/12029/is-it-possible-to-create-parallel-sets-plot-using-r">here&lt;/a>. That looks terrific to me; nevertheless, I would still prefer:&lt;/p>
&lt;ul>
&lt;li>The axes to be vertical. If the variables correspond to measurements on different points in time, then we should have nice flows from left to right.&lt;/li>
&lt;li>The segments to be smooth curves, e.g. splines or Bezier curves.&lt;/li>
&lt;/ul>
&lt;p>And so I wrote a prototype function &lt;code>alluvial&lt;/code> (tadaaa!), now in the package &lt;a href="https://github.com/mbojan/alluvial">alluvial on GitHub&lt;/a>. I relied strongly on &lt;a href="http://stats.stackexchange.com/a/12036/31609">code by Aaron from his answer on CrossValidated&lt;/a> (hat tip).&lt;/p>
&lt;p>See the following examples of using &lt;code>alluvial&lt;/code> on Titanic data:&lt;/p>
&lt;p>First, just using two variables Class and Survival, and with stripes being simple polygons.&lt;/p>
&lt;pre class="r">&lt;code># load packages and prepare data
library(alluvial)
tit &amp;lt;- as.data.frame(Titanic)
# only two variables: class and survival status
tit2d &amp;lt;- aggregate( Freq ~ Class + Survived, data=tit, sum)
alluvial( tit2d[,1:2], freq=tit2d$Freq, xw=0.0, alpha=0.8,
gap.width=0.1, col= &amp;quot;steelblue&amp;quot;, border=&amp;quot;white&amp;quot;,
layer = tit2d$Survived != &amp;quot;Yes&amp;quot; )&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://blog.michalbojanowski.com/post/what-is-alluvial/what-is-alluvial_files/figure-html/unnamed-chunk-3-1.png" alt="" width="672" />&lt;/p>
&lt;p>The function accepts data as (a collection of) vectors or data frames. The &lt;code>xw&lt;/code> argument specifies the position of the knots of the xspline relative to the axes. If it is positive, the knots are further away from the axes, which makes the stripes run horizontally for longer before turning towards the other axis. The argument &lt;code>gap.width&lt;/code> specifies the distance between categories on the axes.&lt;/p>
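&lt;p>To make the role of &lt;code>xw&lt;/code> more concrete, here is a rough sketch of how a single stripe between two axes could be drawn with base R &lt;code>xspline&lt;/code>. It is an illustration under my own assumptions, not the code inside &lt;code>alluvial&lt;/code>: the helpers &lt;code>stripe_knots&lt;/code> and &lt;code>draw_stripe&lt;/code>, the choice of control points and the &lt;code>shape&lt;/code> values are all hypothetical.&lt;/p>
&lt;pre class="r">&lt;code># x positions of the four control points along one edge of a stripe
# running between axes at x0 and x1 (hypothetical helper)
stripe_knots &amp;lt;- function(x0, x1, xw) c(x0, x0 + xw, x1 - xw, x1)

# draw one stripe from the interval y0 on the left axis
# to the interval y1 on the right axis (hypothetical helper)
draw_stripe &amp;lt;- function(y0, y1, xw, x0 = 1, x1 = 2, ...) {
  kx &amp;lt;- stripe_knots(x0, x1, xw)
  # top edge from left to right, then bottom edge from right to left
  x &amp;lt;- c(kx, rev(kx))
  y &amp;lt;- c(y0[2], y0[2], y1[2], y1[2], y1[1], y1[1], y0[1], y0[1])
  # shape 0 keeps the corners on the axes sharp,
  # shape 1 lets the inner knots bend the edge smoothly
  xspline(x, y, shape = c(0, 1, 1, 0, 0, 1, 1, 0), open = FALSE, ...)
}

plot.new()
plot.window(xlim = c(1, 2), ylim = c(0, 1))
# the same stripe with knots close to the axes and further away from them
draw_stripe(c(0.6, 0.9), c(0.1, 0.4), xw = 0.1, col = &amp;quot;gray80&amp;quot;, border = NA)
draw_stripe(c(0.6, 0.9), c(0.1, 0.4), xw = 0.4,
            col = adjustcolor(&amp;quot;steelblue&amp;quot;, alpha.f = 0.6), border = NA)&lt;/code>&lt;/pre>
&lt;p>In this sketch the stripe with the larger &lt;code>xw&lt;/code> stays level for longer near both axes and makes its turn in a narrower band in the middle.&lt;/p>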
&lt;p>Another example shows the whole Titanic data, with red stripes for those who did not survive.&lt;/p>
&lt;pre class="r">&lt;code>alluvial(tit[,1:4], freq=tit$Freq, border=NA,
hide = tit$Freq &amp;lt; quantile(tit$Freq, .50),
col=ifelse( tit$Survived == &amp;quot;No&amp;quot;, &amp;quot;red&amp;quot;, &amp;quot;gray&amp;quot;))&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://blog.michalbojanowski.com/post/what-is-alluvial/what-is-alluvial_files/figure-html/unnamed-chunk-4-1.png" alt="" width="672" />&lt;/p>
&lt;p>Now it’s possible to see that, e.g.:&lt;/p>
&lt;ul>
&lt;li>A bit more than 50% of 1st class passengers survived&lt;/li>
&lt;li>Women who did not survive come almost exclusively from 3rd class, etc.&lt;/li>
&lt;/ul>
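&lt;p>Both readings can be cross-checked against the table itself. A small sketch using only base R, where the object names &lt;code>class_surv&lt;/code>, &lt;code>surv_share&lt;/code> and &lt;code>women_lost&lt;/code> are made up for the illustration:&lt;/p>
&lt;pre class="r">&lt;code>data(Titanic)
# class by survival status, collapsing sex and age
class_surv &amp;lt;- margin.table(Titanic, c(1, 4))
# share of survivors within each class (each row sums to 1)
surv_share &amp;lt;- prop.table(class_surv, 1)
round(surv_share, 2)

# women who did not survive, counted by class (collapsing age)
women_lost &amp;lt;- rowSums(Titanic[, &amp;quot;Female&amp;quot;, , &amp;quot;No&amp;quot;])
# share of each class among the women who did not survive
round(women_lost / sum(women_lost), 2)&lt;/code>&lt;/pre>
&lt;p>In this dataset 203 of the 325 1st class passengers survived (about 62%), and 106 of the 126 women who did not survive travelled in 3rd class (about 84%).&lt;/p>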
&lt;p>In this variant the stripes have no borders, color transparency is at 0.5, and for the purpose of the example the plot shows only the “thickest” 50% of the stripes (argument &lt;code>hide&lt;/code>).&lt;/p>
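&lt;p>To see what the &lt;code>hide&lt;/code> condition from the call above selects, one can inspect the logical vector before passing it to &lt;code>alluvial&lt;/code>. A small sketch, where the name &lt;code>hide_stripe&lt;/code> is mine:&lt;/p>
&lt;pre class="r">&lt;code>tit &amp;lt;- as.data.frame(Titanic)
# TRUE for stripes whose frequency is below the median frequency
hide_stripe &amp;lt;- tit$Freq &amp;lt; quantile(tit$Freq, .50)
# how many stripes are hidden and how many are drawn
table(hide_stripe)
# share of all passengers carried by the stripes that remain visible
sum(tit$Freq[!hide_stripe]) / sum(tit$Freq)&lt;/code>&lt;/pre>
&lt;p>Because the hidden stripes are the thin ones, dropping them removes much less than half of the passengers from the picture.&lt;/p>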
&lt;p>As compared to the parallel set solution mentioned earlier, the main differences are:&lt;/p>
&lt;ul>
&lt;li>Axes are vertical instead of horizontal&lt;/li>
&lt;li>I used &lt;code>xspline&lt;/code> to draw the “stripes”&lt;/li>
&lt;li>With the argument &lt;code>hide&lt;/code> you can skip plotting selected groups of cases&lt;/li>
&lt;/ul>
&lt;p>If you have suggestions or ideas for extensions/modifications, let me know on &lt;a href="https://github.com/mbojan/alluvial">Github&lt;/a>!&lt;/p>
&lt;p>Stay tuned for more examples from panel data.&lt;/p></description></item></channel></rss>