Introduction

Oftentimes, I run into difficulty trying to explain some of the concepts of statistical sampling with audiences that either have very limited or no understanding of statistics. Given that the majority of communication of analysis has to be digested in a 1-2 hour meeting, data visualization typically is the best route for communicating my methods.

One such case is unboxing how sampling works based on your presumed population and using that sampling to build a monte carlo simulation. I’m a big fan of interactive visualizations or animations that break down in detail the underpinnings of an analysis. The basis of this post is to demonstrate using ggplot2 to construct the frames of the animation and ImageMagick to combine the frames into a gif of a simple monte carlo simulation.

Generate the Data

First, I’m going to use base R’s random sampling functions for the Poisson and the Negative Binomial to generate samples given the presumed parameters. The numbers are then added together to show a very basic monte carlo simulation. In addition, a “slice” of the data is taken that I’m going to use later to break down each step of the simulation.

library(ggplot2)
library(trstyles)
library(dplyr)
library(tidyr)
set.seed(82)
n <- 10000
mcHist <- data_frame(Poisson = rpois(n, 3), 
                  NegBinom = rnbinom(n, 5, .5)) %>%
  mutate(Simulation = Poisson + NegBinom) %>%
  gather(Distribution, Value) %>%
  mutate(Distribution = sub("NegBinom", "Negative Binomial", Distribution))

mcSample <- mcHist %>%
  group_by(Distribution) %>%
  slice(1:200) %>%
  gather(Distribution, Value) %>%
  group_by(Distribution) %>%
  mutate(rowNum = row_number(Distribution))

lapply(split(mcHist[['Value']], mcHist[['Distribution']]), summary)
$`Negative Binomial`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   5.017   7.000  25.000 

$Poisson
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   2.000   3.000   2.991   4.000  12.000 

$Simulation
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   5.000   8.000   8.008  10.000  29.000 

Histogram of Expected Results

The background image that I want to show is what is the theoretical model that we are moving towards. I created a large sample of 10,000 so that random number generators will give a close approximation of the population (the presumed distributions). I chose discrete distributions above so that it will be easy to plot histograms without having to worry about the bin widths. With a little tweaking using facet_grid and geom_text, I can lay out each distribution and annotate the calculation.

g <- ggplot(mcHist) +
  geom_histogram(aes(Value, ..density.., color = Distribution), 
                 binwidth = 1, alpha = .3, fill = "transparent") +
  scale_color_tr(guide = FALSE) +
  facet_grid(~Distribution) +
  theme_tr() + 
  coord_cartesian(ylim = c(0, .23)) +
  labs(x = "", y = "Frequency") +
  theme(panel.grid = element_blank(), 
        strip.text = element_text(size = 16, hjust  = 0.1),
        axis.text.x = element_blank()) +
  geom_text(x = rep(21, 3*n), y = rep(.1, 3*n), 
            label = rep(c("+", "=", ""), each = n), size = 24)
g

Create Frames for Animation

Now that I have my starting frame to show each individual sample, I am going to use geom_dotplot to plot over the histogram. To help track the simulation moving through time, the latest sample will be colored red. The for loop iterates over each row number, filters to the current state of the simulation, and updates what is to be plotted. Then the handy ggsave saves each frame for the animated gif.

for (i in 1:length(unique(mcSample$rowNum))) {
  dataUpdate <- mcSample %>%
    group_by(Distribution) %>%
    filter(rowNum %in% 1:i) %>%
    group_by(Distribution) %>%
    mutate(Last = rowNum == i)
  gUpdate <- g +
  geom_dotplot(data = dataUpdate,
               aes(Value, fill = Last), color = NA, 
               binwidth = 1, method = "histodot",
               dotsize = .6) +
    scale_fill_manual(guide = FALSE, values = c("black", "red")) 
  #gUpdate  
  ggsave(filename = paste0('frames/plot', sprintf("%03d",i) , '.jpg'),
         plot = gUpdate, device = 'jpeg')
}

Convert Using ImageMagick

All the heavy lifting is over now. I just need to use ImageMagick to process the frames and convert to a gif. You can use a call to the command line here, but my personal preference is to just switch to the command line. ImageMagick is a standalone utility software that you will need to install separately from R.

convert -delay 20 -loop 0 *.jpg ../mcsim_animation.gif

I now have an animation that I can use to explain my distribution assumptions and show how each sample updates the model. It’s pretty clear over time that the sampling will eventually converge to the theoretical model.