Day 4: Visualization

Jerry Davison and Martin Morgan

____________________________________________________________________________________________

Contents

1 Data Input and Manipulation (recap)
2 Graphics in R
3 Trellis graphics: lattice
4 Other fun graphics
5 Homework!
6 Resources

_______________________________________________________________________________________________

1 Data Input and Manipulation (recap)

A brief review of day 1 material.

_______________________________________________________________________________________________

2 Graphics in R

R provides a number of ways to visualize data. Data visualization can be an important part of any analysis. It also provides an opportunity to use our data manipulation skills.

The data is from the Center for Disease Control’s Behavioral Risk Factor Surveillance System (BRFSS) annual survey. Check out the web page for a little more information. We are using a small subset of this data, including a random sample of 10000 observations from each of 1990 and 2010.

1.
Input the data using read.csv, creating a variable brfss to hold it. Use file.choose(), the data file is 'BRFSS-subset.csv'.
2.
Explore the data using class, dim, head, summary, etc. Use xtabs to summarize the number of males and females in the study, in each of the two years.
3.
An expression like xtabs(Weight ˜ Sex, brfss) calculates the sum of the Weight column, separately for each Sex. Can you divide this sort of cross tabulation by the result of another xtabs function to calculate average weight per sex in each year of the study?
4.
Create Figure 1 using the plot function and the main argument. Note the transformed Y-axis. Experiment with different plotting symbols (try the command example(points).
pict
Figure 1: Square root of Weight vs. Height, all data

5.
In the previous plot, color the female and male points differently. To do this, use the col argument to plot. Provide as a value to that argument a vector of colors, subset by brfss$Sex.
6.
Create a subset of the data containing only observations from 2010.
  > brfss2010 <- brfss[brfss$Year == "2010", ]
7.
Create Figure 2 (two panels in a single figure). Do this by using the par function with the mfcol argument before calling plot. You’ll need to create two more subsets of data, perhaps when you are providing the data to the function plot.
pict
Figure 2: Square root of Weight vs. Height, 2010 Female (left) and Male (right).

8.
Plotting large numbers of points means that they are often over-plotted, potentially obscuring important patterns. Experiment with arguments to plot to address over-plotting, e.g., pch=’.’ or alpha=.4. Try using the smoothScatter function (the data have to be presented as x and y, rather than as a formula). Try adding the hexbin library to your R session (using library) and creating a hexbinplot.

_______________________________________________________________________________________________

3 Trellis graphics: lattice

R has a number of additional plotting facilities, both as part of the ‘base’ distribution and user-contributed packages. The lattice package adopts a particular philosophy to the presentation of data, and can be a very effective visualization tool.

1.
Use library to load the lattice package.
  > library(lattice)
2.
Create Figure 3 using the xyplot function with a formula and the brfss2010 data. The formula is sqrt(Weight) ˜ Height | Sex, which can be read as ‘square root of Weight as a function of Height, conditioned on Sex’.
pict
Figure 3: Square root of Weight vs. Height, 2010 Female (left) and Male (right).

3.
Add a background grid and a regression line to each panel using the argument type=c(’p’, ’g’, ’r’); change the width (lwd) and color (col.line) of the regression line.
4.
Create Figure 4. Use the densityplot function with the formula ˜ sqrt(Weight). The group=Sex function argument creates separate lines for each sex. Be sure to use plot.points=FALSE to avoid a ‘rug’ of points at the base of the figure. Can you add a key (legend)?
pict
Figure 4: Density of square roots of 2010 Weights, by Sex.

5.
Create the left panel of Figure 5 using the bwplot. The formula requires that Year be coerced to a factor class object, factor(Year).
6.
Create the right panel of Figure 5, a violin plot, using bwplot and the panel argument set to panel.violin. panel is a function that determines how each panel is drawn.
pict pict
Figure 5: Box-and-whiskers (left) and violin (right) plots of Weight as a function of Sex.

7.
(Advanced) We can write our own panel argument to lattice functions to influence how each panel is displayed. Here we add a point at the median age and weight.
  > xyplot(sqrt(Weight) ~ Height|Sex, brfss2010,
  +     panel = function(x, y, ...) {
  +         panel.xyplot(x, y, ...)
  +         panel.points(median(x, na.rm=TRUE), median(y, na.rm=TRUE),
  +             cex=2, pch=20, col="red")
  +     },
  +     type=c("p", "g", "r"), lwd=2, col.line="red", xlim=c(120, 210))

_______________________________________________________________________________________________

4 Other fun graphics

The grammar of graphics: ggplot2 The ggplot2 user-contributed package produces very pretty looking graphics.

  > library(ggplot2)

ggplot2 takes a different approach to constructing figures. The idea is that one starts with a plot (created by ggplot, specifying the data source). The plot is an object. One then adds layers (e.g., geom_density(), to create a density plot) and aesthetics (e.g., aes(), specifying which data to plot. A basic command might look like

  > ggplot(brfss2010) + geom_density() + aes(sqrt(Weight), fill=Sex)

1.
Create Figure 6. You’ll want to add the alpha argument to geom_density, use the scale_x_continuous and scale_y_continuous functions, and the theme function with argument legend.position=’top’.
pict
Figure 6: Weight in Female and Male samples from 2010.

Web-based graphics Explore the googleVis package, e.g., example(gvisMotion); example(gvisGeoMap); these are used to excellent effect in Hans Rosling’s TED presentation on poverty.

_______________________________________________________________________________________________

5 Homework!

The datasets you’ll access in this exercise are available in R for just this type of practice.

_______________________________________________________________________________________________

6 Resources