1 Data Input and Manipulation (recap)

2 Graphics in R

3 Trellis graphics: lattice

4 Other fun graphics

5 Homework!

6 Resources

2 Graphics in R

3 Trellis graphics: lattice

4 Other fun graphics

5 Homework!

6 Resources

_______________________________________________________________________________________________

A brief review of day 1 material.

_______________________________________________________________________________________________

R provides a number of ways to visualize data. Data visualization can be an important part of any analysis. It also provides an opportunity to use our data manipulation skills.

The data is from the Center for Disease Control’s Behavioral Risk Factor Surveillance System (BRFSS) annual survey. Check out the web page for a little more information. We are using a small subset of this data, including a random sample of 10000 observations from each of 1990 and 2010.

- 1.
- Input the data using read.csv, creating a variable brfss to hold it. Use file.choose(), the data ﬁle is 'BRFSS-subset.csv'.
- 2.
- Explore the data using class, dim, head, summary, etc. Use xtabs to summarize the number of males and females in the study, in each of the two years.
- 3.
- An expression like xtabs(Weight ˜ Sex, brfss) calculates the sum of the Weight column, separately for each Sex. Can you divide this sort of cross tabulation by the result of another xtabs function to calculate average weight per sex in each year of the study?
- 4.
- Create Figure 1 using the plot function and the main argument. Note the transformed Y-axis. Experiment with
diﬀerent plotting symbols (try the command example(points).

- 5.
- In the previous plot, color the female and male points diﬀerently. To do this, use the col argument to plot. Provide as a value to that argument a vector of colors, subset by brfss$Sex.
- 6.
- Create a subset of the data containing only observations from 2010.
- 7.
- Create Figure 2 (two panels in a single ﬁgure). Do this by using the par function with the mfcol argument
before calling plot. You’ll need to create two more subsets of data, perhaps when you are providing the data to
the function plot.

- 8.
- Plotting large numbers of points means that they are often over-plotted, potentially obscuring important patterns. Experiment with arguments to plot to address over-plotting, e.g., pch=’.’ or alpha=.4. Try using the smoothScatter function (the data have to be presented as x and y, rather than as a formula). Try adding the hexbin library to your R session (using library) and creating a hexbinplot.

_______________________________________________________________________________________________

R has a number of additional plotting facilities, both as part of the ‘base’ distribution and user-contributed packages. The lattice package adopts a particular philosophy to the presentation of data, and can be a very eﬀective visualization tool.

- 1.
- Use library to load the lattice package.
- 2.
- Create Figure 3 using the xyplot function with a formula and the brfss2010 data. The formula is
sqrt(Weight) ˜ Height | Sex, which can be read as ‘square root of Weight as a function of Height,
conditioned on Sex’.

- 3.
- Add a background grid and a regression line to each panel using the argument type=c(’p’, ’g’, ’r’); change the width (lwd) and color (col.line) of the regression line.
- 4.
- Create Figure 4. Use the densityplot function with the formula ˜ sqrt(Weight). The group=Sex function
argument creates separate lines for each sex. Be sure to use plot.points=FALSE to avoid a ‘rug’ of points at the
base of the ﬁgure. Can you add a key (legend)?

- 5.
- Create the left panel of Figure 5 using the bwplot. The formula requires that Year be coerced to a factor class object, factor(Year).
- 6.
- Create the right panel of Figure 5, a violin plot, using bwplot and the panel argument set to panel.violin.
panel is a function that determines how each panel is drawn.

- 7.
- (Advanced) We can write our own panel argument to lattice functions to inﬂuence how each panel is displayed. Here we add a point at the median age and weight.

_______________________________________________________________________________________________

The grammar of graphics: ggplot2 The ggplot2 user-contributed package produces very pretty looking graphics.

ggplot2 takes a diﬀerent approach to constructing ﬁgures. The idea is that one starts with a plot (created by ggplot, specifying the data source). The plot is an object. One then adds layers (e.g., geom_density(), to create a density plot) and aesthetics (e.g., aes(), specifying which data to plot. A basic command might look like

- 1.
- Create Figure 6. You’ll want to add the alpha argument to geom_density, use the scale_x_continuous and
scale_y_continuous functions, and the theme function with argument legend.position=’top’.

Web-based graphics Explore the googleVis package, e.g., example(gvisMotion); example(gvisGeoMap); these are used to excellent eﬀect in Hans Rosling’s TED presentation on poverty.

_______________________________________________________________________________________________

The datasets you’ll access in this exercise are available in R for just this type of practice.

- Explore the quakes data frame. What is the range of the values in each of the columns?
- Have a look at the iris data. How many measurements are there of petal length for each species? Assume the sepals are rectangles – what is the median sepal area of each species? Its distribution for each species?
- What’s the source of the mtcars data? The weight in nanograms of the Datsun 710? Present a table for every letter in the alphabet, of the average mpg of the car models with that letter as the second letter in their name. Is the mpg of models with ’o’ as the second letter signiﬁcantly diﬀerent from those with ’e’ as the second letter? Interpret the result.

_______________________________________________________________________________________________