Day 2: Basic statistics

_______________________________________________________________________________________________

1 Data

Today we’ll cover statistical concepts and tests commonly used in cancer research. The dataset we’ll access is a subset of the ALL expression data whose patient information we worked with in the ﬁrst day’s material. In addition to that information we’ll access 1000 associated expression microarray features that present the highest variance across the patient samples. The data have been saved in a binary format to reduce ﬁle sizes.

  > ### Go to your data directory to get these files:
  > dataFile = file.choose() # ALL1k.rda
  > dataName = load(file=dataFile) # "ALL1k" is the data frame's name
  > infoFile = file.choose() # ALL1k_pData.rda
  > infoName = load(file=infoFile) # "info" is the data frame's name

_______________________________________________________________________________________________

2 Descriptive statistics

> median(info$age, na.rm=TRUE)

> range(info$age, na.rm=TRUE)

> quantile(info$age, na.rm=TRUE)

0% 25% 50% 75% 100%
5.0 19.0 29.0 45.5 58.0

  > plot(info$age)
  > plot(sort(info$age))
  > sortedAge = sort(info$age)
  > ?plot

∙Exercise: Plot the sortedAge with markers at each data point and connect the points with red lines.

∙Exercise: Plot one variable (e.g., age) as a function of another (e.g., sex). Since sex is a factor, R chooses to create a box plot; does this make sense?

> plot(age ~ sex, info)

  > hist(info$age)
  > ?hist
  > hist(info$age, br=25)

∙ Cross tables use formulas to describe the relationship between the data they present:

> xtabs(~sex, data=info, exclude=NA)

  sex
   F  M
  42 83

> xtabs(~sex + remission, data=info, exclude=NA)

     remission
  sex CR REF
    F 27   7
    M 71   8

∙Exercise: How many hyperdiploid males are refractory? Hint: review the ’doc’ data frame from last week’s lesson for column descriptions.

_______________________________________________________________________________________________

3 t-tests, linear models, and chi-square tests

3.1 t-tests

Use plot to visualize the distribution of female and male ages in the info data set.

> plot(age ~ sex, info)

> t.test(age ~ sex, info)

   Welch Two Sample t-test

  data:  age by sex
  t = 1.6034, df = 79.88, p-value = 0.1128
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval:
   -1.022660  9.504142
  sample estimates:
  mean in group F mean in group M
         35.16667        30.92593

What are all those additional arguments to t.test? For example, what is the meaning of the var.equal argument?Why are there 79.88 degrees of freedom?

> t.test(age ~ sex, info, var.equal=TRUE)

   Two Sample t-test

  data:  age by sex
  t = 1.6266, df = 121, p-value = 0.1064
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval:
   -0.9207109  9.4021924
  sample estimates:
  mean in group F mean in group M
         35.16667        30.92593

3.2 Linear models

A t-test can also be viewed as an analysis of variance (ANOVA); analysis of variance is a form of linear model. Use lm to ﬁt a linear model that describes how age changes with sex; the anova function summarizes the linear model in a perhaps more familiar ANOVA table.

> (fit <- lm(age ~ sex, info))

  Call:
  lm(formula = age ~ sex, data = info)

  Coefficients:
  (Intercept)         sexM
       35.167       -4.241

  Analysis of Variance Table

  Response: age
             Df  Sum Sq Mean Sq F value Pr(>F)
  sex         1   497.4  497.41  2.6459 0.1064
  Residuals 121 22747.4  187.99

What kinds of assumptions are being made in the linear model, e.g., about equality of variances? Try plotting fit; what are the ﬁgures trying to tell you?

plot is an example of an R generic; it has diﬀerent methods implemented for diﬀerent classes of objects. Use methods to see available methods

   [1] plot.acf*           plot.data.frame*    plot.decomposed.ts* plot.default
   [5] plot.dendrogram*    plot.density*       plot.ecdf           plot.factor*
   [9] plot.formula*       plot.function       plot.hclust*        plot.histogram*
  [13] plot.HoltWinters*   plot.isoreg*        plot.lm*            plot.medpolish*
  [17] plot.mlm*           plot.ppr*           plot.prcomp*        plot.princomp*
  [21] plot.profile.nls*   plot.spec*          plot.stepfun        plot.stl*
  [25] plot.table*         plot.ts             plot.tskernel*      plot.TukeyHSD*

     Non-visible functions are asterisked

Fitted models can be used in other functions, for instance to predict values for new data. Construct a data.frame with a single column sex with values "M" and "F". Consult the help page for the predict.lm method, and calculate the expected value of the ﬁtted model for males and for females.

> df = data.frame(sex=c("F", "M"))
> predict(fit, df)

1 2
35.16667 30.92593

What do the predicted values correspond to in the t.test? Use coefficients to extract the coeﬃcients of the ﬁtted model.

(Intercept) sexM
35.166667 -4.240741

Interpret the (Intercept) and sexM coeﬃcients in terms of female and male ages.

3.3 Chi-squared tests

The article from which the info object is derived states that “Although chromosome translocations and molecular rearrangements are relatively infrequent in T-lineage ALL, these events occur commonly in B-lineage ALL and reﬂect distinct mechanisms of transformation”. Let’s investigate this statement.

> summary(info[,c("BT", "cyto.normal")])

         BT     cyto.normal
   B2     :36   Mode :logical
   B3     :23   FALSE:69
   B1     :19   TRUE :24
   T2     :15   NA's :35
   B4     :12
   T3     :10
   (Other):13

  > old <- levels(info$BT)
  > new <- substr(old, 1, 1)
  > (map <- setNames(new, old))

B B1 B2 B3 B4 T T1 T2 T3 T4
"B" "B" "B" "B" "B" "T" "T" "T" "T" "T"

The names of the map are the subtypes, the elements are the types. Map the levels of the BT variable from sub-type to type with the following command, and cross-tabulate the data

> levels(info$BT) <- map
> xtabs(~ BT + cyto.normal, info)

     cyto.normal
  BT  FALSE TRUE
    B    56   17
    T    13    7

The data are qualitatively consistent with the statement that molecular rearrangements are more common in B-lineage ALL. Let’s test this with a chi-squared test

> chisq.test(info$BT, info$cyto.normal)

   Pearson's Chi-squared test with Yates' continuity correction

  data:  info$BT and info$cyto.normal
  X-squared = 0.5962, df = 1, p-value = 0.44

Interpret the results. What about additional parameters documented on ?chisq.test?

_______________________________________________________________________________________________

4 Higher dimensional data

Earlier we read in the ALL1k object. This is a data frame. The rows represent 1000 microarray probe sets. The ﬁrst two columns contain information about the probe sets. The remaining 128 columns are (normalized) expression values of 128 samples. The 128 samples correspond to the phenotypic data in the info object we’ve been working with. The following lines take the expression values and transform them from a data.frame to a matrix, adjusting the row and column names to reﬂect the probe and sample ids.

> m <- as.matrix(ALL1k[, -c(1, 2)])

Some of the results below involve plots, and it’s convenient to choose pretty and functional colors. We use the RColorBrewer package; see colorbrewer.com.

  > library(RColorBrewer)
  > divergent <- brewer.pal(11, "RdBu")
  > highlight <- brewer.pal(3, "Set2")[1:2]

‘divergent’ is a vector of colors that go from red (negative) to blue (positive). ‘higlight’ is a vector of length 2, light and dark green.

4.1 Principle components

It is hard to visualize data from 1000 probesets and 128 samples. In addition, the data suﬀers from the problem that there are many more measurements per sample (1000) than there are samples (128). For these reasons, we might wish to reduce the number of dimensions in which the data is represented. One way of reducing the dimensionality is by calculating principle components. Here we calculate principle components:

> tm = t(m)
> pc <- prcomp(tm)

The principle components are, roughly, vectors that pass through the data in such a way as to explain as much variability as possible. We can visualize the data in two-dimensional space by plotting the ‘rotation’, coloring points by whether they belong to the B or T ALL lineage (we will discuss other arguments to plot in a subsequent class).

> color <- highlight[info$BT]
> plot(pc$x, col=color, pch=20, cex=2)

There are a number of variants to principle components, and some tricky statistical issues involved even in the simple example above, so proceed with caution!

4.2 Clustering: hclust

Clustering, or ‘unsupervised machine learning’, tries to group samples based on similarity. We start by calculating the correlation between each of our samples.

cm is a 128 x 128 matrix, and it measures the correlation between each pair of samples – samples with similar patterns of gene expression will be correlated.

We can use the strength of correlation as a basis for describing the distance between samples. Two samples will be ‘near’ each other if they have similar patterns of correlation with other samples.

There are several ways of measuring distance; the default is a euclidean measure.

The next step is to group (cluster) samples that are similar to one another into a dendrogram summarizing similarity.

> cl <- hclust(d)
> plot(cl)

Somewhat more revealing of structure is a heatmap, with rows and columns clustered as we’ve just described, and with a column color bar to indicate whether the sample came from a B or T lineage sample.

> color <- highlight[info$BT]
> heatmap(cm, ColSideColor=color, col=divergent)

The plot shows that, mostly, the B lineage samples cluster together, and the T lineages cluster together. If this were an exploratory analysis, one would look carefully to ensure that the mis-clustered samples were not mis-labeled.

4.3 Classiﬁcation: kmeans

Classiﬁcation (‘supervised machine learning’) groups samples into a pre-speciﬁed number of groups. There are many varieties of classiﬁcation. k-means classiﬁcation aims to place k ‘centroids’ in such a way that the sum of squares of each point to the nearest centroid is minimized. Here we perform k-means classiﬁcation with 2 groups, anticipating that we will recover the B and T ALL lineages; we use set.seed to set the random number seed to a particular (arbitrary) value to ensure reproducibility.

> set.seed(123)
> cl <- kmeans(t(m), 2)

The ‘confusion’ matrix is one measure of how eﬀective the classiﬁcation is; here we see that all but one of the samples were assigned to the correct group.

> xtabs(~ cl$cluster + info$BT)

            info$BT
  cl$cluster  B  T
           1 94  0
           2  1 33

There are objective criteria for parameter choice and validation of supervised machine learning algorithms.

_______________________________________________________________________________________________

5 Lessons learned

There are several important lessons from this brief tour of statistical functions in R:

_______________________________________________________________________________________________

6 Homework

In the next class we’ll use data from the Center for Disease Control’s Behavioral Risk Factor Surveillance System (BRFSS) annual survey. Check out the web page for a little more information. We are using a small subset of this data, including a random sample of 10000 observations from each of 1990 and 2010. As preparation for the class:

_______________________________________________________________________________________________