_______________________________________________________________________________________________
Today we’ll cover statistical concepts and tests commonly used in cancer research. The dataset we’ll access is a subset of the ALL expression data whose patient information we worked with in the first day’s material. In addition to that information we’ll access 1000 associated expression microarray features that present the highest variance across the patient samples. The data have been saved in a binary format to reduce file sizes.
_______________________________________________________________________________________________
Let’s look a little more closely at patient information in the info file:
∙ Some simple plots of patient ages – note the nested functions!
∙ Exercise: Plot the sortedAge with markers at each data point and connect the points with red lines.
∙ Exercise: Plot one variable (e.g., age) as a function of another (e.g., sex). Since sex is a factor, R chooses to create a box plot; does this make sense?
∙ Histograms, and their display options:
∙ Cross tables use formulas to describe the relationship between the data they present:
∙ Exercise: How many hyperdiploid males are refractory? Hint: review the ’doc’ data frame from last week’s lesson for column descriptions.
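As a rough illustration of the plotting and cross-tabulation steps listed above, here is a minimal sketch. It uses a small simulated stand-in for the info data frame; the values and the relapse column are made up for illustration and are not the course data.

```r
# Simulated stand-in for 'info' (hypothetical values, NOT the course data)
set.seed(1)
info <- data.frame(
    age = round(runif(40, min = 5, max = 55)),
    sex = factor(sample(c("F", "M"), 40, replace = TRUE)),
    relapse = factor(sample(c("no", "yes"), 40, replace = TRUE))  # made-up column
)

# Nested functions: sort() is evaluated first, then plot() draws its result
plot(sort(info$age))

# Histogram with a few display options
hist(info$age, breaks = 10, col = "grey",
     main = "Patient ages", xlab = "Age (years)")

# Cross table: the right-hand side of the formula names the variables to count
tab <- xtabs(~ sex + relapse, data = info)
tab
```

The formula has nothing on its left-hand side because xtabs is counting rows rather than summing a response variable.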
_______________________________________________________________________________________________
Use plot to visualize the distribution of female and male ages in the info data set.
It looks like females are on average older than males. Use t.test to find out.
Check out the help page for t.test.
What are all those additional arguments to t.test? For example, what is the meaning of the var.equal argument? Why are there 79.88 degrees of freedom?
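One way to explore these questions is to run the test both ways on simulated data. The sketch below assumes a made-up info data frame with unequal group variances (not the course data). By default t.test does not assume equal variances and uses the Welch approximation to the degrees of freedom, which is generally not a whole number.

```r
# Hypothetical ages with unequal spread in the two groups (NOT the course data)
set.seed(42)
info <- data.frame(
    sex = factor(rep(c("F", "M"), times = c(30, 50))),
    age = c(rnorm(30, mean = 35, sd = 14), rnorm(50, mean = 30, sd = 8))
)

welch  <- t.test(age ~ sex, data = info)                    # default: var.equal = FALSE
pooled <- t.test(age ~ sex, data = info, var.equal = TRUE)  # classical pooled-variance test

welch$parameter    # Welch-approximated df, usually fractional
pooled$parameter   # n1 + n2 - 2 = 78
```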
A t-test can also be viewed as an analysis of variance (ANOVA); analysis of variance is a form of linear model. Use lm to fit a linear model that describes how age changes with sex; the anova function summarizes the linear model in a perhaps more familiar ANOVA table.
What kinds of assumptions are being made in the linear model, e.g., about equality of variances? Try plotting fit; what are the figures trying to tell you?
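The link between the t-test and the linear model can be checked numerically. This sketch again uses simulated ages (an assumption, not the course data) and shows that the F statistic from anova equals the square of the pooled-variance t statistic. The four diagnostic plots drawn by plot(fit) help judge whether the residuals look homogeneous and roughly normal.

```r
# Hypothetical data (NOT the course data)
set.seed(42)
info <- data.frame(
    sex = factor(rep(c("F", "M"), times = c(30, 50))),
    age = c(rnorm(30, mean = 35, sd = 14), rnorm(50, mean = 30, sd = 8))
)

fit <- lm(age ~ sex, data = info)
aov_table <- anova(fit)
aov_table

# The linear model assumes equal variances, so compare with var.equal = TRUE
tt <- t.test(age ~ sex, data = info, var.equal = TRUE)
all.equal(unname(tt$statistic)^2, aov_table[["F value"]][1])

# Diagnostic plots, arranged in a 2 x 2 grid
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)
```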
fit is an example of an R object. Find out its class with class(fit).
plot is an example of an R generic; it has different methods implemented for different classes of objects. Use methods to see the available methods, e.g., methods(plot).
Look up the help page for the plot generic, lm method with ?plot.lm
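A short sketch of these introspection tools follows. To keep it self-contained it fits a model to R's built-in cars data set rather than to the info object, so the fitted model itself is only a placeholder.

```r
# Placeholder model on a built-in data set (not the course's 'fit')
fit <- lm(dist ~ speed, data = cars)

class(fit)              # "lm"

# Methods available for the plot generic (one entry per supported class)
plot_methods <- methods(plot)
head(plot_methods)

# Generics that have a method for objects of class "lm"
methods(class = "lm")
```

When plot(fit) is called, R looks at class(fit) and dispatches to the matching method, which is why its help page is found under ?plot.lm rather than ?plot.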
Fitted models can be used in other functions, for instance to predict values for new data. Construct a data.frame with a single column sex with values "M" and "F". Consult the help page for the predict.lm method, and calculate the expected value of the fitted model for males and for females.
What do the predicted values correspond to in the t.test? Use coefficients to extract the coefficients of the fitted model.
Interpret the (Intercept) and sexM coefficients in terms of female and male ages.
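To check your interpretation, it can help to compare predictions and coefficients against plain group means. The sketch below does this on simulated ages (hypothetical values, not the course data), assuming R's default treatment contrasts with "F" as the reference level.

```r
# Hypothetical data (NOT the course data)
set.seed(42)
info <- data.frame(
    sex = factor(rep(c("F", "M"), times = c(30, 50))),
    age = c(rnorm(30, mean = 35, sd = 14), rnorm(50, mean = 30, sd = 8))
)
fit <- lm(age ~ sex, data = info)

# New data: one row per sex, using the same factor levels as the model
newdata <- data.frame(sex = factor(c("M", "F"), levels = levels(info$sex)))
pred <- predict(fit, newdata)

group_means <- tapply(info$age, info$sex, mean)
cf <- coefficients(fit)

pred          # compare with group_means
group_means
cf            # (Intercept) and sexM
```

With these contrasts, the predictions match the group means, the intercept matches the female mean, and sexM matches the male mean minus the female mean.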
The article from which the info object is derived states that “Although chromosome translocations and molecular rearrangements are relatively infrequent in T-lineage ALL, these events occur commonly in B-lineage ALL and reflect distinct mechanisms of transformation”. Let’s investigate this statement.
The relevant columns of data are summarized as
Simplify the number of BT levels by creating a map between subtypes and types
The names of the map are the subtypes, the elements are the types. Map the levels of the BT variable from sub-type to type with the following command, and cross-tabulate the data
The data are qualitatively consistent with the statement that molecular rearrangements are more common in B-lineage ALL. Let’s test this with a chi-squared test
Interpret the results. What about additional parameters documented on ?chisq.test?
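The mapping and test steps can be sketched as follows. The subtype labels, the transl column, and the simulated frequencies are all assumptions made for illustration; the real info object has its own columns and counts.

```r
set.seed(7)

# Map from subtype (names) to type (elements); labels are illustrative
map <- c(B1 = "B", B2 = "B", B3 = "B", T1 = "T", T2 = "T")

# Simulated data: rearrangements made more frequent in B lineage (made-up rates)
BT <- factor(sample(names(map), 120, replace = TRUE))
p_yes <- ifelse(map[as.character(BT)] == "B", 0.6, 0.2)
info <- data.frame(
    BT = BT,
    transl = factor(ifelse(runif(120) < p_yes, "yes", "no"))  # hypothetical column
)

# Re-label the factor levels; duplicated new labels merge the old levels
info$type <- info$BT
levels(info$type) <- unname(map[levels(info$type)])

tab <- xtabs(~ type + transl, data = info)
tab

res <- chisq.test(tab)   # for 2 x 2 tables, correct = TRUE applies Yates' correction
res
```

Among the additional arguments, correct toggles the continuity correction and simulate.p.value requests a Monte Carlo p-value, which can be useful when expected counts are small.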
_______________________________________________________________________________________________
Earlier we read in the ALL1k object. This is a data frame. The rows represent 1000 microarray probe sets. The first two columns contain information about the probe sets. The remaining 128 columns are (normalized) expression values of 128 samples. The 128 samples correspond to the phenotypic data in the info object we’ve been working with. The following lines take the expression values and transform them from a data.frame to a matrix, adjusting the row and column names to reflect the probe and sample ids.
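The data.frame-to-matrix conversion might look like the sketch below. It builds a tiny made-up ALL1k with two annotation columns and four samples; the column names probe_id and symbol are hypothetical, so adapt them to the real object.

```r
set.seed(3)

# Miniature, simulated stand-in for ALL1k (hypothetical column names)
ALL1k <- data.frame(
    probe_id = paste0("probe_", 1:5),
    symbol = paste0("GENE", 1:5),
    matrix(rnorm(20), nrow = 5,
           dimnames = list(NULL, paste0("sample_", 1:4))),
    stringsAsFactors = FALSE
)

# Drop the two annotation columns and convert the rest to a numeric matrix
eset <- as.matrix(ALL1k[, -(1:2)])
rownames(eset) <- ALL1k$probe_id   # rows labelled by probe id
colnames(eset)                     # columns already carry the sample ids here

dim(eset)
```

A matrix is preferred here because functions such as cor, prcomp, and heatmap operate on numeric matrices.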
Some of the results below involve plots, and it’s convenient to choose pretty and functional colors. We use the RColorBrewer package; see colorbrewer.com.
‘divergent’ is a vector of colors that go from red (negative) to blue (positive). ‘highlight’ is a vector of length 2, light and dark green.
It is hard to visualize data from 1000 probesets and 128 samples. In addition, the data suffer from the problem that there are many more measurements per sample (1000) than there are samples (128). For these reasons, we might wish to reduce the number of dimensions in which the data are represented. One way of reducing the dimensionality is by calculating principal components. Here we calculate principal components:
The principal components are, roughly, vectors that pass through the data in such a way as to explain as much variability as possible. We can visualize the data in two-dimensional space by plotting the ‘rotation’, coloring points by whether they belong to the B or T ALL lineage (we will discuss other arguments to plot in a subsequent class).
Notice that the B and T lineages separate quite well.
There are a number of variants of principal components, and some tricky statistical issues involved even in the simple example above, so proceed with caution!
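Here is a minimal sketch of the principal components step on simulated expression values. The matrix sizes, the lineage labels, and the size of the lineage effect are all assumptions chosen so that the two groups are easy to see; the real eset is 1000 x 128.

```r
set.seed(11)

# Simulated probes x samples matrix (NOT the course data)
n_probe <- 200
n_sample <- 40
lineage <- factor(rep(c("B", "T"), times = c(28, 12)))
eset <- matrix(rnorm(n_probe * n_sample), nrow = n_probe)

# Made-up lineage effect: the first 50 probes are shifted in T samples
eset[1:50, lineage == "T"] <- eset[1:50, lineage == "T"] + 2

# Rows (probes) are observations, columns (samples) are variables
pc <- prcomp(eset)
dim(pc$rotation)    # one row per sample, one column per component

# Plot samples on the first two components, colored by lineage
plot(pc$rotation[, 1:2], col = c("blue", "red")[lineage], pch = 19)
```

Because the samples are the columns of the matrix, the ‘rotation’ has one row per sample, which is why plotting it gives one point per sample.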
Clustering, or ‘unsupervised machine learning’, tries to group samples based on similarity. We start by calculating the correlation between each of our samples.
cm is a 128 x 128 matrix, and it measures the correlation between each pair of samples – samples with similar patterns of gene expression will be correlated.
We can use the strength of correlation as a basis for describing the distance between samples. Two samples will be ‘near’ each other if they have similar patterns of correlation with other samples.
There are several ways of measuring distance; the default is a Euclidean measure.
The next step is to group (cluster) samples that are similar to one another into a dendrogram summarizing similarity.
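The correlation, distance, and clustering steps can be sketched as a short pipeline. The data are simulated with a made-up lineage effect (not the course data), and hclust is used here with its default settings.

```r
set.seed(11)

# Simulated probes x samples matrix with a made-up lineage effect
lineage <- factor(rep(c("B", "T"), times = c(28, 12)))
eset <- matrix(rnorm(200 * 40), nrow = 200)
eset[1:50, lineage == "T"] <- eset[1:50, lineage == "T"] + 2

cm <- cor(eset)     # sample-by-sample correlation matrix (40 x 40)
d <- dist(cm)       # Euclidean distance between rows of cm (the default)
hc <- hclust(d)     # hierarchical clustering of the samples

# Dendrogram, labelling each leaf with its lineage
plot(hc, labels = as.character(lineage))
```

Other distance measures can be requested through the method argument of dist.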
Somewhat more revealing of structure is a heatmap, with rows and columns clustered as we’ve just described, and with a column color bar to indicate whether the sample came from a B or T lineage sample.
The plot shows that, mostly, the B lineage samples cluster together, and the T lineages cluster together. If this were an exploratory analysis, one would look carefully to ensure that the mis-clustered samples were not mis-labeled.
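A heatmap along these lines can be drawn with base R. This sketch reuses simulated data and stands in for the RColorBrewer palettes with colors built from base functions; the specific colors are assumptions, not the ones used in the course.

```r
set.seed(11)

# Simulated probes x samples matrix with a made-up lineage effect
lineage <- factor(rep(c("B", "T"), times = c(28, 12)))
eset <- matrix(rnorm(200 * 40), nrow = 200)
eset[1:50, lineage == "T"] <- eset[1:50, lineage == "T"] + 2
cm <- cor(eset)

# Stand-in palettes built with base R
divergent <- colorRampPalette(c("red", "white", "blue"))(11)
highlight <- c("lightgreen", "darkgreen")

# Rows and columns are clustered with dist() and hclust() by default;
# ColSideColors adds the color bar marking each sample's lineage
hm <- heatmap(cm, symm = TRUE, col = divergent,
              ColSideColors = highlight[lineage])
```

The value returned (invisibly) by heatmap records the row and column ordering, which is handy for identifying samples that cluster with the other lineage.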
Classification (‘supervised machine learning’) groups samples into a pre-specified number of groups. There are many varieties of classification. The k-means method used here is, strictly speaking, an unsupervised partitioning method, because it never sees the lineage labels; we use it to illustrate assigning samples to a fixed number of groups. It aims to place k ‘centroids’ in such a way that the sum of squared distances from each point to its nearest centroid is minimized. Here we perform k-means with 2 groups, anticipating that we will recover the B and T ALL lineages; we use set.seed to set the random number seed to a particular (arbitrary) value to ensure reproducibility.
The ‘confusion’ matrix is one measure of how effective the classification is; here we see that all but one of the samples were assigned to the correct group.
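The k-means step and its confusion matrix might look like the following sketch. The data are simulated with a made-up lineage effect, and nstart = 10 is an added choice (not necessarily what the course used) that makes the result less sensitive to the random starting centroids.

```r
set.seed(123)   # arbitrary seed, for reproducibility

# Simulated probes x samples matrix with a made-up lineage effect
lineage <- factor(rep(c("B", "T"), times = c(28, 12)))
eset <- matrix(rnorm(200 * 40), nrow = 200)
eset[1:50, lineage == "T"] <- eset[1:50, lineage == "T"] + 2

# kmeans() groups rows, so transpose to put samples in rows
km <- kmeans(t(eset), centers = 2, nstart = 10)

# Confusion matrix: assigned group versus known lineage
tab <- table(cluster = km$cluster, lineage = lineage)
tab
```

The cluster numbers 1 and 2 are arbitrary labels, so a good result is one where each row of the table is dominated by a single lineage.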
There are objective criteria for parameter choice and validation of supervised machine learning algorithms.
_______________________________________________________________________________________________
There are several important lessons from this brief tour of statistical functions in R:
_______________________________________________________________________________________________
In the next class we’ll use data from the Centers for Disease Control and Prevention’s Behavioral Risk Factor Surveillance System (BRFSS) annual survey. Check out the web page for a little more information. We are using a small subset of these data, including a random sample of 10000 observations from each of 1990 and 2010. As preparation for the class:
_______________________________________________________________________________________________