In order to run R on Orchestra, we will first connect to an interactive queue, using 6 cores
jh361@mezzanine:~/$ bsub -n 6 -Is -q interactive bash
Job <7846600> is submitted to queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on clarinet002-072.orchestra>>
Set up environment variables:
source /groups/pklab/scw2014/setup.sh
Run R
R
We will load the counts for the ES/MEF mouse dataset of Islam et al., which contains 48 embyronic stem (ES) cells, and 44 mouse embryonic fibroblast (MEF) cells. We will read in these counts from the table we generated earlier:
data.dir <- "/groups/pklab/scw2014/ES.MEF"
counts <- read.delim(paste(data.dir,"combined.counts",sep="/"),skip=0,header=T,sep="\t",stringsAsFactors=F,row.names=1)
# clean up column (cell) names
colnames(counts) <- gsub("L139_Sample(\\d+).counts","\\1",colnames(counts))
colnames(counts)[1:10]
## [1] "MEF_54" "ESC_47" "MEF_61" "MEF_67" "ESC_19" "MEF_91" "MEF_62"
## [8] "ESC_20" "ESC_13" "MEF_70"
The rows of the counts
data frame represent genes, and the columns represent cells. However, the genes in our count file are named according to their Ensembl ID. In order to map these IDs to more informative gene names, we can use the R interface to BioMart. We will not run this procedure during this session, so as to avoid overloading the servers, but this is how it might be done:
library(biomaRt)
mart <- useMart(biomart = 'ensembl', dataset = 'mmusculus_gene_ensembl')
bm.query <- getBM(values=rownames(counts),attributes=c("ensembl_gene_id", "external_gene_name"),filters=c("ensembl_gene_id"),mart=mart)
genes <- list(ids=rownames(counts),names=bm.query[match(rownames(counts), bm.query$ensembl_gene_id),]$external_gene_id)
Due to time constraints we will simply read in the gene names from a table
genes <- read.table(paste(data.dir,"gene_name.map",sep="/"))
The row names of the count table can then be set to these gene names:
rownames(counts) <- make.unique(as.character(genes$names))
The identities of the different cells are known in this case, and we can obtain this information from the first three characters of the column labels:
cell.labels <- substr(colnames(counts),0,3)
The dataset contains some cells that were used as negative controls, so we will remove these before further analysis:
counts <- counts[-which(cell.labels=="EMP")]
cell.labels <- cell.labels[-which(cell.labels=="EMP")]
Reorder the cells so that the ES cells (columns) are all in the first half of the matrix:
counts <- cbind(counts[,cell.labels=="ESC"],counts[,cell.labels=="MEF"])
cell.labels <- substr(colnames(counts),0,3)
We can then examine some features of the dataset:
length(counts[,1]) # number of genes
## [1] 39017
length(cell.labels) # number of cells
## [1] 92
nES <- sum(cell.labels=="ESC") # number of ES cells
nMEF <- sum(cell.labels=="MEF") # number of MEF cells
A factor object can be defined to index the different cell types.
groups <- factor(cell.labels,levels=c("ESC","MEF"))
table(groups)
## groups
## ESC MEF
## 48 44
Before proceeding, we will restrict our attention to those genes that had non-zero expression:
counts <- counts[rowSums(counts)>0,]
nGenes <- length(counts[,1])
Let's also save the full cleaned up count matrix for use in subsequent exersize:
save(counts,file="counts.RData")
Let's examine the effective library size (estimated by the average read count per gene) for the different cells.
coverage <- colSums(counts)/nGenes
ord <- order(groups)
bar.positions <- barplot(coverage[ord],col=groups[ord],xaxt='n',ylab="Counts per gene")
axis(side=1,at=c(bar.positions[nES/2],bar.positions[nES+nMEF/2]),labels=c("ESC","MEF"),tick=FALSE)
We can see that on average the expression is higher for the MEF cells. We'll filter out those cells with very low coverage
counts <- counts[,coverage>1]
cell.labels <- cell.labels[coverage>1]
groups <- factor(cell.labels,levels=c("ESC","MEF"))
coverage <- coverage[coverage>1]
nCells <- length(cell.labels)
We can use a violin plot to visualize the distributions of the normalized counts for the most highly expressed genes.
counts.norm <- t(apply(counts,1,function(x) x/coverage)) # simple normalization method
top.genes <- tail(order(rowSums(counts.norm)),10)
expression <- log2(counts.norm[top.genes,]+0.001) # add a small constant in case there are zeros
library(caroline)
violins(as.data.frame(t(expression)),connect=c(),deciles=FALSE,xlab="",ylab="log2 expression")
## Loading required package: sm
## Package 'sm', version 2.2-5.4: type help(sm) for summary information
## Loading required package: MASS
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:sm':
##
## muscle
Some of these genes show a clear bimodal pattern of expression, indicating the presence of two subpopulations of cells.
Perhaps the simplest statistical model for count data is the Poisson, which has only one parameter. Under a Poisson model, the variance of the expression for a particular gene is equal to its mean expression. However, due to a variety of types of noise (both biological and technical), a better fit for read count data is usually obtained by using a negative binomial model, for which the variance can be written as: \[\mbox{variance} = \mbox{mean} + \mbox{overdispersion} x \mbox{mean}^2\] Since the overdispersion is a positive number, the variance under the negative binomial model is always higher than for the Poisson.
On our dataset, the overdispersion is clearly greater than zero for almost all genes, suggesting that the negative binomial will indeed be a better fit:
means <- apply(counts.norm,1,mean)
excess.var <- apply(counts,1,var)-means
excess.var[excess.var < 0] <- NA
overdispersion <- excess.var / means^2
hist(log2(overdispersion),main="Variance of read counts is higher than Poisson")
library(DESeq)
Several of the steps in the following analyses takes around 5 minutes to run on the full 92-cell dataset, using 6 cores on Orchestra. In order to save time for the purposes of this tutorial session, we will work with a subset of the data, sampling 20 cells from each of the groups:
es.cell.subset <- which(groups=="ESC")[(1:20)*2]
mef.cell.subset <- which(groups=="MEF")[(1:20)*2]
cell.subset <- c(es.cell.subset,mef.cell.subset)
counts <- counts[,cell.subset]
cell.labels <- cell.labels[cell.subset]
groups <- factor(cell.labels,levels=c("ESC","MEF"))
nCells <- length(cell.labels)
First we will use our count
and groups
objects to create a DESeq CountDataSet
object:
cds <- newCountDataSet(counts, groups)
DESeq estimates size factors to measure the library size for each cell.
cds <- estimateSizeFactors(cds)
Generally the estimated size factors are linearly related to the average read count per gene (coverage), but are estimated using a more robust method:
plot(sizeFactors(cds),colSums(counts)/nGenes)
In order to compute \(p\)-values for differential expression, DESeq requires an estimate of the overdispersion for each gene. Since we have \(40\) cells in total, we may wish to rely on the empirical estimates of the dispersion:
cds <- estimateDispersions(cds, sharingMode="gene-est-only")
## Warning: in estimateDispersions: sharingMode=='gene-est-only' will cause
## inflated numbers of false positives unless you have many replicates.
In cases where the empirical estimates may not be so reliable (e.g. when there are very few cells), we can pool information from the other cells in each group in order to obtain a more robust estimator for the dispersion
cds.pooled <- estimateDispersions(cds, method="per-condition",fitType="local")
# check the fit
plotDispEsts(cds.pooled,name="ESC")
plotDispEsts(cds.pooled,name="MEF")
We are now ready to carry out the differential expression analysis.
de.test <- nbinomTest(cds, "ESC", "MEF")
We will also repeat the analysis using the pooled estimates for the dispersion parameters, to see what kind of difference this makes to the results:
de.test.pooled <- nbinomTest(cds.pooled, "ESC", "MEF")
We can then extract the top differentially expressed genes, using a false discovery rate of \(0.05\):
find.significant.genes <- function(de.test.result,alpha=0.05) {
# filter out significant genes based on FDR adjusted p-values
filtered <- de.test.result[(de.test.result$padj < alpha) & !is.infinite(de.test.result$log2FoldChange) & !is.nan(de.test.result$log2FoldChange),]
# order by p-value, and print out only the gene name, mean count, and log2 fold change
sorted <- filtered[order(filtered$pval),c(1,2,6)]
}
de.genes <- find.significant.genes(de.test)
de.genes.pooled <- find.significant.genes(de.test.pooled)
This procedure pulls out a lot of genes as differentially expressed:
length(de.genes[,1])
## [1] 2974
length(de.genes.pooled[,1])
## [1] 778
We can examine the top \(15\) of these
head(de.genes.pooled,n=15)
## id baseMean log2FoldChange
## 11228 Dppa5a 630.24 -6.902
## 11100 Rpl4 24974.59 -3.095
## 12725 Brca1 43.81 -11.325
## 13612 Gm2373 43.59 -11.883
## 11928 D930048N14Rik 35.80 -11.466
## 5668 Anxa3 346.09 7.073
## 1985 Prnp 330.81 7.272
## 10457 Tdh 169.23 -8.306
## 7112 Myadm 414.66 6.418
## 7775 Fanci 28.98 -10.730
## 15969 Pou5f1 73.02 -7.279
## 5056 Gm13242 48.39 -7.926
## 10507 RP23-103I12.13 271.73 6.696
## 6967 Ccnd2 677.10 6.170
## 13700 Fst 263.89 5.867
In this case, three genes show an obvious relationship to stem-cell differentiation:
Gene Function
Dppa5a developmental pluripotency associated
Pou5f1 self-renewal of undifferentiated ES cells
Tdh mitochondrial; highly expressed in ES cells
An alternative approach to modeling the error noise in single cells and testing for differential expression is implemented by the scde package.
library(scde)
One of the major sources of technical noise in single-cell RNA seq data are dropout events, whereby genes with a non-zero expression are not detected in some cells due to failure to amplify the RNA.
The package SCDE (Single-Cell Differential Expression) explicitly models this type of event, estimating the probability of a dropout event for each gene, in each cell. The probability of differential expression is then computed after accounting for dropouts.
First we will set the number of cores to the number we selected when opening up the interactive queue:
n.cores <- 6
We then fit the SCDE model to the data. Since we fit models separately for each cell in SCDE, we pass in unnormalized counts.
scde.fitted.model <- scde.error.models(counts=counts,groups=groups,n.cores=n.cores,save.model.plots=F)
The fitting process relies on a subset of robust genes that are detected in multiple cross-cell comparisons. Since the groups
argument is supplied, the error models for the two cell types are fit independently (using two different sets of robust genes). If the groups
argument is omitted, the models will be fit using a common set.
We also need to define a prior distribution for the gene expression magnitude. We will use the default provided by SCDE
scde.prior <- scde.expression.prior(models=scde.fitted.model,counts=counts)
Using the fitted model and prior, we can now compute \(p\)-values for differential expression for each gene
ediff <- scde.expression.difference(scde.fitted.model,counts,scde.prior,groups=groups,n.cores=n.cores)
p.values <- 2*pnorm(abs(ediff$Z),lower.tail=F) # 2-tailed p-value
p.values.adj <- 2*pnorm(abs(ediff$cZ),lower.tail=F) # Adjusted to control for FDR
significant.genes <- which(p.values.adj<0.05)
length(significant.genes)
## [1] 467
The adjusted \(p\)-values are rescaled in order to control for false discovery rate (FDR) rather than the proportion of false positives. This is one way of dealing with the issue of multiple hypothesis testing. As well as correcting for multiple testing, we can also instruct SCDE to correct for any known batch effects. Examples of this procedure can be found in the online tutorial for SCDE.
We can now extract fold differences for the differentially expressed genes, with lower and upper bounds, and FDR-adjusted p-values:
ord <- order(p.values.adj[significant.genes]) # order by p-value
de <- cbind(ediff[significant.genes,1:3],p.values.adj[significant.genes])[ord,]
colnames(de) <- c("Lower bound","log2 fold change","Upper bound","p-value")
Examining the top \(15\) most significant differentially expressed genes, there are now additional genes related to stem cell differentiation
de[1:15,]
## Lower bound log2 fold change Upper bound p-value
## Apoe 5.543 7.580 9.503 3.064e-09
## Hmgb2 5.769 7.730 9.654 3.064e-09
## Tdh 6.411 8.371 10.332 3.064e-09
## Dppa5a 7.240 8.975 10.295 3.064e-09
## Pou5f1 4.864 6.637 8.522 3.064e-09
## 4930509G22Rik 4.902 6.863 9.012 6.521e-09
## Efcc1 5.166 7.353 9.578 9.323e-09
## Gm2373 5.053 7.504 9.540 1.564e-08
## Ifitm1 4.374 6.373 8.447 7.749e-08
## Col12a1 -9.465 -7.316 -5.091 8.620e-08
## Zfp42 4.223 6.109 8.145 1.262e-07
## Ccnd2 -8.824 -6.637 -4.638 1.324e-07
## Col1a1 -8.485 -6.788 -4.827 1.728e-07
## Hmga1 3.809 5.543 7.353 2.241e-07
## Utf1 4.148 6.260 8.258 7.249e-07
Gene Function
Dppa5a developmental pluripotency associated
Pou5f1 self-renewal of undifferentiated ES cells
Tdh mitochondrial; highly expressed in ES cells
Zfp42 used as a marker for undifferentiated pluripotent stem cells
Utf1 undifferentiated ES cell transcription factor
We can also examine how many of the top \(20\) genes found by both methods are the same:
intersect(rownames(de)[1:20],de.genes[1:20,1])
## [1] "Dppa5a"
intersect(rownames(de)[1:20],de.genes.pooled[1:20,1])
## [1] "Tdh" "Dppa5a" "Pou5f1" "Gm2373" "Ccnd2" "Utf1" "Gm13242"
For our example, estimating the dispersion using the pooled method in DESeq yields more genes in common with SCDE, and the four that are annotated all have some connection to stem-cell differentiation.
SCDE also provides facilities for more closely examining the expression differences for individual genes. Examining two of the genes shown above, we can see clear differences in the posterior distributions for expression magnitude.
scde.test.gene.expression.difference("Tdh",models=scde.fitted.model,counts=counts,prior=scde.prior)
## lb mle ub ce Z cZ
## Tdh 6.373 8.334 10.33 6.373 7.161 7.161
scde.test.gene.expression.difference("Pou5f1",models=scde.fitted.model,counts=counts,prior=scde.prior)
## lb mle ub ce Z cZ
## Pou5f1 4.902 6.675 8.56 4.902 7.153 7.153
On these plots, the coloured curves in the background show the distributions inferred for individual cells, and the dark curves denote overall distributions. By combining information from all the cells, the model is able to infer the overall distribution with high confidence.