The notebook shows an example of basic velocity anlaysis of E12.5 chromaffin data, which is a SMART-seq2 dataset. It shows how to estimate gene-relative velocity (with different pooling options), as well as how to estimate using alternative gene-structure based model.
We assume that the SMART-seq2 data has been demultiplexed and aligned, so that we have one bam file per cell. The recommended processing step is then to use velocyto.py
command line tool to annotate spliced, unspliced and spanning reads in the measured cells:
velocyto run_smartseq2 -d 1 `find ./ data/e12.5.bams/ -name "*.bam" -print` mm10.genes.gtf
If you want to re-run this command, please download bam files (and genes.refFlat) from here, and extract it (tar xvf bams.tar
) in the working directory. note: the file is fairly large - 5.6 GB! Running this command will produce a velocyto directory with two files:
onefilepercell_A1_unique_and_others_J2CH1.loom
- a loom file containing spliced, unspliced and spanned matricesdump/onefilepercell_A1_unique_and_others_J2CH1.hdf5
- an HDF5 file containing detailed molecular mapping information which we will use to estimate parameters for the gene-structure based model. You don’t need this for regular velocity estimation, and in this case -d 1
option may be skipped when running velocyto.pynote: if you don’t want to re-run the command, you can download the velocyto.py output folder here
Load the velocyto package:
library(velocyto.R)
Load the data and genome annotations:
Read in count matrices from the loom file:
ldat <- read.loom.matrices("data/velocyto/onefilepercell_A1_unique_and_others_J2CH1.loom")
(instead, for the purposes of the example we’ll just read in the resulting structure from the rds file)
ldat <- readRDS(url("http://pklab.med.harvard.edu/velocyto/chromaffin/ldat.rds"))
str(ldat)
List of 4
$ spliced :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. ..@ i : int [1:2724270] 3 9 10 13 14 17 19 20 28 29 ...
.. ..@ p : int [1:385] 0 7100 14272 21376 28491 36017 43375 50956 57637 64998 ...
.. ..@ Dim : int [1:2] 23420 384
.. ..@ Dimnames:List of 2
.. .. ..$ : chr [1:23420] "Xkr4" "Rp1" "Sox17" "Mrpl15" ...
.. .. ..$ : chr [1:384] "onefilepercell_A1_unique_and_others_J2CH1:A1_unique.bam" "onefilepercell_A1_unique_and_others_J2CH1:A10_unique.bam" "onefilepercell_A1_unique_and_others_J2CH1:A11_unique.bam" "onefilepercell_A1_unique_and_others_J2CH1:A12_unique.bam" ...
.. ..@ x : num [1:2724270] 24 3 22 136 48 21 15 32 109 13 ...
.. ..@ factors : list()
$ unspliced:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. ..@ i : int [1:1528882] 0 4 10 13 14 17 19 20 23 29 ...
.. ..@ p : int [1:385] 0 3911 8043 12057 16176 20795 25046 29389 33438 37382 ...
.. ..@ Dim : int [1:2] 23420 384
.. ..@ Dimnames:List of 2
.. .. ..$ : chr [1:23420] "Xkr4" "Rp1" "Sox17" "Mrpl15" ...
.. .. ..$ : chr [1:384] "onefilepercell_A1_unique_and_others_J2CH1:A1_unique.bam" "onefilepercell_A1_unique_and_others_J2CH1:A10_unique.bam" "onefilepercell_A1_unique_and_others_J2CH1:A11_unique.bam" "onefilepercell_A1_unique_and_others_J2CH1:A12_unique.bam" ...
.. ..@ x : num [1:1528882] 5 23 4 6 8 2 70 14 1 15 ...
.. ..@ factors : list()
$ ambiguous:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. ..@ i : int [1:198206] 3 19 30 121 144 145 188 295 366 387 ...
.. ..@ p : int [1:385] 0 499 993 1562 2070 2639 3184 3785 4321 4821 ...
.. ..@ Dim : int [1:2] 23420 384
.. ..@ Dimnames:List of 2
.. .. ..$ : chr [1:23420] "Xkr4" "Rp1" "Sox17" "Mrpl15" ...
.. .. ..$ : chr [1:384] "onefilepercell_A1_unique_and_others_J2CH1:A1_unique.bam" "onefilepercell_A1_unique_and_others_J2CH1:A10_unique.bam" "onefilepercell_A1_unique_and_others_J2CH1:A11_unique.bam" "onefilepercell_A1_unique_and_others_J2CH1:A12_unique.bam" ...
.. ..@ x : num [1:198206] 17 1 14 24 9 1 5 23 1 2 ...
.. ..@ factors : list()
$ spanning :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. ..@ i : int [1:443099] 3 17 40 80 111 132 146 161 200 216 ...
.. ..@ p : int [1:385] 0 1089 2250 3556 4768 6201 7494 8878 10237 11349 ...
.. ..@ Dim : int [1:2] 23420 384
.. ..@ Dimnames:List of 2
.. .. ..$ : chr [1:23420] "Xkr4" "Rp1" "Sox17" "Mrpl15" ...
.. .. ..$ : chr [1:384] "onefilepercell_A1_unique_and_others_J2CH1:A1_unique.bam" "onefilepercell_A1_unique_and_others_J2CH1:A10_unique.bam" "onefilepercell_A1_unique_and_others_J2CH1:A11_unique.bam" "onefilepercell_A1_unique_and_others_J2CH1:A12_unique.bam" ...
.. ..@ x : num [1:443099] 1 2 1 3 2 1 1 1 1 1 ...
.. ..@ factors : list()
Let’s reduce the cell names to the short well labels:
ldat <- lapply(ldat,function(x) {
colnames(x) <- gsub("_unique.bam","",gsub(".*:","",colnames(x)))
x
})
Read in cell cluster assignment and tSNE embedding used in the Furlan et al. (Science’17).
cell.colors <- readRDS(url("http://pklab.med.harvard.edu/velocyto/chromaffin/cell.colors.rds"))
emb <- readRDS(url("http://pklab.med.harvard.edu/velocyto/chromaffin/embedding.rds"))
Spliced expression magnitude distribution across genes:
hist(log10(rowSums(ldat$spliced)+1),col='wheat',xlab='log10[ number of reads + 1]',main='number of reads per gene')
Set up expression matrices, filtering genes to leave those that exceed some pre-defined g to the average expression magnitude
# exonic read (spliced) expression matrix
emat <- ldat$spliced;
# intronic read (unspliced) expression matrix
nmat <- ldat$unspliced
# spanning read (intron+exon) expression matrix
smat <- ldat$spanning;
# filter expression matrices based on some minimum max-cluster averages
emat <- filter.genes.by.cluster.expression(emat,cell.colors,min.max.cluster.average = 5)
nmat <- filter.genes.by.cluster.expression(nmat,cell.colors,min.max.cluster.average = 1)
smat <- filter.genes.by.cluster.expression(smat,cell.colors,min.max.cluster.average = 0.5)
# look at the resulting gene set
length(intersect(rownames(emat),rownames(nmat)))
[1] 8548
# and if we use spanning reads (smat)
length(intersect(intersect(rownames(emat),rownames(nmat)),rownames(smat)))
[1] 1696
We’ll start with what is perhaps the most robust estimate, that combines cell kNN pooling with the gamma fit based on an extreme quantiles:
Using min/max quantile fit, in which case gene-specific offsets do not require spanning read (smat) fit. Here the fit is based on the top/bottom 5% of cells (by spliced expression magnitude).
fit.quantile <- 0.05;
rvel.qf <- gene.relative.velocity.estimates(emat,nmat,deltaT=1,kCells = 5,fit.quantile = fit.quantile)
calculating cell knn ... done
calculating convolved matrices ... done
fitting gamma coefficients ... done. succesfful fit for 8548 genes
filtered out 1306 out of 8548 genes due to low nmat-emat correlation
filtered out 754 out of 7242 genes due to low nmat-emat slope
calculating RNA velocity shift ... done
calculating extrapolated cell state ... done
We visualize the velocities by projecting observed and extrapolated cells onto the first 5 PCs:
pca.velocity.plot(rvel.qf,nPcs=5,plot.cols=2,cell.colors=ac(cell.colors,alpha=0.7),cex=1.2,pcount=0.1,pc.multipliers=c(1,-1,-1,-1,-1))
log ... pca ... pc multipliers ... delta norm ... done
done