You can download the data used in this tutorial with the links listed below. It is a single-cell RNA-seq dataset of peripheral blood mononuclear cells from 45 healthy donors from a study by van der Wijst et al. (2018).

Downloads:

  1. scITD requires gene x cell barcode UMI matrix. This object should be of class ‘dgCMatrix’ or ‘matrix’: http://pklab.med.harvard.edu/jonathan/data/pbmc_counts_v2.rds

  2. scITD also requires metadata corresponding to each cell. At a minimum, the meta data must include columns labeled ‘donors’ and ‘ctypes’, but it may include other variables of interest. This object should be of class ‘data.frame’: http://pklab.med.harvard.edu/jonathan/data/pbmc_meta_v2.rds

  3. If the genes are labeled with symbols other than gene symbols, it can be helpful to provide the data for converting gene names. The following object is of class ‘data.frame’: http://pklab.med.harvard.edu/jonathan/data/genes.rds

The code used to produce/clean these objects can be found here: http://pklab.med.harvard.edu/jonathan/data/PBMC_preprocessing.R

First, we’ll load the package and the downloaded data from above. Make sure to change the directories to where you downloaded the data.

library(scITD)
#> Loading required package: Matrix

# counts matrix
pbmc_counts <- readRDS('/home/jmitchel/data/van_der_wijst/pbmc_counts_v2.rds')

# meta data matrix
pbmc_meta <- readRDS('/home/jmitchel/data/van_der_wijst/pbmc_meta_v2.rds')

# ensembl to gene name conversions
feature.names <- readRDS('/home/jmitchel/data/van_der_wijst/genes.rds')

Next, we need to set up a project container. This is done in two steps:
1. Set project parameters, specifying the cell types to include
2. Create an object to hold the data as well as results and other information

# set up project parameters
param_list <- initialize_params(ctypes_use = c("CD4+ T", "CD8+ T", "cMonocyte", "CD56(dim) NK", "B"),
                                ncores = 30, rand_seed = 10)

# create project container
pbmc_container <- make_new_container(count_data=pbmc_counts, 
                                     meta_data=pbmc_meta,
                                     gn_convert = feature.names, 
                                     params=param_list,
                                     label_donor_sex = TRUE)

Since our meta data did not have information regarding donor sex, we set label_donor_sex = TRUE to extract this information from the expression of sex-specific genes. This new information is automatically added to our metadata. While having this information is not absolutely necessary, it can be useful later on to see if we identify a factor that is associated with sex.

Next, we will form our data into a tensor with dimensions of donors x genes x cell-types. The parameters norm_method and scale_factor are used for normalizing the pseudobulked data. The parameters vargenes_method and vargenes_thresh are used for identifying overdispersed genes. Lastly, scale_var and var_scale_power are used for scaling the variance of each gene in each cell type slice of the final tensor.

The settings shown here will work well for most data, though it is worth playing with the var_scale_power if a sex-associated factor is not apparent. Typically a var_scale_power value between 0.5-2 yields good results. See the documentation for further details with all parameters.

# form the tensor from the data
pbmc_container <- form_tensor(pbmc_container, donor_min_cells=5,
                              norm_method='trim', scale_factor=10000,
                              vargenes_method='norm_var_pvals', vargenes_thresh=.1,
                              scale_var = TRUE, var_scale_power = 2)
#> parsing data matrix by cell/tissue type...
#> cleaning data...
#> Keeping 41 donors. All donors have at least 5 cells in each cell type included.
#> collapsing count matrices from cells to donors (aka pseudobulk operation)...
#> normalizing data...
#> calculating gene overdispersion factors...
#> selecting highly variable genes from each cell type...
#> scaling variance...
#> forming tensor...
#> Complete!

After running this function, one should check the number of overdispersed genes identified to make sure the tensor has a decent number of genes before running the decomposition. This can be checked by running the following line of code:

# number of genes included in the tensor
print(length(pbmc_container[["all_vargenes"]]))
#> [1] 741

To increase the number of genes included, simply rerun form_tensor() with a higher vargenes_thresh value.

We are now ready to run the Tucker decomposition and analyze the interesting biological variation present in our dataset! We specify rotation_type='hybrid' to indicate that we want to optize the factor loadings using a hybrid rotation method. This improves interpretability of the factors by making the gene patterns more modular/independent from one another. By setting ranks=c(5,10), we indicate that we want to extract 5 multicellular processes that vary across donors using 10 gene sets. To determine an appropriate ranks setting, there are a helpful tips we suggest at the bottom of this tutorial.

# run the tensor decomposition
pbmc_container <- run_tucker_ica(pbmc_container, ranks=c(5,10),
                                 tucker_type = 'regular', rotation_type = 'hybrid')

Next, we will plot the donor scores matrix that tells us the level to which each process/factor is present in each donor. We also show some associations between the donor scores and two metadata variables, sex and lanes.

# get donor scores-metadata associations
pbmc_container <- get_meta_associations(pbmc_container, vars_test=c('sex','lanes'), stat_use='pval')

# plot donor scores
pbmc_container <- plot_donor_matrix(pbmc_container, meta_vars=c('sex','lanes'),
                                    show_donor_ids = TRUE,
                                    add_meta_associations='pval')

# show the donor scores heatmap
pbmc_container$plots$donor_matrix

We can see that factor 3 is significantly associated with donor sex and factor 5 is significantly associated with the 10x lane batch.

Next, it will be useful to determine which genes from each cell type are significantly associated with each factor. This is done as follows:

# get significant genes
pbmc_container <- get_lm_pvals(pbmc_container)

Now let’s look at the corresponding loadings matrices for each factor reduced to only the significant genes.

# generate the loadings plots
pbmc_container <- get_all_lds_factor_plots(pbmc_container, 
                                           use_sig_only=TRUE,
                                           nonsig_to_zero=TRUE,
                                           sig_thresh=.02,
                                           display_genes=FALSE,
                                           gene_callouts = TRUE,
                                           callout_n_gene_per_ctype=3,
                                           show_var_explained = TRUE)

# arrange the plots into a figure and show the figure
myfig <- render_multi_plots(pbmc_container,data_type='loadings')
myfig

We see that the significant loadings for the sex-associated factor, factor 3, include genes on the X and Y chromosomes as expected. We can interpret positive loading genes (e.g. XIST) as upregulated in donors with positive donor scores for factor 3 (i.e. females) and vice versa. We also note that the 10x lane-associated factor, factor 5, involves the hemoglobin genes, which appear to be upregulated in lane5. This likely represents contamination of this lane batch with some red blood cells. The other factors show some interesting patterns consisting of processes specific to different cell types.

We can get a better sense for the cell-type-level processes involved in a factor by running GSEA. The following function can be run using a variety of ontology resources (see documentation) and in signed or unsigned mode. Here we run GSEA specifically for factor 1:

pbmc_container <- run_gsea_one_factor(pbmc_container, factor_select=1, method="fgsea", thresh=0.05, db_use=c("GO"), signed=TRUE)
#> Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.14% of the list).
#> The order of those tied genes will be arbitrary, which may produce unexpected results.
#> Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.41% of the list).
#> The order of those tied genes will be arbitrary, which may produce unexpected results.
#> Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (1.08% of the list).
#> The order of those tied genes will be arbitrary, which may produce unexpected results.
#> Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (3.25% of the list).
#> The order of those tied genes will be arbitrary, which may produce unexpected results.

We can see that factor 1 is characterized by enrichment of interferon stimulated genes in most cell types with enrichment of chemotaxis-related genes jointly present in the monocytes. Connecting this back to the donor scores, it appears that this multicellular-process as a whole is most strongly upregulated in donors s45, s40, and s7, who have large negative donor scores for the factor.

You may also want to access the donor scores or loadings matrices for your own downstream analyses. The full donor scores matrix can be found in pbmc_container[["tucker_results"]][[1]] and the unfolded loadings tensor can be found in pbmc_container[["tucker_results"]][[2]]. You can also access the donor scores and loadings matrix (genes x cell-types) for a single factor with the function below:

f1_data <- get_one_factor(pbmc_container, factor_select=1)
f1_dscores <- f1_data[[1]]
f1_loadings <- f1_data[[2]]

print(head(f1_dscores))
#>           [,1]
#> s1  0.07270559
#> s6 -0.01965080
#> s2  0.08810651
#> s3  0.07955338
#> s5 -0.01027193
#> s4 -0.11242261
print(head(f1_loadings))
#>                           B     CD4+ T CD56(dim) NK      CD8+ T    cMonocyte
#> ENSG00000002549  -3.5616512 -13.395114   -6.0742834  -4.6373796   -0.3021236
#> ENSG00000002586  -0.8410424  -7.746031   -2.0108996  -1.2594313    0.2887184
#> ENSG00000002933  -0.1509180  -1.443698   -0.6609559  -1.5202271    4.3516505
#> ENSG00000004468  -1.1793914  -1.320247   -2.4328156   1.0749901   -0.9678019
#> ENSG00000006075 -12.4856558 -12.490182  -12.2861332 -47.8864867 -103.7003393
#> ENSG00000006638   0.1477783   2.411675    0.1635843   0.8973208    0.1239082

We can also compute p-values for the association of each gene (in each cell type) with a factor of interest. These are automatically corrected for multiple hypothesis testing with FDR adjustment.

f1_pvals <- get_one_factor_gene_pvals(pbmc_container, factor_select=1)

print(head(f1_pvals))
#>                          B     CD4+ T CD56(dim) NK      CD8+ T    cMonocyte
#> ENSG00000002549 0.01700424 0.00151748   0.05553872 0.045267107 9.490782e-01
#> ENSG00000002586 0.10371965 0.32520183   0.74227040 0.718039528 8.878944e-01
#> ENSG00000002933 0.84398685 0.54425710   0.89105352 0.498979123 7.662047e-01
#> ENSG00000004468 0.96488056 0.99125444   0.10009765 0.960752511 7.592992e-01
#> ENSG00000006075 0.13576734 0.58425744   0.88451128 0.000234537 6.111966e-11
#> ENSG00000006638 0.63226324 0.86861225   0.99433358 0.950913196 1.000000e+00

Rank Selection: As noted above, the choice of ranks to use in the Tucker decomposition is not immediately obvious. To reiterate, the first element of this parameter represents the total number of factors that will be extracted. The second element of the parameter is used internally, and it roughly represents the number of unique sets of co-varying genes. To help select an appropriate number of ranks, we implemented a similar strategy to one which is commonly used with matrix decompositions. Briefly, we look at changes in reconstruction error when increasing the number of factors and compare this to the change in error seen with a randomly shuffled dataset. Here, we compute this using SVD on the tensor unfolded along either the donor dimension or the genes dimension to determine the approximate values for the first and second elements of the ranks parameter, respectively.

# get assistance with rank determination
pbmc_container <- determine_ranks_tucker(pbmc_container, max_ranks_test=c(10,15),
  shuffle_level='cells', 
  num_iter=10, 
  norm_method='trim',
  scale_factor=10000,
  scale_var=TRUE,
  var_scale_power=2)
#> Warning: replacing previous import 'lifecycle::last_warnings' by 'rlang::last_warnings' when loading 'hms'

pbmc_container$plots$rank_determination_plot

From the top right plot, we can see that when the number of donor factors goes above 5, then the reduction in error no longer exceeds that which is seen in with the shuffled data. This is why we used 5 main factors in our analyses above. The bottom right plot suggests to use around 8 gene factors.

Another way to help determine the appropriate number of ranks is to evaluate the stability of the factors when the decomposition is run on the dataset subsampled to a smaller fraction of donors. If too many factors are extracted and thus capture noise, then these factors will likely not demonstrate high reproducibility. To apply this approach we use the run_stability_analysis() function. The sub_prop parameter determines the fraction of donors to keep in each subsampling iteration.

# run stability analysis
pbmc_container <- run_stability_analysis(pbmc_container,ranks=c(5,10),n_iterations=50,subset_type='subset', sub_prop=.95)
pbmc_container$plots$stability_plot_dsc

Here, we observe good stability of the factors as indicated by the high correlations between donor scores from the original decomposition to those of each subsampled decomposition. This indicates that our factors more likely represent biology than random noise.

There are some additional signs one can look for when determining the appropriate number of ranks to use. One simple approach is to check to see if any factors have a very small number of significant genes, as this may indicate a the factor represents noise. Additionally, one can check for certain expected signals such as a sex-associated factor. If one is not seen, then it may be an indication that a higher number of ranks is needed.

This concludes the walkthrough! We demonstrated the basic steps and post-processing tools needed for using scITD. The package also includes additional tools to investigate ligand-receptor interactions and cell proportion shifts that are associated with the factors.