Hi,
I have been trying to use the scran package to normalise scRNA-seq data prior to DE analysis. The data was mapped with RSEM and a matrix of estimated counts fed into scran as shown in the tutorial.
I have a couple of questions that I would like to double-check before I proceed:
I have computed the computeSumFactors() factors and they seem to make sense, but I would like to visualise the effect of the normalisation on the data. The PCA plots don't change (is there a different way to parameterise the function to take this into account?). It would also be nice to have clearer instructions on how to export the normalised counts for application in different approaches (e.g. computing distances for clustering/tSNE/etc).
Visualisation aside, when I try to export the data for DESeq it fails complaining that values are not integers. Can't blame it too much because my RSEM counts are not integers to start with, but a cleaner failing would be nicer.
Assuming I can't use DESeq2 for this because of the non-integer counts, I'm happy to use edgeR. Is there a specific place in the SCE object where I should be sorting the design matrix and factors so that they are automatically inserted in the edgeR object that gets exported by scran? Or should I just go about modifying the object after export to add these in correctly?
Many thanks!
> sessionInfo();
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] dplyr_0.5.0 biomaRt_2.28.0
[3] DESeq2_1.12.3 SummarizedExperiment_1.2.3
[5] GenomicRanges_1.24.2 GenomeInfoDb_1.8.3
[7] IRanges_2.6.1 S4Vectors_0.10.2
[9] edgeR_3.14.0 limma_3.28.17
[11] scran_1.0.3 scater_1.0.4
[13] ggplot2_2.1.0 Biobase_2.32.0
[15] BiocGenerics_0.18.0 BiocParallel_1.6.3
[17] tsne_0.1-3
loaded via a namespace (and not attached):
[1] viridis_0.3.4 dynamicTreeCut_1.63-1 splines_3.3.1
[4] Formula_1.2-1 shiny_0.13.2 assertthat_0.1
[7] latticeExtra_0.6-28 RSQLite_1.0.0 lattice_0.20-33
[10] chron_2.3-47 digest_0.6.10 RColorBrewer_1.1-2
[13] XVector_0.12.1 colorspace_1.2-6 htmltools_0.3.5
[16] httpuv_1.3.3 Matrix_1.2-6 plyr_1.8.4
[19] XML_3.98-1.4 genefilter_1.54.2 zlibbioc_1.18.0
[22] xtable_1.8-2 scales_0.4.0 tibble_1.1
[25] annotate_1.50.0 lazyeval_0.2.0 nnet_7.3-12
[28] survival_2.39-5 magrittr_1.5 mime_0.5
[31] foreign_0.8-66 shinydashboard_0.5.1 tools_3.3.1
[34] data.table_1.9.6 matrixStats_0.50.2 stringr_1.0.0
[37] munsell_0.4.3 locfit_1.5-9.1 cluster_2.0.4
[40] AnnotationDbi_1.34.4 rhdf5_2.16.0 grid_3.3.1
[43] RCurl_1.95-4.8 tximport_1.0.3 rjson_0.2.15
[46] bitops_1.0-6 gtable_0.2.0 DBI_0.4-1
[49] reshape2_1.4.1 R6_2.1.2 gridExtra_2.2.1
[52] zoo_1.7-13 Hmisc_3.17-4 stringi_1.1.1
[55] Rcpp_0.12.6 geneplotter_1.50.0 rpart_4.1-10
[58] acepack_1.3-3.3
"Nothing" is a strong statement. Are you saying that the PCA coordinates are exactly the same after normalization? I find this hard to believe. Try running this code:
... and then compare the two sets of coordinates:
These values shouldn't be identical unless the size factors are directly proportional to the library size (as the original
exprs
are computed with library size normalization). And the latter should be impossible, as estimation errors should be present.As for DESeq2, it's hard to imagine a more straightforward error message. Non-integer values are not supported, so if you want to overcome that, give it integers. I won't perform automatic coercion to integers in
convertTo
because the sensibility of rounding non-integer values is context-dependent. The user is responsible for deciding whether rounding is appropriate.Finally, in edgeR, there's no reason or support for adding a design matrix to a
DGEList
object. Design matrices are only stored inDGEGLM
objects, and this is done by supplying the design matrix toglmFit
or friends. You can add group information via the...
argument inconvertTo
- see?DGEList
for the relevant argument - but that's about it.Just to echo Aaron, you can and should round estimated counts to integers before inputting to DESeq2. This is a minor hurdle for the end user, but an intentional one, because we want to emphasize that the incoming values are on the scale of counts and not normalized in any way. Rounding an estimated count results in nearly no loss in precision.