Question

Using Scran with RSEM/DEseq

0

Entering edit mode

a.guerra • 0

@aguerra-11252

Last seen 7.7 years ago

Hi,

I have been trying to use the scran package to normalise scRNA-seq data prior to DE analysis. The data was mapped with RSEM and a matrix of estimated counts fed into scran as shown in the tutorial.

I have a couple of questions that I would like to double-check before I proceed:

I have computed the computeSumFactors() factors and they seem to make sense, but I would like to visualise the effect of the normalisation on the data. The PCA plots don't change (is there a different way to parameterise the function to take this into account?). It would also be nice to have clearer instructions on how to export the normalised counts for application in different approaches (e.g. computing distances for clustering/tSNE/etc).

Visualisation aside, when I try to export the data for DESeq it fails complaining that values are not integers. Can't blame it too much because my RSEM counts are not integers to start with, but a cleaner failing would be nicer.

Assuming I can't use DESeq2 for this because of the non-integer counts, I'm happy to use edgeR. Is there a specific place in the SCE object where I should be sorting the design matrix and factors so that they are automatically inserted in the edgeR object that gets exported by scran? Or should I just go about modifying the object after export to add these in correctly?

Many thanks!

> sessionInfo();

R version 3.3.1 (2016-06-21)

Platform: x86_64-apple-darwin13.4.0 (64-bit)

Running under: OS X 10.11.5 (El Capitan)

locale:

[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:

[1] stats4 parallel stats graphics grDevices utils datasets

[8] methods base

other attached packages:

[1] dplyr_0.5.0 biomaRt_2.28.0

[3] DESeq2_1.12.3 SummarizedExperiment_1.2.3

[5] GenomicRanges_1.24.2 GenomeInfoDb_1.8.3

[7] IRanges_2.6.1 S4Vectors_0.10.2

[9] edgeR_3.14.0 limma_3.28.17

[11] scran_1.0.3 scater_1.0.4

[13] ggplot2_2.1.0 Biobase_2.32.0

[15] BiocGenerics_0.18.0 BiocParallel_1.6.3

[17] tsne_0.1-3

loaded via a namespace (and not attached):

[1] viridis_0.3.4 dynamicTreeCut_1.63-1 splines_3.3.1

[4] Formula_1.2-1 shiny_0.13.2 assertthat_0.1

[7] latticeExtra_0.6-28 RSQLite_1.0.0 lattice_0.20-33

[10] chron_2.3-47 digest_0.6.10 RColorBrewer_1.1-2

[13] XVector_0.12.1 colorspace_1.2-6 htmltools_0.3.5

[16] httpuv_1.3.3 Matrix_1.2-6 plyr_1.8.4

[19] XML_3.98-1.4 genefilter_1.54.2 zlibbioc_1.18.0

[22] xtable_1.8-2 scales_0.4.0 tibble_1.1

[25] annotate_1.50.0 lazyeval_0.2.0 nnet_7.3-12

[28] survival_2.39-5 magrittr_1.5 mime_0.5

[31] foreign_0.8-66 shinydashboard_0.5.1 tools_3.3.1

[34] data.table_1.9.6 matrixStats_0.50.2 stringr_1.0.0

[37] munsell_0.4.3 locfit_1.5-9.1 cluster_2.0.4

[40] AnnotationDbi_1.34.4 rhdf5_2.16.0 grid_3.3.1

[43] RCurl_1.95-4.8 tximport_1.0.3 rjson_0.2.15

[46] bitops_1.0-6 gtable_0.2.0 DBI_0.4-1

[49] reshape2_1.4.1 R6_2.1.2 gridExtra_2.2.1

[52] zoo_1.7-13 Hmisc_3.17-4 stringi_1.1.1

[55] Rcpp_0.12.6 geneplotter_1.50.0 rpart_4.1-10

[58] acepack_1.3-3.3

scran normalization deseq2 rsem edger • 2.0k views

ADD COMMENT • link 7.7 years ago a.guerra • 0

score 0 · Answer 1 · 2016-08-08

If you want to get normalized log-expression values, all you have to do is:

sce <- computeSumFactors(sce, ...) # "..." standing for other arguments
sce <- normalize(sce)
exprs(sce) # use for whatever you want.

This process should be described in both the vignette and workflow, so I'm not sure what's unclear about it.

scran doesn't have PCA (or t-SNE, or MDS) plotting functions, but scater does. These automatically use the exprs from the SCESet object, so you can just call them after running normalize. In general, though, normalization tends not to have a big effect on clustering or dimensionality reduction, as subpopulations that are well-separated in high-dimensional space will remain so regardless of what type of normalization you apply.

Rather, I visualize the results by plotting the size factors against the library sizes to examine the spread of size factors (to see the variability in technical biases/RNA content across the population) and to determine any composition biases are present (if the trend does not lie on a straight line). You could use MA plots as well, but the counts tend to be too low and variable for that.

Finally, if you want to export the data for use in DESeq2 or edgeR, you could consider using the convertTo function:

y <- convertTo(sce, type="edgeR")
dds <- convertTo(sce, type="DESeq2")

... rather than doing it manually. There are subtleties regarding size factor conversion that need to be handled properly.

score 0 · Answer 2 · 2016-08-08

0

Entering edit mode

a.guerra • 0

@aguerra-11252

Last seen 7.7 years ago

Thanks,

Those were exactly the functions I used. The error in DESeq2 still remains, and the problem with the PCA is that it doesn't change at all, not much, not little, nothing. I will try the plot you suggest though. I didn't realise the PCA function was coming from a different package.

The error I described is in the convertTo function when trying to convert to DESeq2. Doing it manually is exactly what I'm trying to avoid. Should I assume that there is no way to add the design matrix to the object directly?

ADD COMMENT • link 7.7 years ago a.guerra • 0

0

Entering edit mode

"Nothing" is a strong statement. Are you saying that the PCA coordinates are exactly the same after normalization? I find this hard to believe. Try running this code:

sce <- newSCESet(...) # make a new SCESet without storing any size factors.
oldp <- plotPCA(sce)
sce <- computeSumFactors(sce)
sce <- normalize(sce)
newp <- plotPCA(sce)

... and then compare the two sets of coordinates:

ggplot2::ggplot_build(oldp)$data[[1]][,c("x", "y")]
ggplot2::ggplot_build(newp)$data[[1]][,c("x", "y")]

These values shouldn't be identical unless the size factors are directly proportional to the library size (as the original exprs are computed with library size normalization). And the latter should be impossible, as estimation errors should be present.

As for DESeq2, it's hard to imagine a more straightforward error message. Non-integer values are not supported, so if you want to overcome that, give it integers. I won't perform automatic coercion to integers in convertTo because the sensibility of rounding non-integer values is context-dependent. The user is responsible for deciding whether rounding is appropriate.

Finally, in edgeR, there's no reason or support for adding a design matrix to a DGEList object. Design matrices are only stored in DGEGLM objects, and this is done by supplying the design matrix to glmFit or friends. You can add group information via the ... argument in convertTo - see ?DGEList for the relevant argument - but that's about it.

ADD REPLY • link 7.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Just to echo Aaron, you can and should round estimated counts to integers before inputting to DESeq2. This is a minor hurdle for the end user, but an intentional one, because we want to emphasize that the incoming values are on the scale of counts and not normalized in any way. Rounding an estimated count results in nearly no loss in precision.

ADD REPLY • link 7.7 years ago Michael Love 41k

score 0 · Answer 3 · 2016-08-08

0

Entering edit mode

a.guerra • 0

@aguerra-11252

Last seen 7.7 years ago

I'll do that and get back to you. Thanks! Yes, I've found them to be exactly the same, hence me thinking something is not going as expected and asking for your help to find the root cause of the problems. The rest were interface questions and are all clarified now, so thanks for that.

ADD COMMENT • link 7.7 years ago a.guerra • 0

0

Entering edit mode

For future reference, respond to posts using the "add comment" or "add reply" buttons.

Also, note that the code below will not give correct results in the release version of scater:

sce <- newSCESet(...)
sce2 <- computeSumFactors(sce) # or any method of adding size factors.
sce2 <- normalize(sce2)
oldp <- plotPCA(sce)
newp <- plotPCA(sce2)

The resulting coordinates will be identical, which is wrong. This is due to an issue with pass-by-reference behaviour inside environments of the SCESet, which was fixed a while ago in the devel version but hasn't been propagated to release.

ADD REPLY • link 7.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

That justifies it then. I'll install the development version. Thanks for helping me getting to the bottom of this.

ADD REPLY • link 7.7 years ago a.guerra • 0

0

Entering edit mode

Just to confirm that the development version gives different results and makes a lot more sense! Thanks!

ADD REPLY • link 7.7 years ago a.guerra • 0