Search
Question: How does newSCESet calculate expression matrix
0
16 months ago by
siajunren0 wrote:

In the Scater package, according to the vignette, newSCESet generates the exprs slot as log2(counts-per-million) using the cpm function from edgeR, with a “prior count” value of 1. However, this is not the case when I checked.

library(scater)
set.seed(221)
y <- matrix(rnbinom(20,size=1,mu=10),5,4)
sce2=newSCESet(countData=y, logExprsOffset = 1)
A=get_exprs(sce2, exprs_values = 'exprs')
B=log2(cpm(y)+1)

A looks varsely different from B.

modified 16 months ago by davis90 • written 16 months ago by siajunren0
0
16 months ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

The vignette is wrong. The exprs are log-(normalized) count values, and are closer to log2(count + 1) after adjusting the counts for differences in the library sizes between cells. They are not log-CPMs.

See https://github.com/davismcc/scater/issues/110 for a related discussion.

That’s a reminder to update the vignette…which is on the to-do list. Davis — Davis McCarthy NHMRC Early Career Fellow Stegle Group EMBL-EBI, Cambridge, UK www.ebi.ac.uk > On Thursday, May 25, 2017 at 5:26 pm, Aaron Lun [bioc] <noreply@bioconductor.org mailto:noreply@bioconductor.org)=""> wrote: > Activity on a post you are following on support.bioconductor.org (https://support.bioconductor.org) > > User Aaron Lun (Aaron Lun) wrote Answer: How does newSCESet calculate expression matrix (https://support.bioconductor.org/p/96288/#96289): > > > The vignette is wrong. The exprs are log-(normalized) count values, and are closer to log2(count + 1) after adjusting the counts for differences in the library sizes between cells. They are not log-CPMs. > > > See https://github.com/davismcc/scater/issues/110 for a related discussion. > > > Post tags: scater > > > You may reply via email or visit A: How does newSCESet calculate expression matrix > >
0
16 months ago by
davis90
United Kingdom
davis90 wrote:
OK, you haven’t told us what version of scater you’re using, so I’ll assume it’s the latest version, 1.4.0. From the documentation for newSCESet (Details section): “...an SCESet object will contain exprs values; these will be computed as log2(*pm + offset) values if a data type higher in the hierarchy is supplied as the expression matrix.” So if count data are supplied, then exprs will be log2(counts-per-million + logExprsOffset). This is indeed different from the output of edgeR::cpm, which uses the prior counts. The edgeR::cpm behaviour was not particularly transparent/expected for many users, hence the change to the current approach. Best wishes Davis — Davis McCarthy NHMRC Early Career Fellow Stegle Group EMBL-EBI, Cambridge, UK www.ebi.ac.uk > On Thursday, May 25, 2017 at 5:08 pm, siajunren [bioc] <noreply@bioconductor.org mailto:noreply@bioconductor.org)=""> wrote: > Activity on a post you are following on support.bioconductor.org (https://support.bioconductor.org) > > User siajunren (siajunren) wrote Question: How does newSCESet calculate expression matrix (https://support.bioconductor.org/p/96288/): > > > In the Scater package, according to the vignette, newSCESet generates the exprs slot as log2(counts-per-million) using the cpm function from edgeR, with a “prior count” value of 1. However, this is not the case when I checked. > > library(scater) set.seed(221) y <- matrix(rnbinom(20,size=1,mu=10),5,4) sce2=newSCESet(countData=y, logExprsOffset = 1) A=get_exprs(sce2, exprs_values = 'exprs') B=log2(cpm(y)+1) > > A looks varsely different from B. > > > Post tags: scater > > > You may reply via email or visit How does newSCESet calculate expression matrix > >

On a slightly related note I noticed a dramatic change in my results using scater when I recently updated R from version 3.3.2 to version 3.4.1, leading to the update from scater_1.2.0 to scater_1.3.35. Could this be due to the change in how you calculate exprs

If using scater_1.4.0 makes a difference I haven't tested it yet as I'm dependent on the available binary packages. I'm not sure how to compile package source code on Windows.

The latest version in BioC-release should be 1.4.0. Though now that I look at the scater landing page, it seems that the Windows build for 1.4.0 is failing due to the vignette, which is rather disconcerting. I will nag Davis about this.

In any case, there's been a lot of changes between 1.2.0 and 1.4.0, so you'll have to be more specific about what happened to your results. The latest version is always the best in terms of bug-free'ness and accuracy, so use that if possible.

Thanks for the reply. I haven't had a chance to take a closer look at the changes in the last couple weeks. The most obvious effect is when I used trendVar and decomposeVar to identify highly variable genes the number of hvg's with an FDR <= 0.05 went from 807 to 0 in one case and 13 to 2 in another. I'll try to figure out where the actual problem is. Also I see that 1.4.0 for Windows is now avail and I'll see what happens with that.

Hm. I didn't think I changed much for those methods - only trend fitting when trend="semiloess".

I don't seem to be able to specify the trend method anymore. When I do I get an error message about an unused argument.

var.fit.2 <- trendVar(sce_2, trend="loess", use.spikes=FALSE, span=0.3)
Error in .trend_var(mat, ..., subset.row = subset.row) :
unused argument (trend = "loess")

Would there be a large difference between loess and semiloess? I compared using parametric=TRUE vs FALSE and only noticed a slight variation in the trend line.

sessionInfo()

R version 3.4.1 (2017-06-30)

Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] scran_1.5.8          BiocParallel_1.11.4  biomaRt_2.33.3
[4] scater_1.5.0         ggplot2_2.2.1        Biobase_2.37.2
[7] BiocGenerics_0.23.0  gplots_3.0.1         RColorBrewer_1.1-2
[10] edgeR_3.19.3         limma_3.33.7         openxlsx_4.0.17
[13] BiocInstaller_1.27.2

loaded via a namespace (and not attached):
[1] dynamicTreeCut_1.63-1 viridis_0.4.0         bit64_0.9-7
[4] viridisLite_0.2.0     splines_3.4.1         gtools_3.5.0
[7] shiny_1.0.3           assertthat_0.2.0      statmod_1.4.30
[10] stats4_3.4.1          blob_1.1.0            vipor_0.4.5
[13] progress_1.1.2        RSQLite_2.0           lattice_0.20-35
[16] glue_1.1.1            digest_0.6.12         colorspace_1.3-2
[19] htmltools_0.3.6       httpuv_1.3.5          Matrix_1.2-10
[22] plyr_1.8.4            XML_3.98-1.9          pkgconfig_2.0.1
[25] zlibbioc_1.23.0       xtable_1.8-2          scales_0.4.1
[28] gdata_2.18.0          tibble_1.3.3          IRanges_2.11.12
[31] DT_0.2                lazyeval_0.2.0        magrittr_1.5
[34] mime_0.5              memoise_1.1.0         FNN_1.1
[37] beeswarm_0.2.3        shinydashboard_0.6.1  tools_3.4.1
[40] data.table_1.10.4     prettyunits_1.0.2     matrixStats_0.52.2
[43] stringr_1.2.0         S4Vectors_0.15.5      munsell_0.4.3
[46] locfit_1.5-9.1        AnnotationDbi_1.39.2  bindrcpp_0.2
[49] compiler_3.4.1        caTools_1.17.1        rlang_0.1.1
[52] rhdf5_2.21.2          grid_3.4.1            RCurl_1.95-4.8
[55] tximport_1.5.0        htmlwidgets_0.9       rjson_0.2.15
[58] igraph_1.1.2          labeling_0.3          bitops_1.0-6
[61] gtable_0.2.0          DBI_0.7               reshape2_1.4.2
[64] R6_2.2.2              zoo_1.8-0             gridExtra_2.2.1
[67] dplyr_0.7.2           bit_1.1-12            bindr_0.1
[70] KernSmooth_2.23-15    stringi_1.1.5         ggbeeswarm_0.5.3
[73] Rcpp_0.12.12   

Oh, I didn't realize you were using the current BioC-devel version. (You should really be using the BioC-release version unless you like pain and suffering.) Yes, I've changed the interface so that it is more general and easier to use. I can't remember off the top of my head, but I think I called it method="loess" rather than trend. The bonus of using parametric=TRUE should be minimal in most cases, but can be beneficial when you have spike-ins that are not evenly distributed across the abundance range of the genes.

Ah thanks. I switched when I was trying to get the 1.4.0 version initially and forgot to switch back. :)

As far as I can tell the trend line is the first major difference between my analysis with the earlier version and the newest version of scater. As I'm a relative newbie in troubleshooting R can you suggest where I should look to further isolate the differences? I just want to know for sure what's going on - whether the data really isn't significant, or something else.

Is the trend that different - can you make a plot with the old and new versions of scran and share it? Are you fitting to spike-in transcripts or to endogenous genes? Frankly, the trend fit should only be affected by the changes if the trend wasn't good in the first place.

Below are my plots from the old (top) and new (bottom) versions with the trendline in blue. The HVGs (yellow) in the first graph are at an FDR of 0.05 and in the second graph at an FDR of 0.5.

Top graph code:

sce_1 <- computeSpikeFactors(sce_1, general.use=TRUE)

sce_1 <- normalize(sce_1)

var.fit.1 <- trendVar(sce_1, trend="loess", use.spikes=FALSE, span=0.2)

var.out.1 <- decomposeVar(sce_1, var.fit.1)

Bottom graph code:

sce_1 <- computeSpikeFactors(sce_1, general.use=TRUE)

sce_1 <- normalize(sce_1)

var.fit.1 <- trendVar(sce_1, use.spikes=NA, span=0.3)

var.out.1 <- decomposeVar(sce_1, var.fit.1)

My reference genes (red) for computeSpikeFactors are endogenous housekeeping genes that had the lowest GroupSD using Normfinder (one group analysis). I fully expect it's my data that is the problem. I'm still trying different reference genes to use as there are no spike-in transcripts avail. in this dataset.

Okay, I'm pretty sure I know the cause - trigger warning: statistics!

One of the changes between 1.2.0 and 1.4.0 was the inclusion of trend rescaling. This accounts for the fact that the trend was fitted to the log-variances; without rescaling, the trend represents the exponentified abundance-dependent mean of the log-variances, rather than the mean of the variances on the original scale (which is more relevant and is what we actually want to estimate).

The rescaling usually increases the trend value, as the exp-mean of logs is smaller than the mean of the raw values. The extent of rescaling depends on the scatter of variance estimates around the trend, and is usually minimal. In your case, however, the trend is weak and there is a lot of scatter, resulting in a big upward shift from the rescaling. This explains the reduction in the number of HVGs.

Personally, I would not trust any kind of trend fit for this data set. I would guess that you have very few cells, given the lower power to call HVGs - even though plenty of genes are above the trend. I would also guess that these cells were sequenced at very low depth for a read-based protocol, given that most points are stuck at the early part of the curve, rather than occupying the "crest" of the wave. Perhaps try using improvedCV2, which might give you a nicer trend.

As an aside, I should mention that we do not have much faith in the use of house-keeping genes for scRNA-seq data analysis. Biological heterogeneity seems to be too strong for their reliable use, even in the case where they are truly non-DE across the population. I would suggest computeSumFactors instead for normalization in the absence of spike-ins. I haven't used NormFinder, so I won't say whether it's a good idea or not; you're on your own there.

Thank you for the insight and suggestions. I'll give those a try.

Just to explain the data a bit. The RNA was collected ages ago from patched C. elegans neurons for a different purpose and then ended up sitting in the freezer for a few years. The RNA-seq data was obtained a couple years ago I think, so still fairly early in the development of the technique (?). There are 3 types of cells with 3-7 individual cells in each group. However, the RNA from the cell types was pooled prior to sequencing so I really only have 3 "cells" available to analyse. I suspect I'm using scater and scran incorrectly for this dataset, but I've been learning a lot on how to use R in the process. Plus, it's easy to annotate the SCEset using the C.elegans BiomaRt.

Wow. I haven't seen a 3-cell data set since the early days of single-cell genomics. As for what you can do with it: ¯\_(ツ)_/¯. Most scRNA-seq analysis packages exploit the presence of many cells to overcome the high technical noise in order to draw reliable inferences, but this isn't possible here. For comparison, see https://doi.org/10.1101/104844.