I am trying to process a scRNA-seq data set generated with SmartSeq2 using `scater` v1.6.1. I quantified gene expression using Salmon and then loaded it into scater with `readSalmonResults` as described in the vignette. I then summarised the transcript-level expression into genes using `sce_gene = summariseExprsAcrossFeatures(sce_transcripts, exprs_values="tpm", summarise_by="feature_id")`.
However, I am now confused about what `counts(sce_gene)` actually contains and how to correctly deal with functions that operate on counts by default. I passed `exprs_values="tpm"` to `summariseExprsAcrossFeatures()` because summing the read counts of individual transcripts makes no sense to me (the transcripts have different lengths), and the documentation seems to confirm this. But if that is the case, I don't understand why `counts(sce_gene)` is set at all, or what it is supposed to represent.
For example, it seems like I should now run `normalise(sce_gene)` to calculate logcounts, which are used by functions like `plotExpression()`. However, this seems to operate on the `counts()` slot by default. Isn't that just going to give me nonsensical values? Should I be using `normalise(sce_gene, exprs_values='tpm')` instead? But then I would end up with log-TPMs rather than logcounts, wouldn't I? Or is "logcounts" actually a misnomer, and it really represents something more like log-TPMs (scaled by gene length and library size)?
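To make this concrete, the workaround I have in mind would be something like the sketch below: store log-TPMs as an extra assay and point `exprs_values` at it. The assay name `"logtpm"` and the gene in the comment are just placeholders of mine, and I don't know whether this is a sensible route.

```r
## Sketch of my workaround idea, not necessarily the sanctioned scater route:
## keep log-TPMs as an extra assay so that exprs_values can point at them.
## Assumes sce_gene is a SingleCellExperiment with a "tpm" assay.
library(SingleCellExperiment)

assay(sce_gene, "logtpm") <- log2(tpm(sce_gene) + 1)

## e.g. plotExpression(sce_gene, features = "Actb", exprs_values = "logtpm")
```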
Similarly, scran::cyclone() seems to operate on the counts by default. Should I be specifying TPMs instead there? Can I still use the pre-computed pairs from the package?
And finally, scran::trendVar() crashes when I run it on sce_gene because it apparently expects the SCE to have sizeFactors(). However, sizeFactors() would only be set if I ran computeSumFactors(), which would only make sense if I had a proper counts matrix of some sort. How do I deal with that? Can I just set `sizeFactors(sce_gene) = rep(1, ncol(sce_gene))` to signal that size factors were not calculated by scran?
edit: After some detective work, it looks like scater might actually be re-calculating the counts from the TPMs inside `summariseExprsAcrossFeatures`, as `tpm * lib_size * 1e-06`. If I'm reading this right, this should result in some sort of gene-length-adjusted (but not library-size-adjusted) count. Is that correct? Does this mean I should just use the counts after all, or are TPMs still better?
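Here is how I tried to reproduce that back-calculation, based purely on my reading of the source; `lib_size` is my guess at the per-cell library size being used, so treat the whole thing as an assumption:

```r
## Attempt to reproduce the back-calculation counts = tpm * lib_size * 1e-06.
## lib_size is my guess: the per-cell sum of transcript-level counts.
lib_size <- colSums(counts(sce_transcripts))
counts_manual <- sweep(tpm(sce_gene), 2, lib_size * 1e-06, "*")

## If I've read the code correctly, this should match the stored counts:
## all.equal(unname(counts_manual), unname(counts(sce_gene)))
```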
Thank you for your reply, Aaron!
So you think it shouldn't make a difference for trendVar whether the input is counts or TPMs?
I am still a bit confused by the documentation of that function, though, since it says "Log-transformed values are used as these tend to be more robust to genes with strong expression in only one or two outlier cells", but then mentions counts as a valid input. Does this mean that it will always log the input matrix internally, so I should NOT give it logcounts/log-TPMs? Or does it mean that it expects logged values? Or does it detect whether it is receiving logged or unlogged values? How?
Regarding `cyclone`: if I understand the method correctly, it only looks at the relative ranks of pairs of genes, doesn't it? In that case I don't see how it could make any difference whether I give it counts or TPMs, or whether they are logged or not. As long as the ranks don't change, the result should stay the same, shouldn't it? The only exception would be if the training data comes from non-gene-length-normalised counts, but hopefully that's not the case?
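For reference, this is roughly how I was planning to run it, following the scran vignette's use of the pre-trained mouse pairs (assuming my row names are Ensembl gene IDs, which as far as I can tell is what the training set expects):

```r
## Running cyclone with the pre-trained mouse marker pairs shipped with scran.
## gene.names must use the same identifiers as the training set (Ensembl IDs).
library(scran)

mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds", package = "scran"))
assignments <- cyclone(sce_gene, pairs = mm.pairs, gene.names = rownames(sce_gene))
table(assignments$phases)
```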
And yes, I agree that putting these types of values on a log scale is bound to lead to weird edge cases. I have seen approaches such as using +min(TPM) instead of +1 in the transformation, but then you end up with values below 1 and thus ugly negative values after the log.
Maybe one option would be to calculate log(transcripts+1), i.e. log(TPM*1e6+1)? That way at least the bias you introduce would be smaller, since most values will be >>1. However, I'm sure this could lead to all sorts of other weird issues...
Or maybe try something like sqrt(tpm) or asinh(tpm) instead? If you use asinh you could even accommodate negative counts! :p
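Just to visualise what I mean, here is a toy side-by-side of those transforms applied directly to the TPM matrix (purely illustrative, the object names are mine, and none of this is endorsed by scater/scran):

```r
## Toy comparison of the transformations discussed above, applied to TPMs.
tpm_mat <- tpm(sce_gene)

log_tpm   <- log2(tpm_mat + 1)  # the usual pseudo-count approach
sqrt_tpm  <- sqrt(tpm_mat)      # no pseudo-count needed, but no fold-change interpretation
asinh_tpm <- asinh(tpm_mat)     # ~linear near zero, ~logarithmic for large values
```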
Log-TPMs would probably work just as well as log-counts, assuming you don't have any issues with the `+1`.

For your second point: I assume you're referring to the description for the `assay.type` argument in `trendVar`, which mentions `"counts"` as a potential input. This is true in the sense that it is syntactically valid (i.e., "possible" in the same way that jumping off a bridge is physically possible). Whether it makes statistical sense is another matter, and I should probably change the documentation to have `"logcounts"` as a suggestion (as is done in the default arguments).

For your third point: yes, `cyclone` just looks at the ranks of different genes, but a gene's expected rank still depends on its expected abundance, which in turn depends on how you're quantifying expression, e.g., with UMIs or reads or whatnot. I can well imagine that this will differ depending on the length of the gene. I don't think the mouse training set used length normalization; I can only see mentions of size factors in Section 2.3 of the corresponding Methods paper. Clearly it didn't matter much, though, as they still got decent results when applying it on FPKM/RPKM values.

For your final point: `transcripts = TPM * lib.size / 1e6`, so we'll need the library size somehow. This seems to be what `summariseExprsAcrossFeatures` does internally, giving us the counts that we wanted in the first place. And yes, we could use alternative transformations that would achieve variance stabilization, though they lose the nice interpretability of the log-scale, where differences/distances directly represent log-fold changes.
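So, on the assumption that `counts(sce_gene)` now holds those length-adjusted counts, a rough sketch of the downstream sequence would be something like the following (`use.spikes=FALSE` only because I'm assuming there are no spike-ins in your data set):

```r
## Rough sketch of the downstream steps, assuming counts(sce_gene) holds the
## length-adjusted counts reconstructed by summariseExprsAcrossFeatures.
library(scran)
library(scater)

sce_gene <- computeSumFactors(sce_gene)       # size factors from the reconstructed counts
sce_gene <- normalise(sce_gene)               # populates logcounts() using those size factors
fit <- trendVar(sce_gene, use.spikes = FALSE) # variance trend fitted to the logcounts
```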