Question

scater/scran with TPMs from Salmon - counts?

0

Entering edit mode

nk • 0

@nk-7193

Last seen 6.3 years ago

United Kingdom

I am trying to process a scRNA-seq data set generated with SmartSeq2 using `scater` v1.6.1. I quantified gene expression using Salmon and then loaded it into scater with `readSalmonResults` as described in the vignette. I then summarised the transcript-level expression into genes using `sce_gene = summariseExprsAcrossFeatures(sce_transcripts, exprs_values="tpm", summarise_by="feature_id")`.

However, I am now very confused about what the meaning of `counts(sce_gene)` is and how I correctly deal with functions that work with counts by default. I provided `exprs_values="tpm"` to summariseExprsAcrossFeatures() because summing up the read counts from individual transcripts does not make sense to me (due to different lengths) and this seems to be confirmed by the documentation. However, if this is the case I don't understand why `counts(sce_gene)` is set at all, and what its meaning is.

For example, it seems like I should now run `normalise(sce_gene)` to calculate logcounts, which are used by functions like plotExpression(). However, this seems to operate on the counts() slot by default. Isn't this just going to give me nonsensical values? Should I be using `normalise(sce_gene, exprs_values='tpm')` instead? But then I wouldn't end up with logcounts but with logtpms, wouldn't I? Or is logcounts actually a misnomer and it actually represents something more like logtpms (scaled by gene length and library size)?

Similarly, scran::cyclone() seems to operate on the counts by default. Should I be specifying TPMs instead there? Can I still use the pre-computed pairs from the package?

And finally, scran::trendVar() crashes when I run it on sce_gene because it apparently expects the SCE to have sizeFactors(). However, sizeFactors() would only be set if I run computeSumFactors(), which would only make sense if I had some sort of proper counts matrix. How do I deal with that? Can I just set `sizeFactors(sce_gene) = rep(1, ncol(sce_gene))` to signal that sizeFactors were not calculated by scran?

edit: After some detective work it looks like scater might actually be re-calculating the counts from the TPMs inside summariseExprsAcrossFeatures, as `tpm * lib_size * 1e-06`. If I'm reading this right this should result in some sort of gene-length (but not library-size) adjusted count. Is that correct? Does this mean I should just use the counts afterall, or are TPM still better?

scater salmon scran • 1.7k views

ADD COMMENT • link updated 6.3 years ago by Aaron Lun ★ 28k • written 6.3 years ago by nk • 0

score 1 · Answer 1 · 2018-01-17

I'll answer for the scran part first. The vast majority of scran functions work on some count-based expression value; either log-expression values computed from counts, "normalized" counts, or the raw counts. Arguably I could relax these restrictions for trendVar to allow it to accept log-expression values. I've never had the need to do so, but I may make these changes in light of your experiences. In the meantime, setting all of the size factors to 1 should do the trick.

The only exception to the rule above is cyclone, which can theoretically work with any type of expression value. In practice, though, the usual caveats about the generalizability of classifiers apply here. The pairs were obtained by training on read count data, so your mileage may vary on other things. My feeling is that it would be fairly robust, as I've used the same classifiers for UMIs and it seems to do okay. Nonetheless, some caution is required.

I can't comment on the inner workings of summariseExprsAcrossFeatures, as I've never had the need to use it. But normalize is intended for counts - after all, tpms and the like are effectively already normalized, so to get log-values, all you have to do is:

assay(sce, "logtpm") <- log2(tpm(sce)+1)

... and then set exprs_values="logtpm" in scater functions or assay.type="logtpm" in scran functions, to tell the functions to use those log-expression values instead of looking for "logcounts".

Minor rant about +1 in log-transformations:

Note that the use of +1 is not quite as straightforward as you might think. If your TPMs are all large (>>1), then it's fine. But if your TPMs are small, then the +1 amounts to huge shrinkage towards a log-expression value of zero. This is most clearly demonstrated by considering a gene with an average TPM of 0.01 in one population and 0.02 in another population. Clearly there's a two-fold change in expression between these two, but if you use log2(TPM+1), you'll end up with an average log2-TPM of ~0.014 and ~0.028 (assuming E(log(X))~=log(E(X)) for simplicity). This is wrong because there should be a difference of 1 on the log-scale.

I never knew how to solve this, which is why the log-normalized expression values are computed from the counts by default in scater. In such cases, the shrinkage due to +1 is more predictable - easily interpreted as the addition of a single count, which has a big effect for small counts (where there's less information for computing log-fold changes) and less effect for larger counts. In contrast, a small TPM could feasibly be computed from very large counts if the transcript is long or the library size is large. The use of +1 would result in inappropriately strong shrinkage, needlessly causing the inaccurate calculation of the log-fold changes as observed above.

Perhaps it would be better to add a smaller value, but (i) I don't know how to pick it and (ii) people always complain to me when they get negative log-normalized values. "Oh, Aaron, I can't use non-negative matrix factorization!". "Why is my expression negative, that doesn't make any sense!". "I don't like the plots where the y-axis becomes negative, it looks ugly.". I mean, if we're putting values on a log-scale, why on earth should they be non-negative? Jeez. But now I've noticed I've gone completely off-topic, so I'll stop here.