scater/scran with TPMs from Salmon - counts?
1
0
Entering edit mode
nk • 0
@nk-7193
Last seen 5.2 years ago
United Kingdom

I am trying to process a scRNA-seq data set generated with SmartSeq2 using scater v1.6.1. I quantified gene expression using Salmon and then loaded it into scater with readSalmonResults as described in the vignette. I then summarised the transcript-level expression into genes using sce_gene = summariseExprsAcrossFeatures(sce_transcripts, exprs_values="tpm", summarise_by="feature_id").

However, I am now very confused about what the meaning of counts(sce_gene) is and how I correctly deal with functions that work with counts by default. I provided exprs_values="tpm" to summariseExprsAcrossFeatures() because summing up the read counts from individual transcripts does not make sense to me (due to different lengths) and this seems to be confirmed by the documentation. However, if this is the case I don't understand why counts(sce_gene) is set at all, and what its meaning is.

For example, it seems like I should now run normalise(sce_gene) to calculate logcounts, which are used by functions like plotExpression(). However, this seems to operate on the counts() slot by default. Isn't this just going to give me nonsensical values? Should I be using normalise(sce_gene, exprs_values='tpm') instead? But then I wouldn't end up with logcounts but with logtpms, wouldn't I? Or is logcounts actually a misnomer and it actually represents something more like logtpms (scaled by gene length and library size)?

Similarly, scran::cyclone() seems to operate on the counts by default. Should I be specifying TPMs instead there? Can I still use the pre-computed pairs from the package?

And finally, scran::trendVar() crashes when I run it on sce_gene because it apparently expects the SCE to have sizeFactors(). However, sizeFactors() would only be set if I run computeSumFactors(), which would only make sense if I had some sort of proper counts matrix. How do I deal with that? Can I just set sizeFactors(sce_gene) = rep(1, ncol(sce_gene)) to signal that sizeFactors were not calculated by scran?

edit: After some detective work it looks like scater might actually be re-calculating the counts from the TPMs inside summariseExprsAcrossFeatures, as tpm * lib_size * 1e-06. If I'm reading this right this should result in some sort of gene-length (but not library-size) adjusted count. Is that correct? Does this mean I should just use the counts afterall, or are TPM still better?

scater salmon scran • 1.2k views
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 9 hours ago
The city by the bay

I'll answer for the scran part first. The vast majority of scran functions work on some count-based expression value; either log-expression values computed from counts, "normalized" counts, or the raw counts. Arguably I could relax these restrictions for trendVar to allow it to accept log-expression values. I've never had the need to do so, but I may make these changes in light of your experiences. In the meantime, setting all of the size factors to 1 should do the trick.

The only exception to the rule above is cyclone, which can theoretically work with any type of expression value. In practice, though, the usual caveats about the generalizability of classifiers apply here. The pairs were obtained by training on read count data, so your mileage may vary on other things. My feeling is that it would be fairly robust, as I've used the same classifiers for UMIs and it seems to do okay. Nonetheless, some caution is required.

I can't comment on the inner workings of summariseExprsAcrossFeatures, as I've never had the need to use it. But normalize is intended for counts - after all, tpms and the like are effectively already normalized, so to get log-values, all you have to do is:

assay(sce, "logtpm") <- log2(tpm(sce)+1)


... and then set exprs_values="logtpm" in scater functions or assay.type="logtpm" in scran functions, to tell the functions to use those log-expression values instead of looking for "logcounts".

Minor rant about +1 in log-transformations:

Note that the use of +1 is not quite as straightforward as you might think. If your TPMs are all large (>>1), then it's fine. But if your TPMs are small, then the +1 amounts to huge shrinkage towards a log-expression value of zero. This is most clearly demonstrated by considering a gene with an average TPM of 0.01 in one population and 0.02 in another population. Clearly there's a two-fold change in expression between these two, but if you use log2(TPM+1), you'll end up with an average log2-TPM of ~0.014 and ~0.028 (assuming E(log(X))~=log(E(X)) for simplicity). This is wrong because there should be a difference of 1 on the log-scale.

I never knew how to solve this, which is why the log-normalized expression values are computed from the counts by default in scater. In such cases, the shrinkage due to +1 is more predictable - easily interpreted as the addition of a single count, which has a big effect for small counts (where there's less information for computing log-fold changes) and less effect for larger counts. In contrast, a small TPM could feasibly be computed from very large counts if the transcript is long or the library size is large. The use of +1 would result in inappropriately strong shrinkage, needlessly causing the inaccurate calculation of the log-fold changes as observed above.

Perhaps it would be better to add a smaller value, but (i) I don't know how to pick it and (ii) people always complain to me when they get negative log-normalized values. "Oh, Aaron, I can't use non-negative matrix factorization!". "Why is my expression negative, that doesn't make any sense!". "I don't like the plots where the y-axis becomes negative, it looks ugly.". I mean, if we're putting values on a log-scale, why on earth should they be non-negative? Jeez. But now I've noticed I've gone completely off-topic, so I'll stop here.

0
Entering edit mode

So you think it shouldn't make a difference for trendVar whether the input is counts or tpm?

I am still a bit confused by the documentation of that function though, since it says "Log-transformed values are used as these tend to be more robust to genes with strong expression in only one or two outlier cells." but then mentions counts as valid input. Does this mean that it will always log the input matrix internally, so I should NOT give it logcounts/logtpm? Or does it mean that it expects logged values? Or does it detect whether it is receiving logged or unlogged values? How?

Regarding cyclone: If I understand the method correctly, this only looks at the relative ranks of pairs of genes, doesn't it? In that case I don't see how it could make any difference whether I give it counts or TPMs, whether they are logged or not. As long as the ranks don't change it should stay the same, shouldn't it? The only exception would be if the training data comes from on non-gene-length-normalized counts, but hopefully that's not the case?

And yes, I agree that putting these types of values on a log-scale is bound to lead to weird edge cases. I have seen approaches such as using +min(TPM) instead of +1 in the transformation, but then you do end up with values <1 and thus ugly negative values.

Maybe one option might be to calculate log(transcripts+1), ie. log(TPM*1e6+1)? That way at least the bias you introduce would be smaller, since most values will be >>1. However I'm sure this could lead to all sorts of other weird issues...

Or maybe try something like sqrt(tpm) or asinh(tpm) instead? If you use asinh you could even accommodate negative counts! :p

0
Entering edit mode

Log-TPMs would probably work just as well as log-counts, assuming you don't have any issues with the +1.

For your second point: I assume you're referring to the description for the assay.type argument in trendVar, which mentions "counts" as a potential input. This is true in the sense that it is syntactically valid (i.e., "possible" in the same way that jumping off a bridge is physically possible). Whether it makes statistical sense is another matter, and I should probably change the documentation to have "logcounts" as a suggestion (as is done in the default arguments).

For your third point: yes, cyclone just looks at the ranks of different genes, but a gene's expected rank still depends on its expected abundance, which in turn depends on how you're quantifying expression, e.g., with UMIs or reads or whatnot. I can well imagine that this will differ depending on the length of the gene. I don't think the mouse training set used length normalization, I can only see mentions of size factors in Section 2.3 of the corresponding Methods paper. Clearly it didn't matter much, though, as they still got decent results when applying it on FPKM/RPKM values.

For your final point: transcripts = TPM * lib.size/1e6, so we'll need the library size somehow. This seems to be what summariseExprsAcrossFeatures does internally, giving us some counts that we wanted in the first place. And yes, we could use alternative transformations that would achieve variance stabilization, though it loses the nice interpretability of the log-scale, where differences/distances directly represent log-fold changes.