Question

Full length single-cell RNA-seq - tximport output for downstream analysis

2


lshepard ▴ 40

@lshepard-7434

Last seen 4.4 years ago

United States

Hello,

I would like to continue a topic that was first started on this Biostars post. Essentially, in an attempt to help the OP from that topic, I brought up the question of how to do downstream analysis of data from a full-length RNA-seq protocol such as Smart-Seq2 when one uses Salmon (in quasi-mapping mode) and tximport.

The main question some of us were discussing is: what is the best protocol for using tximport with tools such as Seurat under these conditions? (Perhaps the tximport vignette could benefit from a small section on this, as it has for DESeq2, edgeR, etc.)

A few points I brought up were:

When I searched for tximport together with single-cell, Seurat, etc., alevin usually comes up. However, it's important to realize that much of the 10x Genomics tech is 3'-tagged RNA-seq and thus does not have the length bias that is present in the Smart-Seq protocol (so passing txi$counts as raw counts to Seurat makes perfect sense there). Of course, with Smart-Seq data we wouldn't use alevin at all, but just salmon, in the same way as for bulk RNA-seq.
Thus, my understanding is that the correct steps for a Smart-Seq/full-length protocol would be:

1) Import the data with the tximport setting countsFromAbundance="lengthScaledTPM", which would result in counts normalized for sequencing depth and length, stored in txi$counts.

2) Pass txi$counts to Seurat's CreateSeuratObject as counts. NOTE: I originally had in mind that one would likely want to do this with txOut=FALSE to get gene-level data, as I am not sure single-cell algorithms are sensitive enough for transcript-level analysis, DE, etc. Perhaps this would be a good place to get confirmation.

3) In Seurat, if one imports txi$counts generated with countsFromAbundance="lengthScaledTPM", then one should likely follow the advice given by the Seurat team for data starting as TPMs (from their GitHub issue #668; I don't think the last answer is from a Seurat team member, but satijalab approved it with a reaction): skip the Seurat::NormalizeData() step, but transform the data to log scale (which is stored in object@metadata) prior to ScaleData. Also note that log scale in Seurat is natural log.

I believe that captures the main point from the follow-up discussion. Thanks for any advice (special thanks to Michael Love who suggested we post a question here).

tximport Seurat Smart-Seq Smart-Seq2 • 4.4k views

updated 5.6 years ago by Michael Love 43k • written 5.6 years ago by lshepard ▴ 40

score 3 · Accepted Answer · 2020-07-14

1) Yes, if you want a counts matrix corrected for length bias, you can use one of the two countsFromAbundance options.

Briefly, tximport offers three approaches for DGE: counts plus an offset (not possible if the downstream method does not accept an offset matrix), or either of the two countsFromAbundance options ("scaledTPM" or "lengthScaledTPM"). Either of the latter will work; I slightly prefer lengthScaledTPM because it should result in counts on a scale closer to the original ones.
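To make the lengthScaledTPM idea concrete, here is a minimal base-R sketch of the arithmetic as I understand it: TPM is multiplied by each feature's average length across samples, and each column is then rescaled to its original total count. The toy matrices and the names `counts`, `lengths`, and `scaled_counts` are made up for illustration; this is not tximport's own code.

```r
# Toy data (hypothetical): 3 genes x 2 cells, estimated counts and effective lengths
genes <- c("g1", "g2", "g3")
cells <- c("cellA", "cellB")
counts  <- matrix(c(100, 200, 300,   50, 400, 150), nrow = 3,
                  dimnames = list(genes, cells))
lengths <- matrix(c(1000, 2000, 1500,   1200, 1800, 1500), nrow = 3,
                  dimnames = list(genes, cells))

# TPM: counts per unit length, rescaled so each column sums to one million
rate <- counts / lengths
tpm  <- t(t(rate) / colSums(rate)) * 1e6

# Scale each gene's TPM by its average length across cells
length_scaled <- tpm * rowMeans(lengths)

# Rescale each column back to the original per-cell library size
scaled_counts <- t(t(length_scaled) * (colSums(counts) / colSums(length_scaled)))

# Per-cell totals are unchanged, so sequencing depth differences are still present
colSums(scaled_counts)
colSums(counts)
```

Because the last step restores each cell's original total, the resulting matrix is corrected for length bias but not for sequencing depth.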

2) Without knowing the dataset, I can't say either way.

You can try importing at transcript level and looking at the inferential relative variance (fishpond has an option to compute this). This gives a sense of how much uncertainty there is. We've recently looked into methods for incorporating compressed uncertainty into single cell analysis.
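As a rough illustration of what inferential relative variance measures, the following base-R sketch computes it on simulated inferential replicates for a single cell. The formula (variance in excess of the mean, divided by the mean plus a pseudocount, plus a small shift) and the defaults `pc = 5` and `shift = 0.01` reflect my understanding of fishpond's computeInfRV(); the simulated data and all variable names are hypothetical.

```r
set.seed(1)

# Hypothetical inferential replicates: 3 transcripts x 20 replicates for one cell
inf_reps <- rbind(
  tx1 = rpois(20, lambda = 50),          # roughly Poisson: little extra uncertainty
  tx2 = rnbinom(20, mu = 50, size = 2),  # overdispersed: high inferential uncertainty
  tx3 = rpois(20, lambda = 5)            # low abundance
)

inf_mean <- rowMeans(inf_reps)
inf_var  <- apply(inf_reps, 1, var)

# Assumed defaults, see the note above
pc    <- 5
shift <- 0.01

# Variance beyond what a Poisson would give, relative to the mean
inf_rv <- pmax(inf_var - inf_mean, 0) / (inf_mean + pc) + shift
inf_rv
```

With real data one would compute this per cell from the Gibbs or bootstrap replicates written by salmon and then summarize across cells; transcripts with consistently high values are the ones where transcript-level results deserve the most caution.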

3) No, you still need to normalize any of the counts matrices output by tximport for sequencing depth. All of the counts matrices output by tximport are still biased with respect to per-sample sequencing depth variation (this is on purpose, so that the downstream methods have a sense for the variable precision across samples).
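Putting the three points together, a workflow could look roughly like the sketch below. The base-R `log_normalize()` helper is a hypothetical stand-in that mimics what I understand Seurat's "LogNormalize" method to do (divide by the per-cell total, multiply by a scale factor, take the natural log of 1 + x). `import_smartseq()` is likewise a hypothetical wrapper: `files` (a named vector of salmon quant.sf paths, one per cell) and `tx2gene` (a transcript-to-gene data.frame) are assumed inputs, and the function is only defined here, not run.

```r
# Depth normalization in base R, for illustration only
log_normalize <- function(counts, scale_factor = 1e4) {
  log1p(t(t(counts) / colSums(counts)) * scale_factor)
}

# Toy matrix: cellB has the same composition as cellA at twice the depth
toy <- matrix(c(10, 30, 60,   20, 60, 120), nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("cellA", "cellB")))
norm <- log_normalize(toy)
norm  # the two columns agree once depth is accounted for

# Hypothetical wrapper around the real packages (requires tximport and Seurat)
import_smartseq <- function(files, tx2gene) {
  # Length-corrected gene-level counts; depth differences are still present
  txi <- tximport::tximport(files, type = "salmon", tx2gene = tx2gene,
                            countsFromAbundance = "lengthScaledTPM")
  obj <- Seurat::CreateSeuratObject(counts = txi$counts)
  # Normalize for sequencing depth rather than skipping this step
  Seurat::NormalizeData(obj, normalization.method = "LogNormalize",
                        scale.factor = 10000)
}
```

The key design choice is that length correction happens at import while depth normalization is left to the downstream tool, so NormalizeData is kept rather than skipped.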