Hi!
With bulk RNA-seq I am used to STAR-RSEM (gene-level)-tximport-DESeq2 as a standard workflow; it is well documented, and via tximport the effective_length from RSEM is also 'taken into account'. Now I am working with scRNA-seq and I am confused about how to import the RSEM counts for analysis: looking at the tximport source code, I don't see effective_length being used for anything other than being included as a variable in the resulting object.
I would like to know when, where and how the effective_length information from RSEM is incorporated. Does this benefit scRNA-seq, and if so how? What is the advised way of using RSEM count estimates in single-cell RNA-seq?
data: full-length data; smart-seq2
I am very grateful for any input on this, thanks!
ATpoint is correct -- it's not enough to say single-cell. In the tximport vignette we have a note about transcript length correction for 3' tagged scRNA-seq, which is worth a read:
<https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html>
Hi!
Thank you both for the comments. It is indeed full-length (I have updated the question).
It depends on the downstream tool, but the default pipelines in the tximport and tximeta vignettes incorporate the effective length as a statistical offset (for example in DESeq2 or edgeR). What tool were you planning on using downstream of import? Are you planning gene-level or isoform-level analysis?
Thank you. Gene-level analysis. I don't yet have a specific downstream pipeline in mind - I am trying to understand what limitations I might run into, or what tools I can/can't use. My intention with this question was just to get a better understanding of processes that happen between RSEM-tximport-DESeq2 in the context of scRNA; am I correct in interpreting that tximport itself doesn't use the length information, but this is used within DESeq2's normalization?
More specific context, which prompted me to look into this: I am following the OSCA books for analysis guidance (tximported RSEM counts; Smart-seq2; no spike-ins), and my normalization results with scran keep coming back wonky. After normalization I see a plate-wise batch effect on the t-SNE for plates that contain cells from the same clones (essentially "replicates": I have a number of clones, and cells from each clone were sorted onto two plates). This effect is not present in the t-SNE of the non-normalized counts. So I thought perhaps there are upstream steps I am not accounting for, e.g. a length offset during normalization, as with bulk data through DESeq2.
tximport imports the length information; as the pipelines in the vignette show, that information is then passed to the downstream packages, which include it in the offset of the GLM. The same holds for tximeta and its vignette.
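To make the idea of a length offset concrete, here is a minimal sketch in plain Python (standard library only, so it runs without any Bioconductor objects). It is not the code DESeq2 or edgeR actually run. The function names `length_offsets` and `expected_count` and the toy numbers are made up for illustration. The assumption is that each gene's average effective length is centered on its geometric mean across samples, so that only between-sample differences in length remain in the offset.

```python
import math

def geometric_mean(values):
    """Geometric mean of a list of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def length_offsets(avg_lengths):
    """Turn a genes x samples matrix of average effective lengths into
    log-scale offsets. Each row is centered on its geometric mean, so a
    gene whose length is constant across samples gets an offset of zero."""
    offsets = []
    for row in avg_lengths:
        gm = geometric_mean(row)
        offsets.append([math.log(v / gm) for v in row])
    return offsets

def expected_count(abundance, size_factor, offset):
    """Mean count under a log-link model with an offset:
    log(mu) = log(abundance) + log(size_factor) + offset."""
    return abundance * size_factor * math.exp(offset)

# Hypothetical genes x samples matrix of average effective lengths.
avg_lengths = [
    [1000.0, 1000.0, 1000.0],  # gene A: same length in every sample
    [2000.0, 1000.0, 2000.0],  # gene B: a shorter isoform dominates in sample 2
]
offsets = length_offsets(avg_lengths)

# With equal abundance and size factor, gene B is expected to yield fewer
# reads in sample 2 purely because the expressed isoform is shorter.
mu_sample1 = expected_count(100.0, 1.0, offsets[1][0])
mu_sample2 = expected_count(100.0, 1.0, offsets[1][1])
```

In this toy case gene A gets offsets of (numerically) zero, so the offset has no effect on it, while gene B's expected count in sample 2 is half that in sample 1 even though the abundance is the same.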
I don't think the effective length will be the cause of the issue you observe. Typically the offset doesn't do very much, because most experiments do not have drastic, systematic isoform switching. The offset comes into play when there is in fact dramatic isoform switching, so that gene-level differential expression analysis can still be performed in a robust manner.
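A toy illustration of why the offset only matters under isoform switching, again as a hedged sketch in plain Python. It assumes the gene-level effective length is an abundance-weighted average of the isoform lengths. `gene_effective_length` is a hypothetical helper and the lengths and abundances are invented.

```python
import math

def gene_effective_length(isoform_lengths, isoform_abundances):
    """Abundance-weighted average of isoform lengths for one gene in one sample."""
    total = sum(isoform_abundances)
    weighted = sum(l * a for l, a in zip(isoform_lengths, isoform_abundances))
    return weighted / total

# One gene with a long and a short isoform (hypothetical lengths).
lengths = [3000.0, 1000.0]

# Sample 1: mostly the long isoform.
len_s1 = gene_effective_length(lengths, [80.0, 20.0])    # 2600.0

# Sample 2: expression doubles but the isoform ratio is unchanged,
# so the gene-level length is unchanged as well.
len_s2 = gene_effective_length(lengths, [160.0, 40.0])   # 2600.0

# Sample 3: the isoform ratio flips, so the gene-level length drops.
len_s3 = gene_effective_length(lengths, [20.0, 80.0])    # 1400.0

# Log-scale length difference relative to sample 1.
log_ratio_no_switch = math.log(len_s2 / len_s1)  # zero: the offset changes nothing
log_ratio_switch = math.log(len_s3 / len_s1)     # negative: the offset corrects for length
```

A change in overall expression alone leaves the gene-level length, and hence the offset, untouched. Only a shift in isoform proportions produces a non-zero correction.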
Perfect, thanks!
Thanks ATpoint! The data is full-length, but regardless of this, the bits that confuse me (highlighted questions) are quite general I think.