Question: Normalizing to multiple spike-in concentrations with scran
9 weeks ago by
supremerulersuraj0 wrote:

I have a single-cell RNA-seq dataset acquired using a UMI protocol with ERCC spike-ins. I would like to normalize to the spike-ins, so that information about endogenous differences in mRNA content between cells is retained. I was planning to do this in scran with computeSpikeFactors. However, during sample prep, one (known) group of cells received twice as much spike-in as the others. Is it possible to account for this during the normalization step?

Thanks!

modified 5 weeks ago • written 9 weeks ago by supremerulersuraj0
9 weeks ago by
Aaron Lun (Cambridge, United Kingdom) wrote:

Just divide the size factors for the affected cells by 2.

This is most obviously valid when no library quantification is performed, i.e., you did not force each cell to contribute equal amounts of cDNA prior to multiplexing. In this case, twice as much spike-in RNA should result in twice as much spike-in coverage in the affected cells, and thus size factors that are twice as large. Dividing these size factors by two will then bring everything back to the same scale.
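
To make the arithmetic concrete, here is a minimal numeric sketch in plain Python. It is not the scran API: the function names and the toy counts are hypothetical, and it assumes four cells with identical endogenous content and capture efficiency, where the last two cells received 2x spike-in. The size factor is taken to be each cell's total spike-in count scaled to a mean of 1 across cells.

```python
# Toy data (hypothetical): four cells, the last two received 2x spike-in.
spike_total = [100.0, 100.0, 200.0, 200.0]  # total spike-in UMI counts per cell
endog_count = [50.0, 50.0, 50.0, 50.0]      # counts for one endogenous gene
double_spike = [False, False, True, True]   # which cells got 2x spike-in


def spike_size_factors(totals):
    """Size factors proportional to spike-in totals, scaled to a mean of 1."""
    mean_total = sum(totals) / len(totals)
    return [t / mean_total for t in totals]


def correct_for_spike_amount(size_factors, affected, fold=2.0):
    """Divide the size factors of the affected cells by the known fold excess."""
    return [sf / fold if flag else sf for sf, flag in zip(size_factors, affected)]


def normalize(counts, size_factors):
    """Scale each cell's count by its size factor."""
    return [c / sf for c, sf in zip(counts, size_factors)]


raw_sf = spike_size_factors(spike_total)
fixed_sf = correct_for_spike_amount(raw_sf, double_spike)

naive = normalize(endog_count, raw_sf)        # 2x cells look half as expressed
corrected = normalize(endog_count, fixed_sf)  # all cells back on the same scale

print("naive:    ", naive)
print("corrected:", corrected)
```

After the division, the size factors no longer average to 1. Re-centering them afterwards only changes the global scale of the normalized values, not the comparison between cells.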

If you did do library quantification, then the reasoning becomes more complicated, as twice as much spike-in RNA will not lead to size factors that are twice as large (due to composition effects). Nonetheless, division is still valid here as the composition effects affect both the spike-in RNA and the endogenous genes. This means that they cancel out upon normalization; the ultimate effect of having twice as much spike-in RNA would be to halve the normalized expression of the endogenous genes. You can again fix this by dividing the size factors by 2 before normalization.
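
The cancellation can be written out as a short worked derivation. All symbols here are my own simplifying assumptions, not part of the answer above: library quantification fixes the total count per cell at $N$, counts are proportional to molecule numbers, $E$ is the cell's total endogenous mRNA, $S$ is the 1x spike-in amount, $e_g$ is the amount of gene $g$, and the size factor is $c$ times the total spike-in count for some constant $c$.

```latex
\begin{align*}
\text{1x spike-in:}\quad
  & y_{\text{spike}} = N\,\frac{S}{E+S}, \qquad
    y_g = N\,\frac{e_g}{E+S}, \qquad
    \frac{y_g}{c\,y_{\text{spike}}} = \frac{e_g}{c\,S} \\
\text{2x spike-in:}\quad
  & y'_{\text{spike}} = N\,\frac{2S}{E+2S}, \qquad
    y'_g = N\,\frac{e_g}{E+2S}, \qquad
    \frac{y'_g}{c\,y'_{\text{spike}}} = \frac{e_g}{2\,c\,S} \\
\text{2x spike-in, size factor halved:}\quad
  & \frac{y'_g}{c\,y'_{\text{spike}}/2} = \frac{e_g}{c\,S}
\end{align*}
```

Under these assumptions the spike-in coverage ratio is $y'_{\text{spike}}/y_{\text{spike}} = 2(E+S)/(E+2S)$, which is less than 2, so the size factors are not twice as large. The $E+2S$ term nonetheless appears in both numerator and denominator of the normalized value and drops out, leaving exactly a factor of 2 to correct.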

That being said, seeing different amounts of spike-in within a dataset is usually a red flag for me, as it is often symptomatic of other experimental factors differing between these cells and the others (factors that your collaborators may not have told you about). In such cases, you would likely have to do batch correction anyway, e.g., with removeBatchEffect() or, even better, mnnCorrect().

Also, if you are planning to use trendVar() on the log-normalized values, I would strongly advise you to run it separately on the cells with different amounts of spike-in, and then combine the results at the end with combineVar(). This is because the technical mean-variance trend will fundamentally differ between the cells with 1x and 2x spike-in (the latter will have the trend shifted to the right), making it impossible to estimate a sensible trend from a dataset in which they are combined.

5 weeks ago by
supremerulersuraj0 wrote:

Thanks for the explanation. Luckily, we have cells from the same biological group in another batch that received the 1x spike-in. In PCA/t-SNE plots, the cells cluster by biological group rather than by batch, which is promising, but we will definitely do batch correction as you suggested to be sure.