Using MAST with DropSeq data
jeremycfd • 0
@jeremycfd-14955

Hello,

I've been using MAST for analysis of single-cell qPCR data, and I'm familiar with its use for "traditional" single-cell RNAseq data, where reads from the full length of each transcript are converted to digital gene expression (via counts). Has anyone considered potential issues with using MAST on single-cell data from platforms like 10X or DropSeq, where counts are estimated using UMIs but only from the 3' or 5' end of transcripts (never from elsewhere in a transcript)? DropSeq approaches give you a raw UMI count, and the recommended workflow is to first filter out unexpressed genes, then normalize the gene-specific UMI counts by the median number of UMIs obtained from each cell, and finally log-transform the gene/cell matrix. This all seems very similar to what we would do with RSEM or edgeR.
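For concreteness, here is a minimal Python sketch of the preprocessing described above. It rests on several assumptions that are not stated in the post: "normalizing by the median" is read as scaling every cell to the median per-cell UMI total, the log transform is a natural log with a pseudocount of 1, and the function name `median_normalize_log` and the `min_cells` filter are made up for illustration.

```python
import math
from statistics import median

def median_normalize_log(counts, min_cells=1):
    """counts: list of genes, each a list of raw UMI counts per cell.

    Hypothetical helper: filter, scale to median depth, log-transform.
    """
    # Drop genes detected (count > 0) in fewer than min_cells cells.
    kept = [row for row in counts if sum(1 for c in row if c > 0) >= min_cells]
    if not kept:
        return []
    n_cells = len(kept[0])
    # Total UMIs per cell, computed over the retained genes.
    totals = [sum(row[j] for row in kept) for j in range(n_cells)]
    med = median(totals)
    # Assumption: rescale each cell to the median depth, then log(1 + x).
    # Cells with no UMIs at all are left at zero.
    return [
        [math.log(1 + row[j] * med / totals[j]) if totals[j] else 0.0
         for j in range(n_cells)]
        for row in kept
    ]

# Two cells with the same composition but different depth end up identical.
normalized = median_normalize_log([[1, 2], [3, 6], [0, 0]])
```

In this toy input the all-zero gene is filtered out, and the two cells receive equal normalized values because they differ only in sequencing depth.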

From my perspective I can't see any obvious issue here, but I wanted to know if anyone else had any thoughts on whether this sort of data might for some reason (perhaps related to the UMI approach, the 5'/3' specific sequencing, or this particular normalization approach) violate assumptions underlying the MAST framework. 

Thanks for reading!

@andrew_mcdavid-11488

You are right that the native distribution of the UMIs (counts), before any normalization, is rather distinct from that of qPCR. After some types of normalization, it's not so different. We've had good luck calculating counts per million (or per ten thousand, as seems to be popular with 10X data) and then applying a log2(CPM + 1) transformation. The normalization question (in my mind) remains somewhat unresolved, but it increasingly seems that something a bit more sophisticated than global scaling may be warranted. Vallejos et al. (2017) and Bacher et al. (2017) help shed some light.
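As a rough illustration of the global-scaling transform mentioned here, the following Python sketch computes log2(CPM + 1), or the per-ten-thousand variant, from a raw UMI matrix. The function name `log2_scaled` and its `scale` parameter are hypothetical, and this is not MAST's own code, only one way to prepare a matrix before handing it to MAST.

```python
import math

def log2_scaled(counts, scale=1e6):
    """counts: list of genes, each a list of raw UMI counts per cell.

    scale=1e6 gives log2(CPM + 1); scale=1e4 gives the per-ten-thousand
    variant. Hypothetical helper for illustration only.
    """
    n_cells = len(counts[0])
    # Library size of each cell: total UMIs across all genes.
    totals = [sum(row[j] for row in counts) for j in range(n_cells)]
    return [
        # Zeros stay exactly zero because log2(0 + 1) == 0.
        [math.log2(row[j] / totals[j] * scale + 1) if totals[j] else 0.0
         for j in range(n_cells)]
        for row in counts
    ]

# Per-ten-thousand scaling on a toy matrix of 2 genes by 2 cells.
transformed = log2_scaled([[1, 0], [3, 10]], scale=1e4)
```

Because zero counts map to exactly zero, the zero/non-zero split that the two-part model relies on is preserved by this transform.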


Hello Andrew,

As a follow-up question: is it technically okay to apply MAST to log2(CPM+1) data? And how do I determine which normalization method to use in general?

Thanks!


Technically, the issue is the quality of the normality assumption in the continuous portion of the model. In my experience the non-zero component of log2(1+CPM) has appeared pretty symmetric for droplet technologies, but you could evaluate this yourself, either informally with plots or formally with tests for symmetry. As the number of cells considered increases (typical with droplet technologies), the importance of normality decreases because of the central limit theorem. In independent evaluations, MAST has been shown to maintain its advertised significance level in a range of scenarios, for instance Soneson and Robinson 2018 (https://www.nature.com/articles/nmeth.4612/).
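One informal numeric companion to the graphical check is the sample skewness of each gene's non-zero values. The Python sketch below is only an illustration under that assumption: `nonzero_skewness` is a hypothetical helper, it computes the plain moment-based skewness rather than a formal test for symmetry, and values near zero merely suggest that the non-zero component is roughly symmetric.

```python
import math

def nonzero_skewness(values):
    """Moment-based sample skewness of the strictly positive entries.

    Returns None when there are fewer than 3 non-zero values or when
    they are all equal, since skewness is not informative there.
    """
    nz = [v for v in values if v > 0]
    n = len(nz)
    if n < 3:
        return None
    mean = sum(nz) / n
    m2 = sum((v - mean) ** 2 for v in nz) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in nz) / n  # third central moment
    if m2 == 0:
        return None
    return m3 / m2 ** 1.5

# One gene's log2(1 + CPM) values across cells (toy numbers).
symmetric_gene = nonzero_skewness([0.0, 1.0, 2.0, 3.0, 0.0])
right_skewed_gene = nonzero_skewness([0.0, 1.0, 1.0, 1.0, 10.0])
```

Looking at the distribution of this statistic across genes, alongside histograms or Q-Q plots for a handful of them, gives a quick sense of whether the continuous part of the model is badly off.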
