Association analysis between gene expression and Phenotype
1
1
Entering edit mode
hrishi27n ▴ 20
@hrishi27n-11821
Last seen 2.6 years ago
United States

Hello All,

I have RNA data generated using the smart-seq2 protocol and I am interested in running association analysis (some thing like logistic regression but with limma) between the gene expression values and cohort status. I have quantified the transcripts as TPM using RSEM. I am considering normalizing the data with TMM, estimating mean-variance relationship using voomWithQualityWeights and then using lmFit and eBayes to estimate the association.

I am not interested in seeing difference between cohort but to predict cohort status using gene expression values.

Following is my dummy code:

 design <- model.matrix(~0+Cohort+sex+age, data=PhenoType)   #Cohort:1/0
 v <- voomWithQualityWeights(myNormalized_data, design=design, normalization="none", plot=TRUE)
 fit1 <- lmFit(v,design)
 fit2 <- eBayes(fit1)

All help and comments are highly appreciated.

Thanks.

rna singlecell limma • 1.3k views
ADD COMMENT
0
Entering edit mode

I’m removing the DESeq2 tag here.

ADD REPLY
0
Entering edit mode

You're going to have to be more specific here. Put down some code and scientific context.

ADD REPLY
0
Entering edit mode

Aaron, thank you for responding to my post. I have updated the post to include more info and code. Appreciate all your help and comments.

ADD REPLY
4
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 30 minutes ago
The city by the bay

I'll assume this is single-cell RNA-seq data.

I have quantified the transcripts as TPM using RSEM.

You shouldn't use TPMs as input into voom, as stated here. Get the raw counts instead.

I am considering normalizing the data with TMM,

TMM may not play well with single-cell data if you have too many zeros. After all, M-values are not defined if either of the inputs is a zero. See here for details.

estimating mean-variance relationship using voomWithQualityWeights

The mean-variance relationship of the log-expression values is very wonky for single-cell data, see Figure 8 for an example. As a result, very high and very low coverage libraries have undue influence on the result because their variance is so much lower than the cells with moderate coverage.

and then using lmFit and eBayes to estimate the association.

I don't know what you mean by "estimate the association". Do you mean to detect differentially expressed genes between cohort?

I am not interested in seeing difference between cohort but to predict cohort status using gene expression values.

Well, then, you probably should be using a classifier rather than performing a differential expression analysis!

However, if you do want to identify DE genes, then I would suggest creating "pseudo-bulk" samples, as discussed here. This allows you to plug it into existing DE analysis pipelines without having to worry about single-cell variability. The full statistical reasoning is a bit more subtle, but you can read the linked article for that.

ADD COMMENT
0
Entering edit mode

Aaron, Thank you for your input. This is very helpful.

ADD REPLY

Login before adding your answer.

Traffic: 784 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6