Question

Association analysis between gene expression and Phenotype

1

Entering edit mode

hrishi27n ▴ 20

@hrishi27n-11821

Last seen 2.9 years ago

United States

Hello All,

I have RNA data generated using the smart-seq2 protocol and I am interested in running association analysis (some thing like logistic regression but with limma) between the gene expression values and cohort status. I have quantified the transcripts as TPM using RSEM. I am considering normalizing the data with TMM, estimating mean-variance relationship using voomWithQualityWeights and then using lmFit and eBayes to estimate the association.

I am not interested in seeing difference between cohort but to predict cohort status using gene expression values.

Following is my dummy code:

 design <- model.matrix(~0+Cohort+sex+age, data=PhenoType)   #Cohort:1/0
 v <- voomWithQualityWeights(myNormalized_data, design=design, normalization="none", plot=TRUE)
 fit1 <- lmFit(v,design)
 fit2 <- eBayes(fit1)

All help and comments are highly appreciated.

Thanks.

rna singlecell limma • 1.5k views

ADD COMMENT • link updated 5.3 years ago by Aaron Lun ★ 28k • written 5.3 years ago by hrishi27n ▴ 20

0

Entering edit mode

I’m removing the DESeq2 tag here.

ADD REPLY • link 5.3 years ago Michael Love 42k

0

Entering edit mode

You're going to have to be more specific here. Put down some code and scientific context.

ADD REPLY • link 5.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

Aaron, thank you for responding to my post. I have updated the post to include more info and code. Appreciate all your help and comments.

ADD REPLY • link 5.3 years ago hrishi27n ▴ 20

score 4 · Accepted Answer · 2019-03-30

I'll assume this is single-cell RNA-seq data.

I have quantified the transcripts as TPM using RSEM.

You shouldn't use TPMs as input into voom, as stated here. Get the raw counts instead.

I am considering normalizing the data with TMM,

TMM may not play well with single-cell data if you have too many zeros. After all, M-values are not defined if either of the inputs is a zero. See here for details.

estimating mean-variance relationship using voomWithQualityWeights

The mean-variance relationship of the log-expression values is very wonky for single-cell data, see Figure 8 for an example. As a result, very high and very low coverage libraries have undue influence on the result because their variance is so much lower than the cells with moderate coverage.

and then using lmFit and eBayes to estimate the association.

I don't know what you mean by "estimate the association". Do you mean to detect differentially expressed genes between cohort?

I am not interested in seeing difference between cohort but to predict cohort status using gene expression values.

Well, then, you probably should be using a classifier rather than performing a differential expression analysis!

However, if you do want to identify DE genes, then I would suggest creating "pseudo-bulk" samples, as discussed here. This allows you to plug it into existing DE analysis pipelines without having to worry about single-cell variability. The full statistical reasoning is a bit more subtle, but you can read the linked article for that.