Question: Association analysis between gene expression and Phenotype
1
8 weeks ago by
hrishi27n20
hrishi27n20 wrote:

Hello All,

I have RNA data generated using the smart-seq2 protocol and I am interested in running association analysis (some thing like logistic regression but with limma) between the gene expression values and cohort status. I have quantified the transcripts as TPM using RSEM. I am considering normalizing the data with TMM, estimating mean-variance relationship using voomWithQualityWeights and then using lmFit and eBayes to estimate the association.

I am not interested in seeing difference between cohort but to predict cohort status using gene expression values.

Following is my dummy code:

 design <- model.matrix(~0+Cohort+sex+age, data=PhenoType)   #Cohort:1/0
v <- voomWithQualityWeights(myNormalized_data, design=design, normalization="none", plot=TRUE)
fit1 <- lmFit(v,design)
fit2 <- eBayes(fit1)


All help and comments are highly appreciated.

Thanks.

limma rna singlecell • 170 views
modified 7 weeks ago by Aaron Lun23k • written 8 weeks ago by hrishi27n20

I’m removing the DESeq2 tag here.

You're going to have to be more specific here. Put down some code and scientific context.

Aaron, thank you for responding to my post. I have updated the post to include more info and code. Appreciate all your help and comments.

Answer: Association analysis between gene expression and Phenotype
4
7 weeks ago by
Aaron Lun23k
Cambridge, United Kingdom
Aaron Lun23k wrote:

I'll assume this is single-cell RNA-seq data.

I have quantified the transcripts as TPM using RSEM.

You shouldn't use TPMs as input into voom, as stated here. Get the raw counts instead.

I am considering normalizing the data with TMM,

TMM may not play well with single-cell data if you have too many zeros. After all, M-values are not defined if either of the inputs is a zero. See here for details.

estimating mean-variance relationship using voomWithQualityWeights

The mean-variance relationship of the log-expression values is very wonky for single-cell data, see Figure 8 for an example. As a result, very high and very low coverage libraries have undue influence on the result because their variance is so much lower than the cells with moderate coverage.

and then using lmFit and eBayes to estimate the association.

I don't know what you mean by "estimate the association". Do you mean to detect differentially expressed genes between cohort?

I am not interested in seeing difference between cohort but to predict cohort status using gene expression values.

Well, then, you probably should be using a classifier rather than performing a differential expression analysis!

However, if you do want to identify DE genes, then I would suggest creating "pseudo-bulk" samples, as discussed here. This allows you to plug it into existing DE analysis pipelines without having to worry about single-cell variability. The full statistical reasoning is a bit more subtle, but you can read the linked article for that.