Question

Pearson correlation between gene expression and phenotype

weichengz ▴ 10

@weichengz-23557

Last seen 3.4 years ago

Melbourne, Australia

In addition to finding DEGs, I was hoping to use RNA-seq count data for a correlation analysis (Pearson correlation) between gene expression level and a specific phenotype across samples. To do that, I need to extract the counts (as an indicator of gene expression level) for my genes of interest. I used edgeR and, after creating the raw count matrix, followed these steps:


#Filtering
keep <- filterByExpr(y)

table(keep)

y <- y[keep, , keep.lib.sizes=FALSE]

dim(y)

#Apply TMM (trimmed mean of M-values) normalization to normalise gene expression distributions and eliminate the composition biases between libraries
y <- calcNormFactors(y,method = "TMM")
y$counts

Is the count table in y$counts the right one to use for the correlation analysis?

edgeR RNASeq • 2.2k views

ADD COMMENT • link updated 3.4 years ago by Gordon Smyth 50k • written 3.4 years ago by weichengz ▴ 10

score 0 · Answer 1 · 2020-11-19

Gordon Smyth 50k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

The whole purpose of edgeR is to correlate gene expression with phenotype, so to do so you just follow the usual edgeR pipeline. If the phenotype is numeric, then you simply create a design matrix:

design <- model.matrix(~phenotype)

and find DE genes in the usual edgeR way. The genes that are DE are correlated with the phenotype.

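Your proposal to compute Pearson correlations is an ad hoc way of doing the same thing. If you wanted to do that, you certainly couldn't use the count matrix: y$counts still holds the raw counts, because calcNormFactors only stores normalization factors and does not change the counts themselves. You'd use CPMs (counts per million) instead. See the edgeR User's Guide for how to compute them.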
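For anyone who still wants the ad hoc route, here is a minimal sketch of computing log-CPM values with edgeR and correlating them with a numeric phenotype. The count matrix and the phenotype vector are simulated purely for illustration (they are assumptions, not data from this thread), and the choice of log = TRUE with prior.count = 2 is one reasonable option rather than a prescribed setting.

```r
library(edgeR)

set.seed(1)

# Simulated data for illustration only: 200 genes x 12 samples
ngenes <- 200
nsamples <- 12
counts <- matrix(
  rnbinom(ngenes * nsamples, mu = 100, size = 10),
  nrow = ngenes,
  dimnames = list(paste0("gene", seq_len(ngenes)),
                  paste0("sample", seq_len(nsamples)))
)

# Hypothetical numeric phenotype, e.g. a hormone concentration per sample
phenotype <- rnorm(nsamples, mean = 50, sd = 10)
design <- model.matrix(~phenotype)

# Filter and TMM-normalize as in the question
y <- DGEList(counts = counts)
keep <- filterByExpr(y, design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y, method = "TMM")

# cpm() uses the TMM factors stored in y$samples; y$counts itself stays raw
logcpm <- cpm(y, log = TRUE, prior.count = 2)

# Pearson correlation of each gene's log-CPM profile with the phenotype
r <- cor(t(logcpm), phenotype, method = "pearson")[, 1]
head(sort(r, decreasing = TRUE))
```

The vector r has one correlation per retained gene, named by gene. Unlike the edgeR model fit, these correlations come with no adjustment for other covariates.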

ADD COMMENT • link 3.4 years ago Gordon Smyth 50k

Thanks Gordon, I followed the edgeR pipeline and have already got my DEG list. The DEGs were generated from design <- model.matrix(~conditions), where the condition is a categorical variable (Control vs. Treatment), not numeric. Just to clarify: I am still interested in the correlation between gene expression and a numeric phenotype (say, a particular hormone concentration) in the samples, and this numeric phenotype was not included in model.matrix before. Is getting cpms still the best approach for this correlation analysis? Is cpms the normalised count table?

ADD REPLY • link 3.4 years ago weichengz ▴ 10

My answer is the same today as it was yesterday. Just put phenotype in the design matrix, for example by

design <- model.matrix(~conditions+phenotype)
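As a rough sketch of what that could look like end to end, the example below fits the additive design and tests the phenotype coefficient with the quasi-likelihood pipeline. The data are simulated for illustration only, and the use of glmQLFit/glmQLFTest is an assumption about which edgeR testing route is wanted; the thread itself only specifies the design matrix.

```r
library(edgeR)

set.seed(2)

# Simulated data for illustration only: 200 genes x 12 samples
counts <- matrix(
  rnbinom(200 * 12, mu = 100, size = 10),
  nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("sample", 1:12))
)
conditions <- factor(rep(c("Control", "Treatment"), each = 6))
phenotype <- rnorm(12, mean = 50, sd = 10)  # hypothetical hormone level

# Additive design: condition effect plus a slope for the numeric phenotype
design <- model.matrix(~conditions + phenotype)

y <- DGEList(counts = counts)
keep <- filterByExpr(y, design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)
y <- estimateDisp(y, design)

# Test whether expression changes with phenotype, adjusting for condition
fit <- glmQLFit(y, design)
res <- glmQLFTest(fit, coef = "phenotype")
topTags(res)
```

Here the logFC column is the estimated log2 change in expression per unit increase in phenotype, after accounting for the Control vs. Treatment difference.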

ADD REPLY • link 3.4 years ago Gordon Smyth 50k