In addition to finding DEGs, I was hoping to use RNA-seq count data for a correlation analysis (Pearson correlation) between gene expression level and a specific phenotype across samples. To do that, I need to extract the counts (as an indicator of expression level) for my genes of interest. I used edgeR and, after creating the raw count matrix, followed these steps:
#Filtering
keep <- filterByExpr(y)
table(keep)
y <- y[keep, , keep.lib.sizes=FALSE]
dim(y)
# Apply TMM (trimmed mean of M-values) normalization to normalise gene expression
# distributions and eliminate composition biases between libraries
y <- calcNormFactors(y, method = "TMM")
y$counts
Is the count table from y$counts the right one to use for the correlation analysis?
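For reference, here is a minimal sketch of how normalized expression values could be pulled out and correlated with a numeric phenotype. It assumes the usual edgeR behaviour that calcNormFactors() leaves y$counts untouched and stores the TMM factors in y$samples$norm.factors, so normalized values come from cpm() rather than from y$counts. The count matrix, the phenotype vector and the gene names below are all simulated or hypothetical, not taken from the data in this thread.

```r
library(edgeR)

# Simulated data for illustration only: 200 genes x 8 samples
set.seed(1)
counts <- matrix(rnbinom(200 * 8, mu = 500, size = 10), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("s", 1:8)))
phenotype <- rnorm(8, mean = 10, sd = 2)  # hypothetical hormone concentration

# Same steps as in the post above
y <- DGEList(counts = counts)
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y, method = "TMM")

# y$counts still holds the raw (filtered) counts at this point.
# cpm() uses the effective library sizes (lib.size * norm.factors),
# so its output is TMM-normalized; log = TRUE returns log2-CPM.
logcpm <- cpm(y, log = TRUE)

# Hypothetical genes of interest, restricted to those that passed filtering
genes <- intersect(c("gene1", "gene2"), rownames(logcpm))

# Pearson correlation of each selected gene with the phenotype across samples
r <- apply(logcpm[genes, , drop = FALSE], 1,
           function(x) cor(x, phenotype, method = "pearson"))
print(r)
```

Working on the log2-CPM scale is a choice made for this sketch, since Pearson correlation is sensitive to the skew of raw counts; unlogged CPMs from cpm(y) would be the alternative.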
Thanks Gordon, I followed the edgeR pipeline and have already got my DEG list. The DEGs were generated from
design <- model.matrix(~conditions)
and the condition is a categorical variable (Control vs. Treatment), not numeric. Just to clarify: if I am still interested in the correlation between gene expression and a numeric phenotype (say, a particular hormone concentration) in the samples, and this numeric phenotype was not included in model.matrix before, is getting CPMs still the best approach for this correlation analysis? Is the CPM table the normalised count table?
My answer is the same today as it was yesterday. Just put phenotype in the design matrix, for example by