Hi:
I am going to ask this question in Bioconductor support site because I tried several ways to do genes filtering by a measuing correlation coefficient, but still no luck with it. So I have no choice to raise my concern here and hearing expert's opinion and community help. I found some nice thread in Biostar but still some issue exists. Any criticism about my question would be appreciated.
let me get straight to my point. I have Affymetrix gene level expression matrix (genes on the rows and sample ID on the columns), and I have annotation data of microarray experiment observation where sample ID in the rows and description identification on the columns.
> head(expr_mat[1:4, 1:3])
Tarca_001_P1A01 Tarca_003_P1A03 Tarca_004_P1A04
1_at 6.062215 6.125023 5.875502
10_at 3.796484 3.805305 3.450245
100_at 5.849338 6.191562 6.550525
1000_at 3.567779 3.452524 3.316134
> dim(expr_mat)
[1] 120000 832
> head(phenoDat)
SampleID GA Batch Set Train Platform
1 Tarca_001_P1A01 11.0 1 PRB_HTA 1 HTA20
2 Tarca_013_P1B01 15.3 1 PRB_HTA 1 HTA20
3 Tarca_025_P1C01 21.7 1 PRB_HTA 1 HTA20
4 Tarca_037_P1D01 26.7 1 PRB_HTA 1 HTA20
5 Tarca_049_P1E01 31.3 1 PRB_HTA 1 HTA20
6 Tarca_061_P1F01 32.1 1 PRB_HTA 1 HTA20
> dim(phenoDat)
[1] 832 6
I intend to see how the genes in each sample from expr_mat
are correlated with GA
value of corresponding samples in phenoDat
. I learned a few threads from Biostar and using phenoDat@GA
as a covariate, I come up following simple solution to fit the linear model in order to see the correlation. Here is what I tried by using limma user guide:
library(limma)
fit <- limma::lmFit(exprs_mat, design = model.matrix( ~ 0 + t(phenoDat$GA))
fit <- eBayes(fit)
topTable(fit, coef=2)
but I got dimension error because of my way of constructing the design matrix was wrong. I don't know how to build a proper design matrix and find highly correlated genes by using lmFit
from limma
. Any thought?
goal:
I just want to keep the genes from exprs_mat
which has high correlated with phenoDat$GA
by sample-wise. Essentially, I am expecting to get a correlation matrix and to subset original expression matrix with threshold, something like this:
cors <- stats::cor(t(exprs_mat))
idx <- which( (abs(cors) > 0.8) & (upper.tri(cors)), arr.ind=TRUE)
idx <- unique(c(idx[, 1],idx[, 2])
correlated.genes <- exprs_mat[idx, ]
How can I do that? any possible approach to get this done? Thanks